So it seems that agents that implement neural-network-based vision systems may be disincentivized from explicitly learning loopholes in object-level concepts, since doing so would require them to build more complex models of the environment.
This argument works well if the undesired models are also the more complex ones (e.g., the intended don’t-break-vase vs. the loophole-exploiting don’t-break-white-vase). On the other hand, it fails badly if the reverse is the case.
For example, if humans are manually pressing the reward button, then the hypothesis that reward comes from humans pressing the button (which is the undesirable hypothesis—it leads to wireheading) will often be the simplest one that fits the data.
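To make the simplicity-prior worry concrete, here is a minimal toy sketch (not from the original post; the MDL-style scoring rule, the hypothesis set, and the description lengths are all invented for illustration). Both hypotheses fit the supervised data perfectly, and the shorter, button-based one wins:

```python
import math

# Toy sketch: score hypotheses by data fit minus a description-length penalty
# (a crude MDL-style simplicity prior). All names and numbers are illustrative.

# Each observed episode: (task_completed, button_pressed, reward_received).
# While humans supervise, the button is pressed exactly when the task is done,
# so the two hypotheses below are observationally identical on this data.
data = [(True, True, 1.0), (False, False, 0.0), (True, True, 1.0)]

hypotheses = {
    # name: (predicted reward function, assumed description length in bits)
    "reward = button pressed": (lambda task, button: 1.0 if button else 0.0, 10),
    "reward = task done, judged by object-level concepts": (lambda task, button: 1.0 if task else 0.0, 25),
}

def score(predict, length_bits, observations):
    """Log-likelihood of the observations minus a complexity penalty (higher is better)."""
    fit = sum(0.0 if predict(t, b) == r else -math.inf for t, b, r in observations)
    return fit - length_bits

best = max(hypotheses, key=lambda name: score(*hypotheses[name], data))
print(best)  # -> "reward = button pressed", i.e. the wireheading hypothesis
```

The point is just that when the intended concept is the more expensive one to describe, a simplicity-biased learner is pulled toward the button model, not away from it.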
The danger arises when the agent is capable of explicitly computing its own reward function. If it knows that its reward function is based on fungible concepts, it could be incentivized to alter those concepts, even at some cost, provided it were sufficiently confident that doing so would yield a large enough reward.
This seems to conflate two levels. If the AI has high confidence that it is being rewarded for increasing human happiness, then when it considers modifying its own concept of human happiness into something easier to satisfy, it will ask itself the question “will such a self-modification improve human happiness?” using its current concept of human happiness.
That’s not to say that I’m not worried about such problems; there can easily be bad designs which cause the AI to make this sort of mistake.
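A minimal sketch of the two-levels point, and of the bad-design failure mode just mentioned (the world model, the numbers, and every function name below are made-up assumptions, not anything from the post or this thread):

```python
# Toy sketch: a proposed self-modification is evaluated with the agent's
# *current* concept of human happiness, so the easier-to-satisfy concept loses.
# All world states and numbers below are invented for illustration.

def current_happiness(world):
    """The agent's current (hypothetical) concept of human happiness."""
    return world["humans_happy"]

def easier_concept(world):
    """A cheaper-to-satisfy stand-in, e.g. 'internal reward counter is high'."""
    return world["reward_counter"]

def predicted_outcome(modification):
    """Illustrative predicted world states for each candidate self-modification."""
    if modification == "keep current concept":
        return {"humans_happy": 10, "reward_counter": 10}
    # "adopt easier concept": the agent would then pump the counter instead.
    return {"humans_happy": 2, "reward_counter": 100}

candidates = ["keep current concept", "adopt easier concept"]

# Well-designed evaluation: every candidate is judged by the *current* concept.
best = max(candidates, key=lambda m: current_happiness(predicted_outcome(m)))
print(best)  # -> "keep current concept"

# A badly designed agent that judged each candidate by the concept it would
# hold *afterwards* picks the easy concept and wireheads:
buggy = max(
    candidates,
    key=lambda m: (easier_concept if m == "adopt easier concept" else current_happiness)(
        predicted_outcome(m)
    ),
)
print(buggy)  # -> "adopt easier concept"
```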
I agree with you; this is an old post that I don’t really agree with any more.
I also would have expected you to agree w/ my above comment when you originally wrote the post; I just happened to see tristanm’s old comment and replied.
However, now I’m interested in hearing about what ideas from this post you don’t endorse!