We now understand fairly well that in gradient descent-based computer vision systems, the model that is eventually learned about the environment generalizes quite well to new examples that have never been seen (see here, and response here).
In other words, the “over-fitting” problem tends to be avoided even in highly over-parameterized networks. This seems to be because:
Models that memorize specific examples tend to be more complex than models that generalize.
Gradient descent-based optimizers have to do more work (they have to move a point in parameter space across a longer distance) to learn a complex model with more curvature.
Even though neural nets are capable of learning functions of nearly arbitrary complexity, they will first try to fit a relatively smooth function, something close to the simplest hypothesis that fits the data (a toy sketch of this is included below).
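For concreteness, here is a minimal sketch of the kind of toy experiment behind these points. It assumes PyTorch, and the data, architecture, and hyperparameters are arbitrary choices of mine rather than anything from the linked papers: a network with vastly more parameters than data points, trained by plain gradient descent, still ends up close to the smooth target function rather than a noise-memorizing one.

```python
# Toy illustration (assumes PyTorch): an over-parameterized MLP trained by
# gradient descent on noisy 1-D data tends to fit a smooth function that
# still generalizes, rather than immediately memorizing the noise.
import torch
import torch.nn as nn

torch.manual_seed(0)

# 30 noisy training points from a smooth target, far fewer than the
# network's parameter count (heavily over-parameterized regime).
x_train = torch.linspace(-1, 1, 30).unsqueeze(1)
y_train = torch.sin(3 * x_train) + 0.1 * torch.randn_like(x_train)
x_test = torch.linspace(-1, 1, 200).unsqueeze(1)
y_test = torch.sin(3 * x_test)

model = nn.Sequential(
    nn.Linear(1, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 1),
)  # roughly 260k parameters for 30 data points

opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

for step in range(2001):
    opt.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            test_loss = loss_fn(model(x_test), y_test)
        # Train loss approaches the noise floor while test loss stays low:
        # the learned function is close to the smooth target, not a lookup table.
        print(f"step {step:5d}  train {loss.item():.4f}  test {test_loss.item():.4f}")
```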
So it seems that agents that implement neural network-based vision systems may be disincentivized from explicitly learning loopholes in object-level concepts, since this would require them to build more complex models of the environment. The danger then arises when the agent is capable of explicitly computing its own reward function. If it knows that its reward function is based on fungible concepts, it could be incentivized to alter those concepts, even if doing so came at a cost, provided it were certain that doing so would result in a great enough reward.
So what if it didn’t have certainty about its reward function? Then it might need to model the reward function using similar techniques, perhaps also with a gradient-based optimizer. That might be an acceptable solution insofar as it is also incentivized to learn the simplest model that fits what it has observed about its reward function. Actions that are complex, or that come very close to actions with large cost, would then become very unsafe bets due to that uncertainty. Actions that are close together in function space will be assigned similar cost values if the function that maps actions to rewards is fairly smooth.
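To make the "unsafe bets" intuition concrete, here is a toy sketch (my own construction, assuming NumPy; the functions and numbers are placeholders, not anything from the post): the uncertain reward function is modeled by an ensemble of smooth fits, and candidate actions far from the observed data get penalized because the ensemble members disagree there.

```python
# Toy sketch: model an uncertain reward function with an ensemble of smooth
# estimators and treat actions as unsafe bets when the ensemble is uncertain.
import numpy as np

rng = np.random.default_rng(0)

def true_reward(a):
    # Reward over a 1-D "action space", unknown to the agent:
    # smooth, with a sharp cost cliff past a = 0.8.
    return np.where(a < 0.8, np.sin(2 * a), -5.0)

# A handful of observed (action, reward) pairs, none past the cliff.
actions_seen = rng.uniform(-1.0, 0.7, size=20)
rewards_seen = true_reward(actions_seen)

# "Ensemble" of smooth models: low-degree polynomial fits on bootstrap samples.
def fit_member(a, r):
    idx = rng.integers(0, len(a), len(a))
    return np.polyfit(a[idx], r[idx], deg=3)

ensemble = [fit_member(actions_seen, rewards_seen) for _ in range(10)]

def score(candidate, risk_aversion=2.0):
    preds = np.array([np.polyval(c, candidate) for c in ensemble])
    # Pessimistic score: mean prediction minus a penalty for disagreement.
    return preds.mean() - risk_aversion * preds.std()

for a in [0.2, 0.6, 0.9, 1.2]:
    print(f"action {a:4.1f}: pessimistic score {score(a):6.2f}")
# Actions far outside the observed region (0.9, 1.2) score poorly because the
# ensemble members extrapolate differently there, even though no member has
# ever seen the cliff itself.
```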
Of course I don’t know if any of the above remains true in the regime where model capacity and optimization power are no longer an issue.
So it seems that agents that implement neural network-based vision systems may be disincentivized from explicitly learning loopholes in object-level concepts, since this would require them to build more complex models of the environment.
This argument works well if the undesired models are also the complex ones (such as don’t-break-vase vs don’t-break-white-vase). On the other hand, it fails terribly if the reverse is the case.
For example, if humans are manually pressing the reward button, then the hypothesis that reward comes from humans pressing the button (which is the undesirable hypothesis—it leads to wireheading) will often be the simplest one that fits the data.
The danger then arises when the agent is capable of explicitly computing its own reward function. If it knows that its reward function is based on fungible concepts, it could be incentivized to alter those concepts, even if doing so came at a cost, provided it were certain that doing so would result in a great enough reward.
This kind of confuses two levels. If the AI has high confidence that it is being rewarded for increasing human happiness, then when it considers modifying its own concept of human happiness to something easier to satisfy, it will ask itself the question “will such a self-modification improve human happiness?” using its current concept of human happiness.
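A crude way to pin down the level distinction (a toy sketch with entirely hypothetical function names and numbers, not anyone's proposed design): the proposed replacement concept is scored under the agent's current concept, so it does not look like an improvement from the inside.

```python
# Toy sketch of the two-levels point: a proposed change to the agent's concept
# of "human happiness" is evaluated under the agent's *current* concept, so an
# easier-to-satisfy replacement does not come out ahead.

def current_happiness(world):
    # The agent's current concept: the happiness it believes it is rewarded for.
    return world["actual_wellbeing"]

def easier_happiness(world):
    # A candidate replacement concept that is trivially easy to satisfy.
    return 1_000_000.0

def predicted_world_after_adopting(concept):
    # Crude prediction: optimizing a vacuous concept does nothing for actual
    # wellbeing, so the predicted world is unchanged (or slightly worse).
    return {"actual_wellbeing": 10.0 if concept is current_happiness else 9.0}

def should_self_modify(candidate_concept):
    # Key point: both futures are scored with current_happiness,
    # not with the candidate concept.
    keep = current_happiness(predicted_world_after_adopting(current_happiness))
    switch = current_happiness(predicted_world_after_adopting(candidate_concept))
    return switch > keep

print(should_self_modify(easier_happiness))  # False
```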
That’s not to say that I’m not worried about such problems; there can easily be bad designs which cause the AI to make this sort of mistake.
I agree with you; this is an old post that I don’t really agree with any more.
I also would have expected you to agree w/ my above comment when you originally wrote the post; I just happened to see tristanm’s old comment and replied.
However, now I’m interested in hearing about what ideas from this post you don’t endorse!