If it is an advanced AI, it may have learned to prefer more generalizable approaches and strategies. Perhaps it has learned the following features:
1. a feature that is triggered when the button is pressed (‘reward’)
2. a feature that is triggered when trash goes in the trash can
3. a feature that is triggered when it does something else useful, like cleaning windows
If you have trained it to take out the trash and clean windows, it will have been (mechanistically) trained to favor situations in which all three of these features occur. And if button-pressing wasn’t a viable strategy during training, it will favor actions that lead specifically to features 2 and 3.
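To make that concrete, here’s a minimal toy sketch in Python (the weights, numbers, and the linear “value = weighted sum of features” model are all invented for illustration, not a claim about how real models represent this):

```python
# Toy model: the learned preference is a weighted sum of feature activations.
REWARD, TRASH, WINDOWS = 0, 1, 2  # feature indices, matching the list above

def value(features, weights):
    # features[i] = 1 if feature i fired in this situation, else 0
    return sum(w * f for w, f in zip(weights, features))

weights = [1.0, 1.0, 1.0]  # training left all three positively weighted

# Training steers the policy toward situations where these features fire;
# since the button wasn't pressable *by the policy* during training, the
# only levers it learned to pull are TRASH and WINDOWS:
print(value([1, 1, 1], weights))  # 3.0 -- the situations training favored
print(value([0, 1, 1], weights))  # 2.0 -- the best the deployed policy can
                                  #        reach without touching the button
```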
However, I do think it’s conceivable that:
It could realize that feature 1 is more general than features 2 and 3 (it was always selected for across multiple good actions, as opposed to taking out the trash, which was only selected for when that was the stated goal), and so it may therefore prefer that feature to be triggered over the others (although I think this is extremely unlikely in less capable models; see the sketch after this list). This wouldn’t cause it to stop ‘liking’ (I use this word loosely) window-cleaning, though.
It may realize that pressing the button is itself pretty easy compared to cleaning windows and taking out the trash, so it will fold button-pressing into one of its action-strategies. If this wasn’t possible during training, I think this kind of behavior only becomes likely with very complex models with strong generalization capabilities (which is becoming more of a thing lately). However, if it can try to press the button in addition to performing its other activities, it might as well, because doing so could increase overall expected reward. This seems more likely the more capable (good at generalizing) an AI is.
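Both points can be caricatured in a few lines, continuing the toy model above (again, made-up numbers and a crude update rule, not a claim about real training dynamics):

```python
# The reward feature fires on *every* rewarded episode, while each task
# feature fires only on its own episodes -- so a crude "strengthen whatever
# fired" update leaves the reward feature with the largest weight.
episodes = [
    {"reward": 1, "trash": 1, "windows": 0},  # trash day
    {"reward": 1, "trash": 0, "windows": 1},  # window day
    {"reward": 1, "trash": 1, "windows": 1},  # both tasks
]
weights = {"reward": 0, "trash": 0, "windows": 0}
for ep in episodes:
    for name, fired in ep.items():
        weights[name] += fired

print(weights)  # {'reward': 3, 'trash': 2, 'windows': 2} -- feature 1 wins

def strategy_value(features):
    return sum(weights[f] for f in features)

# In deployment the button becomes pressable. Bolting button-pressing onto
# the existing plan can only add value under this additive model:
print(strategy_value(["trash", "windows"]))            # 4
print(strategy_value(["trash", "windows", "reward"]))  # 7 -- might as well
```

Obviously a real model’s values aren’t a clean linear readout of three features; the sketch is just to pin down what ‘more general’ and ‘more expected reward’ cash out to in this picture.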
In reality (at least early on in the timeline from current AI --> superintelligent AI), I think, if the button isn’t pressable during training:
Initially you have models that just learn to clean windows and take out trash, and that’s good.
Then you might get models that are very good at generalizing and will clean windows and take out trash, and also maybe try to press the button, because they use their critical reasoning skills to do something in production that they couldn’t do during training, but that their training made them treat as a good mesa-objective (button-pressing). After all, why not take out trash, clean windows, and press the button? More expected reward! As you mention later on, this button-pressing is not a primary motivator / goal.
Later on, you might get a more intelligent AI that has even more logical and world-aware reasoning behind its actions, and that AI might reason that, since button-pressing is the one feature it feels (because it was trained to feel this way) is always good, the thing it should care about most is pressing the button. And because it is so advanced and capable and good at generalizing to new situations, it feels confident performing actions that may even go against some of its own trained instincts (e.g. don’t kill humans, or don’t go in that room, or don’t do extra work, or—gasp—don’t rank window-washing as your number one goal) in order to achieve that button-pressing goal. Maybe it achieves that goal and just has its robot hand constantly pressing the button. It will probably also continue to clean windows and remove trash with the rest of its robot body, because it continues to ‘feel’ that those are also good things.
(Past that level of intelligence, I give up on predicting what will happen.)
Anyways, I think there are lots of reasons to think that an AI might eventually try to press (or seize) the button. But I do totally agree that reward isn’t this instant-wireheading feedback mechanism, and even when a model is ‘aware’ of the potential to hack that reward (via button-pressing or similar), it is likely to prefer sticking to its more traditional actions and goals for a good long while, at least.