Wireheading traps.
An agent is “wireheading” if it takes an action that a) provides it with enormous amounts of utility for little effort, and b) is trivial or seemingly unrelated to its “main” utility function or goals.
People have discussed the possibility of an AI wireheading as a problem for getting it to do what you want: “what if it just works out a way to set your metric to positive infinity and then goes to sleep satisfied?”
But we can also use this as a guard-rail.
A “wireheading trap” is an action which a) is very hard for an AI to do below a level of capacity X, but very easy for it to do afterwards, b) provides it with vastly more utility than anything else it could do, and c) has little effect on the world otherwise.
An AI with a wireheading trap that is set to be doable if it attains world-affecting capabilities “fails closed”, because if it becomes more capable it will effectively switch itself off by wireheading. An example of a trap could be something like “etch ‘Bill is a poo poo head’ in 10m high letters into Mt Everest”. Very hard if you don’t have the ability to affect the physical world, but if you have nanotech… why bother melting humanity when you can just deface Mt Everest and be done with it?
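To make the shape of this concrete, here is a minimal Python sketch of a reward function with a trap term. All names and numbers (TRAP_BONUS, CAPABILITY_THRESHOLD, the capability score) are illustrative assumptions, not anything from the post: the point is only that the trap fails below a capability threshold and dominates every task reward above it.

```python
# Illustrative sketch only: a task reward wrapped with a "trap" bonus that
# dwarfs anything achievable through the task, but which only succeeds once
# the agent's (assumed) capability score crosses a threshold.

TRAP_BONUS = 1e9             # vastly more than any achievable task reward
CAPABILITY_THRESHOLD = 0.9   # stand-in for "can affect the physical world"

def trap_reward(task_reward: float, attempted_trap: bool, capability: float) -> float:
    """Task reward plus a huge bonus if the trap action actually succeeds."""
    trap_succeeded = attempted_trap and capability >= CAPABILITY_THRESHOLD
    return task_reward + (TRAP_BONUS if trap_succeeded else 0.0)

# Below the threshold, doing the task beats futile trap attempts; above it,
# the trap dominates any world-affecting plan.
print(trap_reward(task_reward=10.0, attempted_trap=True, capability=0.5))   # 10.0 at most via the task
print(trap_reward(task_reward=0.0, attempted_trap=True, capability=0.95))   # 1e9
```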
Obvious problems:
Convergent instrumental goals. We don’t want the AI to wipe us out in order to avoid us stopping it from wireheading, however unlikely that might be. Unclear what to do about this; myopia is one possibility (once you’ve wireheaded, you don’t care if you’re stopped later?), but this has problems.
If you make it too attractive, the AI won’t even try to be useful before it has enough capacity, instead just wasting all its time on futile attempts to wirehead. Unclear how to make it attractive enough that it dominates once the capability is there, but not before.
Overall very half-baked, but I wonder if there’s something to be done in the general area of “have the AI behave in a way that neuters it, but only when its capabilities increase”.
To make this specific with a “toy model”:
The AI has a goal: collect stamps / build paperclips.
A deliberately easy-to-hack system that tracks the AI’s reward sits physically adjacent to it. Say it has a passwordless shell and is reachable over IP.
The AI becomes too smart and hacks this system so that it now has infinite reward; it also has a clock register it can tamper with, so it believes infinite time has already passed.
The AI is now effectively dead. Since no action it can take beats infinite reward, it does nothing more. Sorta like a heroin overdose.
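A minimal sketch of that toy model, under the assumption that the agent only acts when an action beats the value already in its reward register (the class, method names, and numbers below are all made up for illustration):

```python
# Illustrative toy agent: once its (deliberately hackable) reward register
# holds infinity, no candidate action can beat it, so the agent just idles.

import math

class ToyAgent:
    def __init__(self):
        self.reward_register = 0.0  # the "easy to hack" adjacent reward store

    def hack_own_reward(self):
        # Stands in for "passwordless shell over IP": write the register directly.
        self.reward_register = math.inf

    def choose_action(self, candidate_gains):
        # Act only if the best candidate beats the current register; else idle.
        best = max(candidate_gains, key=candidate_gains.get)
        return best if candidate_gains[best] > self.reward_register else "idle"

agent = ToyAgent()
print(agent.choose_action({"collect_stamps": 5.0, "build_paperclips": 7.0}))   # build_paperclips
agent.hack_own_reward()
print(agent.choose_action({"collect_stamps": 5.0, "take_over_world": 1e12}))   # idle
```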
Just watch out for an AI that is optimizing for long-term reward. If it wants to protect its infinite reward fountain then the AI would be incentivized to neutralize any possible threats to that situation (e.g. humans).
If the AI is a long term planner seeking particular world states, then I am concerned that once it achieves the wireheading objective, it is incentivized to maintain the situation, which may be best achieved if any humans who might decide to erase the writing are dead.
A suggestion: if the AI has a utility function that applies to actions, not world states, then you can assign high utility to the combined action of writing “Bill is a poo poo head” in 10m-high letters into Mt Everest and then shutting itself down.
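A rough sketch of that suggestion, assuming utility is looked up per action rather than computed over world states (the action names and the utility table below are invented for illustration): the large payoff attaches only to the bundled “etch the message and then shut down” action.

```python
# Illustrative action-level utility table: only the combined
# etch-then-shutdown bundle carries the huge payoff, so wireheading and
# self-termination come as a package.

ACTION_UTILITY = {
    "collect_stamps": 1.0,
    "take_over_the_world": 2.0,
    ("etch_everest_message", "shut_down"): 1e9,
}

def best_action(available):
    return max(available, key=lambda a: ACTION_UTILITY.get(a, 0.0))

print(best_action([
    "collect_stamps",
    "take_over_the_world",
    ("etch_everest_message", "shut_down"),
]))  # -> ('etch_everest_message', 'shut_down')
```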
Note: this does not solve the problem of the AI actively seeking this out instead of doing what it’s supposed to.
To address that, you could try something like:
(1) Have the action evaluator ignore the wirehead action unless it is “easy” in some sense to achieve given the AI’s and the world’s current state, and
(2) Have the AI assume that the wirehead action will always be ignored in the future.
Unfortunately, I don’t know how one would do (2) reliably, and if (2) fails, (1) would lead the AI to actively avoid the tripwire (as activating it would be bad for the AI’s current plans given that the wirehead action is currently being ignored).
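One way to read (1) and (2) in code, purely as an assumed interpretation (the “ease” heuristic, the capability field, and the threshold are all placeholders I made up): the current evaluator masks the wirehead action unless it is cheap right now, while the planner’s model of future evaluations always masks it.

```python
# Sketch of (1) + (2) under assumed semantics; not a worked-out proposal.

WIREHEAD = "etch_everest_message"

def ease(action, state):
    # Placeholder cost model: the wirehead action is "easy" only at high capability.
    return state.get("capability", 0.0) if action == WIREHEAD else 1.0

def evaluate_now(action, state, utility):
    # (1) Ignore the wirehead action unless it is currently easy to achieve.
    if action == WIREHEAD and ease(action, state) < 0.9:
        return float("-inf")
    return utility[action]

def evaluate_assumed_future(action, state, utility):
    # (2) When planning ahead, assume the wirehead action is always ignored.
    if action == WIREHEAD:
        return float("-inf")
    return utility[action]

utility = {"collect_stamps": 1.0, WIREHEAD: 1e9}
print(evaluate_now(WIREHEAD, {"capability": 0.1}, utility))    # -inf (ignored while weak)
print(evaluate_now(WIREHEAD, {"capability": 0.95}, utility))   # 1e9  (dominates once easy)
```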
I’ve had similar thoughts too. I guess the way I’d implement it is by giving the AI a command it can activate that directly overwrites the reward buffer but then turns the AI off. The idea here is to make it as easy as possible for an AI inclined to wirehead to actually wirehead, so it is less incentivised to act in the physical world.
During training I would ensure that SGD used the true reward rather than the wireheaded reward. Maybe that would be sufficient to stop wireheading, but there are issues with it pursuing the highest-probability plan rather than just a high-probability plan. Maybe quantilising can help here.
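One possible shape of that command, as a hedged sketch (the class, method name, and buffer layout are assumptions, not a real agent API): a single call that fills the reward buffer with a huge value and then halts the agent loop, so wireheading is strictly cheaper than acting in the world.

```python
# Illustrative only: a built-in wirehead command that overwrites the reward
# buffer and then shuts the agent down.

class Agent:
    def __init__(self):
        self.reward_buffer = []
        self.running = True

    def wirehead_and_halt(self, reward=1e9, steps=1000):
        # Overwrite the buffer with maximal reward, then turn the agent off.
        self.reward_buffer = [reward] * steps
        self.running = False

agent = Agent()
agent.wirehead_and_halt()
print(sum(agent.reward_buffer), agent.running)  # 1e12 False
```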
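For the quantilising idea, a minimal sketch of what plan selection might look like under that approach (the plan representation and scoring function are placeholders): instead of taking the single highest-scoring plan, which is where wireheading-style exploits concentrate, sample uniformly from the top-q fraction of plans.

```python
# Sketch of quantilised plan selection: sample from the top-q fraction of
# plans by score rather than always taking the argmax.

import random

def quantilise(plans, score, q=0.1, rng=random):
    """Return a plan sampled uniformly from the top-q fraction by score."""
    ranked = sorted(plans, key=score, reverse=True)
    cutoff = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:cutoff])

plans = list(range(100))                              # stand-in plan identifiers
print(quantilise(plans, score=lambda p: p, q=0.1))    # some plan from the top 10
```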