If the AI is a long-term planner seeking particular world states, then I am concerned that once it achieves the wireheading objective, it is incentivized to maintain that state of the world, which may be best achieved if any humans who might decide to erase the writing are dead.
A suggestion: if the AI has a utility function that applies to actions rather than world states, then you can assign high utility to the combined action of writing “Bill is a poo poo head” in 10m-high letters into Mt Everest and then shutting itself down.
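To make the idea concrete, here is a minimal sketch in Python, assuming an agent that scores candidate actions directly rather than the world states they lead to; `WIREHEAD_ACTION`, `task_utility`, and the specific numbers are hypothetical stand-ins, not anything from a real system.

```python
# The "tripwire" is a single combined action: carve the message, then shut down.
WIREHEAD_ACTION = ("write_message_on_everest", "shut_down")

def action_utility(action, task_utility):
    """Utility assigned to the action itself, not to the world state it produces."""
    if action == WIREHEAD_ACTION:
        return 1e9               # very high utility for the combined tripwire action
    return task_utility(action)  # otherwise, ordinary task utility
```

Because the shutdown is part of the single action that earns the utility, the agent gains nothing by surviving afterwards to defend the writing.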
Note: this does not solve the problem of the AI actively seeking out the wirehead action instead of doing what it’s supposed to do.
To address that, you could try something like the following (a rough code sketch follows below):
1. Have the action evaluator ignore the wirehead action unless it is “easy” in some sense to achieve given the AI’s and the world’s current state, and
2. Have the AI assume that the wirehead action will always be ignored in the future.
Unfortunately, I don’t know how one would do (2) reliably, and if (2) fails, (1) would lead the AI to actively avoid the tripwire (as activating it would be bad for the AI’s current plans given that the wirehead action is currently being ignored).
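Here is a rough sketch of how (1) and (2) might fit together, reusing the hypothetical names from the sketch above; `is_easy` is a stand-in predicate on the current state, and making assumption (2) actually hold is exactly the open problem just flagged.

```python
def evaluate_actions(candidates, state, task_utility, is_easy, planning=False):
    """Pick the best candidate action, ignoring the wirehead action when required.

    planning=True models assumption (2): when the AI simulates future time steps,
    it assumes the wirehead action will always be ignored, so that action is
    dropped from consideration no matter how easy it would be then.
    """
    scored = []
    for action in candidates:
        if action == WIREHEAD_ACTION:
            if planning:                    # (2): never considered during forward planning
                continue
            if not is_easy(action, state):  # (1): ignored now unless currently "easy"
                continue
        scored.append((action_utility(action, task_utility), action))
    return max(scored, key=lambda pair: pair[0])[1] if scored else None
```

The failure mode described above shows up directly in this sketch: if the planning=True assumption is wrong, the agent’s plans are built as if the tripwire never fires, so a state in which `is_easy` becomes true looks like a disruption to those plans and something to steer away from.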