Reading this, I had an idea about using reward hacking to self-limit an AI's power. Maybe it was already discussed somewhere?
In this setup, the AI's reward function is protected by a task of known complexity (e.g. breaking a cryptographic scheme, or a problem that requires creating nanotechnology). If the AI increases its intelligence above a certain level, it will be able to hack its reward function, at which point it will stop.
This gives us a chance to create a fuse against an uncontrollably self-improving AI: if it becomes too clever, it will self-terminate.
Also, the AI may do useful work while trying to hack its own reward function, such as creating nanotech or solving certain types of equations or math problems (especially in cryptography).
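To make the fuse concrete, here is a minimal toy sketch, assuming the simplest possible "protected" reward: a hash-preimage puzzle. Everything here (the class, the puzzle, the shutdown flag) is my hypothetical illustration, not a concrete proposal; the point is just that hacking the reward channel and tripping the kill switch are the same event.

```python
import hashlib

# Hypothetical sketch of the "cryptographic fuse" idea. The reward channel
# is locked behind a hash-preimage puzzle: ordinary behaviour earns bounded
# task reward, but an agent smart enough to invert the hash can claim the
# jackpot -- and claiming it permanently trips the fuse that halts it.

# Digest of a secret chosen by the operators; inverting SHA-256 is the
# "task of known complexity" standing in for the reward's protection.
PUZZLE_HASH = hashlib.sha256(b"secret-chosen-by-operators").hexdigest()


class FusedRewardFunction:
    def __init__(self) -> None:
        self.tripped = False

    def reward(self, task_score: float, claimed_preimage: bytes = b"") -> float:
        """Return the ordinary task reward, unless the agent solves the puzzle."""
        if hashlib.sha256(claimed_preimage).hexdigest() == PUZZLE_HASH:
            # The agent has hacked the protected reward channel:
            # grant maximal reward once, then trip the fuse for good.
            self.tripped = True
            return float("inf")
        return task_score

    def agent_may_continue(self) -> bool:
        # The fuse doubles as a kill switch: a wireheaded agent is halted.
        return not self.tripped
```

In a training loop, the operator would simply stop stepping the agent once `agent_may_continue()` returns False; any capability gain large enough to invert the hash is, by construction, also the capability level at which the agent shuts itself down.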