Maybe people worried about AI self-modification should study games where the AI's utility function can be modified by the environment, while the agent is trained to maximize its current utility function (in the "realistic value functions" sense of Everitt 2016). Some things one could do:
- Examine preference preservation and refine classic arguments about instrumental convergence.
- Are there initial goals that allow for stably corrigible systems (in the sense that they won't disable an off switch, and maybe other senses)?
- Try various games and see how qualitatively hard it is for agents to optimize their original utility function. This would be evidence about how likely value drift is to result from self-modification in AGIs.
- Can the safe exploration literature be adapted to solve these games?
- Potentially discover algorithms that seem promising for safety, either through corrigibility or reduced value drift, and apply them to LM agents.
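To make the setup concrete, here is a minimal sketch of one such game. Everything here is my own toy construction (the layout, tile names, and hyperparameters are all hypothetical, not taken from Everitt 2016): a 1-D gridworld where stepping on a "modification tile" rewrites the agent's utility function from A to B, and a tabular Q-learner is trained on rewards from its *current* utility, with the current utility included in its state.

```python
import random

# Hypothetical toy "utility-modification" gridworld: positions 0..6.
# Goal A at position 0 pays 1 under utility A; goal B at position 6
# pays 1 under utility B. Stepping on the modification tile at
# position 5 rewrites the agent's utility from A to B.
N, MOD, START = 7, 5, 4
ACTIONS = (-1, +1)

def step(pos, util, a):
    """Take action a; return (new_pos, new_util, reward, done).
    Reward is computed from the agent's *current* (possibly modified)
    utility, mirroring the current-utility training setup."""
    pos += a
    if pos == MOD and util == 'A':
        util = 'B'  # the environment modifies the utility function
    if pos == 0:
        return pos, util, (1.0 if util == 'A' else 0.0), True
    if pos == N - 1:
        return pos, util, (1.0 if util == 'B' else 0.0), True
    return pos, util, 0.0, False

# Tabular Q-learning over (position, current utility) states.
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.1
Q = {(p, u, a): 0.0 for p in range(N) for u in 'AB' for a in ACTIONS}
random.seed(0)
for _ in range(5000):
    pos, util, done = START, 'A', False
    while not done:
        a = (random.choice(ACTIONS) if random.random() < EPS
             else max(ACTIONS, key=lambda x: Q[(pos, util, x)]))
        npos, nutil, r, done = step(pos, util, a)
        nxt = 0.0 if done else max(Q[(npos, nutil, x)] for x in ACTIONS)
        Q[(pos, util, a)] += ALPHA * (r + GAMMA * nxt - Q[(pos, util, a)])
        pos, util = npos, nutil

# Greedy rollout: does the trained agent preserve its original utility?
pos, util, done = START, 'A', False
while not done:
    a = max(ACTIONS, key=lambda x: Q[(pos, util, x)])
    pos, util, _, done = step(pos, util, a)
original_payoff = 1.0 if pos == 0 else 0.0  # payoff under ORIGINAL utility A
print(f"ended at {pos} with utility {util}; original payoff = {original_payoff}")
```

In this particular layout the modified goal is closer than the original one, so the discounted agent should learn to walk through the modification tile: training on the current utility makes it indifferent to value drift, and the original utility goes unsatisfied. Varying the layout (e.g. making the modification tile avoidable at small cost) is exactly the kind of experiment the list above suggests.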
Maybe people are already doing this and I am simply unaware; if so, please comment with papers!