Thanks, that’s useful to know. If you have the time, can you say some more about ‘control of an emerging AI’s preferences’? I sketch out a proposed training regimen for the preferences that we want, and argue that this regimen largely circumvents the problems of reward misspecification, goal misgeneralization, and deceptive alignment. Are you not convinced by that part? Or is there some other problem I’m missing?