My general thoughts on DeepMind’s strategy can be found in my comment here, which also discusses the impact of RL agentizing an AI more generally. The short answer: I’m a little more concerned than in the case of pre-trained AIs like GPT-4 or GPT-N, and some more alignment work should go toward that scenario, but the reward is likely to be densely defined, such that the AI has limited opportunities to break it via instrumental convergence:
https://www.lesswrong.com/posts/83TbrDxvQwkLuiuxk/?commentId=DgLC43S7PgMuC878j
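To make the “densely defined reward” point a bit more concrete, here is a minimal toy sketch (my own illustration, not taken from the linked comment) contrasting a dense, per-step reward with a sparse, end-of-episode reward in a simple RL-style loop. All names here (`ToyEnv`, `dense_reward`, `sparse_reward`) are invented for this example. The intuition is that a dense signal supervises nearly every action, leaving fewer unconstrained degrees of freedom for the policy, whereas a sparse signal only checks the final outcome.

```python
# Toy illustration of dense vs. sparse reward signals in a simple RL-style loop.
# Everything here is a made-up sketch for intuition, not anyone's actual training setup.
import random


class ToyEnv:
    """A 1-D walk: the agent starts at position 0 and tries to reach position `goal`."""

    def __init__(self, goal=10, max_steps=50):
        self.goal, self.max_steps = goal, max_steps

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, action):  # action is -1 or +1
        self.pos += action
        self.t += 1
        done = self.pos == self.goal or self.t >= self.max_steps
        return self.pos, done


def dense_reward(prev_pos, pos, goal):
    # Feedback on every single step: moving toward the goal is rewarded,
    # moving away is penalized, so nearly all behaviour is supervised.
    return 1.0 if abs(goal - pos) < abs(goal - prev_pos) else -1.0


def sparse_reward(pos, goal, done):
    # Feedback only at the end of the episode: a single bit of signal,
    # leaving intermediate behaviour unconstrained.
    return 1.0 if (done and pos == goal) else 0.0


def run_episode(env, policy, use_dense):
    state = env.reset()
    total, done = 0.0, False
    while not done:
        prev = state
        action = policy(state)
        state, done = env.step(action)
        if use_dense:
            total += dense_reward(prev, state, env.goal)
        else:
            total += sparse_reward(state, env.goal, done)
    return total


if __name__ == "__main__":
    env = ToyEnv()
    noisy_policy = lambda s: random.choice([-1, 1])  # a deliberately bad policy
    print("dense return :", run_episode(env, noisy_policy, use_dense=True))
    print("sparse return:", run_episode(env, noisy_policy, use_dense=False))
```

On this toy framing, a sloppy or adversarial policy gets immediate negative feedback under the dense signal, while under the sparse signal almost everything it does goes unpenalized; that is the sense in which a densely defined reward leaves less room for instrumentally convergent exploitation.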
(BTW, I also see this as a problem for Section B.2, as its examples rely on the analogy to evolution, but there are critical details that prevent us from generalizing from “Evolution failed at aligning us to X” to “Humans can’t align AIs to X”. It also incorrectly assumes that corrigibility is anti-natural for consequentialist/expected-utility-maximizing AIs and highly capable AIs; GPT-4 and GPT-N likely do have a utility function, but one that is learned, as described here:
https://www.lesswrong.com/posts/k48vB92mjE9Z28C3s/implied-utilities-of-simulators-are-broad-dense-and-shallow
https://www.lesswrong.com/posts/vs49tuFuaMEd4iskA/one-path-to-coherence-conditionalization
I might have more to say on that post later.)
I’ve replied to AGI Ruin: A List of Lethalities here:
https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/?commentId=Gcigdmuje4EacwirD
It’s a very long comment, since I had to respond to a lot of points, so grab a drink and a snack while you read it.
I endorse the link to that other comment. We’ve got what feels like a useful discussion over there on exactly this issue.
Note that I wrote yesterday about how I’d actually do alignment in practice, such that we can get the densely defined signal of human values/instruction following to hold:
https://www.lesswrong.com/posts/83TbrDxvQwkLuiuxk/?commentId=BxNLNXhpGhxzm7heg