The part I find most significant and concerning is his statement that the path forward is something like Gemini crossed with AlphaZero. Most of the concerns about aligning foundation model agents center on the unpredictable results of using powerful RL, which creates goal-directed behavior in complex, hidden ways. I find those concerns highly realistic; the List of Lethalities (LoL) focuses on these risks, and on the subtle but powerful principle “humans fuck things up, especially on the first (hundred)? tries”.
That’s why I’d much rather see us try Goals selected from learned knowledge: an alternative to RL alignment for our first alignment attempts. This might be possible even if RL is heavily involved in some elements of training; I’m trying to think through those scenarios now.
My general thoughts on DeepMind’s strategy can be found in my comment here, which also discusses the impact of RL agentizing an AI more generally. Short answer: I’m a little more concerned than in the case of pre-trained AIs like GPT-4 or GPT-N, and more alignment work should go toward that scenario, but the reward is likely to be densely defined, such that the AI has limited opportunities for breaking it via instrumental convergence:
https://www.lesswrong.com/posts/83TbrDxvQwkLuiuxk/?commentId=DgLC43S7PgMuC878j
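To make the “densely defined reward” point a bit more concrete, here is a minimal, purely illustrative Python sketch. The toy environment, field names, and penalty weights are all hypothetical (not from the linked comment or any actual DeepMind setup); it just contrasts a sparse outcome-only reward with a per-step reward that scores every action, which is the property that leaves the AI limited room to profit from instrumentally convergent detours.

```python
# Toy illustration: sparse, outcome-only reward vs. a densely defined
# per-step reward. With dense feedback on every action, there are fewer
# unsupervised gaps in which an instrumentally convergent strategy
# (e.g. grabbing extra resources) could pay off before being penalized.

def sparse_reward(trajectory):
    """Feedback only on the final outcome; intermediate behavior is unjudged."""
    return 1.0 if trajectory[-1]["goal_reached"] else 0.0

def dense_step_reward(step):
    """Hypothetical per-step signal scoring each action against approved behavior."""
    score = 1.0 if step["followed_instruction"] else -1.0
    score -= 10.0 * step["resources_grabbed"]  # immediate penalty for power-seeking moves
    return score

def dense_return(trajectory):
    return sum(dense_step_reward(step) for step in trajectory)

# Example: a trajectory that reaches the goal but grabs resources along the way.
trajectory = [
    {"followed_instruction": True, "resources_grabbed": 0, "goal_reached": False},
    {"followed_instruction": False, "resources_grabbed": 2, "goal_reached": False},
    {"followed_instruction": True, "resources_grabbed": 0, "goal_reached": True},
]
print(sparse_reward(trajectory))  # 1.0 -- the detour is invisible to the sparse signal
print(dense_return(trajectory))   # negative -- the dense signal penalizes it immediately
```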
(BTW, I also see this as a problem for Section B.2, as its examples rely on the analogy to evolution, but there are critical details that prevent us from generalizing from “Evolution failed at aligning us to X” to “Humans can’t align AIs to X”. It also incorrectly assumes that corrigibility is anti-natural for consequentialist/expected-utility-maximizing AIs and highly capable AIs, because GPT-4 and GPT-N likely do have a utility function, one that is learned as described here:
https://www.lesswrong.com/posts/k48vB92mjE9Z28C3s/implied-utilities-of-simulators-are-broad-dense-and-shallow
https://www.lesswrong.com/posts/vs49tuFuaMEd4iskA/one-path-to-coherence-conditionalization
I might have more to say on that post later.)
I’ve replied to AGI Ruin: A List of Lethalities here:
https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/?commentId=Gcigdmuje4EacwirD
It’s a very long comment, since I had to respond to a lot of points, so get a drink and a snack before you read it.
I endorse the link to that other comment. We’ve got what feels like a useful discussion over there on exactly this issue.
Note that I wrote yesterday about how I’d actually do alignment in practice, such that we can get the densely defined signal of human values/instruction following to hold:
https://www.lesswrong.com/posts/83TbrDxvQwkLuiuxk/?commentId=BxNLNXhpGhxzm7heg