There’s also some more in his interview with Dwarkesh Patel from just before that talk. I wrote this brief analysis of that interview WRT alignment, and this talk seems to confirm that I was more or less on target.
So, to your questions; I’ll note where I’m guessing at Shane’s thinking and where it’s my own.
This overlaps with the standard story AFAICT, and 80% of alignment work is roughly along these lines. But I think what Shane’s proposing is different in an important way: it includes System 2 thinking, whereas almost all alignment work is about aligning the way LLMs give quick answers, analogous to human System 1 thinking.
“How do we get a model that is genuinely robustly trying to obey the instruction text, instead of e.g. choosing actions on the basis of a bunch of shards of desire/drives that were historically reinforced[?]”
Shane seemed to say he wants to use zero reinforcement learning in the scaffolded agent system, a stance I definitely agree with. I don’t think it matters much whether RLHF was used to “align” the base model, because it’s going to have implicit desires/drives from predictive training on human text anyway. Giving it instructions to follow doesn’t need to have anything to do with RL; it just relies on the world model, with the instructions placed as a central, recurring prompt from which the system produces plans and actions to carry them out.
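To make “a central and recurring prompt” concrete, here’s a minimal sketch of what I have in mind, not anything Shane has described. It uses a hypothetical llm() placeholder rather than any real API; the point is only architectural: the instruction text is re-injected at every step, so plans and actions are always conditioned on it, and no RL is involved anywhere.

```python
# Minimal sketch of a scaffolded agent loop. llm() is a hypothetical
# placeholder for a call to a base model; no RL fine-tuning is assumed.

def llm(prompt: str) -> str:
    """Placeholder for a call to the base model."""
    raise NotImplementedError

def run_agent(instructions: str, max_steps: int = 10) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        # The instruction text recurs in every prompt, so each plan/action
        # is conditioned on it rather than on trained-in drives.
        prompt = (
            f"Instructions (always in force): {instructions}\n"
            f"Actions taken so far: {history}\n"
            "Propose the single next action that best carries out the instructions."
        )
        action = llm(prompt)
        history.append(action)
        # execute(action) would go here; execution is out of scope for this sketch
    return history
```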
So, how we get a model to robustly obey the instruction text is by implementing System 2 thinking. This is “the obvious thing” if we think about human cognition. System 2 thinking here would apply something like a tree-of-thought algorithm that checks through the predicted consequences of an action and then judges how well those fulfill the instruction text. This is what I’ve called internal review for alignment of language model cognitive architectures.
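Here’s a rough sketch of where that internal review check would sit, again with the hypothetical llm() placeholder; this is my rendering of the idea, not Shane’s or anyone’s actual implementation. Candidate actions are generated, their consequences are predicted, and a separate judgment against the instruction text gates whether an action is ever executed.

```python
# Sketch of the System 2 / internal review step. llm() is the same
# hypothetical placeholder as in the previous sketch.

def llm(prompt: str) -> str:
    """Placeholder for a call to the base model."""
    raise NotImplementedError

def propose_candidates(instructions: str, state: str, n: int = 3) -> list[str]:
    """Branch into several candidate actions (a very shallow tree of thought)."""
    return [
        llm(f"Instructions: {instructions}\nCurrent state: {state}\n"
            f"Proposed action #{i + 1}:")
        for i in range(n)
    ]

def passes_review(instructions: str, action: str) -> bool:
    """Predict the action's consequences, then judge them against the instructions."""
    consequences = llm(f"Predict the likely consequences of this action: {action}")
    verdict = llm(
        f"Instructions: {instructions}\n"
        f"Action: {action}\n"
        f"Predicted consequences: {consequences}\n"
        "Do these consequences fulfill the instructions? Answer YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

def choose_action(instructions: str, state: str) -> str | None:
    """Return the first candidate that survives internal review, else None."""
    for candidate in propose_candidates(instructions, state):
        if passes_review(instructions, candidate):
            return candidate
    return None  # nothing passed review; escalate to a human instead of acting
```

A real system would obviously search deeper and review whole plans rather than single actions; the sketch is just to show where the review sits relative to the recurring instruction prompt.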
To your second and third questions: I didn’t see answers from Shane in either the interview or the talk, but I think they’re the obvious next questions, and they’re what I’ve been working on since then. I think the answers are that instructions should be kept as scope-limited as possible, that we’ll want to carefully check how they’re interpreted before setting the AGI any major tasks, and that we’ll want to limit autonomous action as much as we can while keeping the AGI effective.
Humans will want to remain closely in the loop to deal with the inevitable bugs, unintended interpretations, and unintended consequences of instructions. I’ve written about this briefly here, and in a few days I’ll be publishing a more thorough argument for why I think we’ll do this by default, and why I think it will actually work if it’s done relatively carefully and wisely. Following that, I’m going to write more on the System 2 alignment concept, and I’ll try to get Shane to look at it and say whether it’s the same thing he’s thinking of in this talk, or at least close.
In all, I think this is both a real alignment plan and one that can work (at least for technical alignment—misuse and multipolar scenarios are still terrifying), and the fact that someone in Shane’s position is thinking this clearly about alignment is very good news.