To my ears it sounded like Shane’s solution to “alignment” was to make the models more consequentialist. I really don’t think he appreciates most of the difficulty and traps of the problems here. This type of thinking, on my model of their models, should make even alignment optimists unimpressed, since much of the reason for optimism lies in observing current language models, and interpreting their outputs as being nonconsequentialist, corrigible, and limited in scope yet broad in application.
Consequentialism is not uniformly bad. I think the specific way Shane wants to make the models more consequentialist defends against some failure modes. Deep Deceptiveness is essentially about humans being misled because the module that checks for safety of an action is shallow, and the rest of the model is smarter than it and locally pointed at goals that are more difficult to satisfy without misleading the operators. If the model in this story were fully aware of the consequences of its actions through deliberation, it could realize that modeling the human operators in a different ontology in order to route around them is still bad. (I feel like self-knowledge is more important here though.)
Deliberation also does not have to be consequentialist. The model could deliberate to ensure it’s not breaking some deontological rule, and this won’t produce instrumental pressure towards a coup.
Would be curious to hear your idea of some of the “difficulty and traps”.
It seems a good format for explaining such stuff would be a dialogue, given the probable inferential gap between my model and yours. I’d be happy to have one if you like.
We started a dialogue, which will live here when we post it.
I think it is about making the models more consequentialist, in the sense of making them smarter and more agentic.
I don’t see evidence that he’s ignoring the hard parts of alignment. And I’m not even sure how optimistic he is, beyond presumably thinking success is possible.
You could be right in assuming those are his reasons for optimism. That does seem ungenerous, but it could be true. Those definitely aren’t my reasons for my limited optimism. See my linked pieces for those. Language model agents are modestly consequentialist and not automatically corrigible, but they have unique and large advantages for alignment. I’m a bit puzzled at the lack of enthusiasm for that direction; I’ve only gotten vague criticisms along the lines of “that can’t possibly work because there are ways for it to possibly fail”. The lines of thought those critiques reference just argue that alignment is hard, not that it’s impossible or that a natural language alignment approach couldn’t work. So I’m really hoping to get some more direct engagement with these ideas.
He should be rather optimistic because otherwise he probably wouldn’t stay at DeepMind.
I also don’t remember him saying much about the problems of misuse, AI proliferation, and Moloch, or about the issue of choosing the particular ethics for the AGI, so I take this as weak indirect evidence that DeepMind has a plan similar to OpenAI’s “superalignment”, i.e., “we will create a cognitively aligned agent and will task it with solving the rest of societal and civilisational alignment and coordination issues”.
You could be right, but I didn’t hear any hints that he intends to kick those problems down the road to an aligned agent. That’s Conjecture’s CoEm plan, but I read OpenAI’s Superalignment plan as even more vague: make AI better so it can help with alignment, prior to being AGI. Theirs was sort of a plan to create a plan. I like Shane’s better, in part because it’s closer to being an actual plan.
He did explicitly note that choosing the particular ethics for the AGI is an outstanding problem, but I don’t think he proposed solutions, either AI or human. I think corrigibility as the central value gives as much time to solve the outer alignment problem as you want (a “long contemplation”), after the inner alignment problem is solved, but I have no idea if his thinking is similar.
I also don’t think he addressed misuse, proliferation, or competition. I can think of multiple reasons for keeping them offstage, but I suspect they just didn’t happen to make the top priority list for this relatively short interview.
That is very generous. My impression was that Shane Legg does not even know what the alignment problem is, or at least tried to give the viewer the idea that he didn’t. His “solution” to the “alignment problem” was to give the AI a better world model and the ability to reflect, which obviously isn’t an alignment solution; it’s just a capabilities enhancement. Dwarkesh Patel seemed confused for the same reason.
I have more respect than that for Shane. He has been thinking about this stuff for a long time, and my guess is he has some models in the space here (I don’t know how good they are, but I am confident he knows the rough shape of the AI Alignment problem).
See also: https://www.lesswrong.com/users/shane_legg