From what I can tell, Legg’s view is that aligning language models is mostly a function of capability. As a result, his alignment techniques focus mostly on getting models to understand our ethical principles, and on using deliberation to check whether the actions they take follow those principles. Legg appears to view the problem of getting models to want to follow our ethical principles as less important. Perhaps he thinks it will happen by default.
Dwarkesh pushed him on how we can get models to want to follow our ethical principles. Legg’s responses still focused mostly on model capabilities. The closest answer he gave, as far as I can tell, is that you have to “specify to the system: these are the ethical principles you should follow”, and you have to check the reasoning process the model uses to make decisions.
I agree that Legg’s focus was more on getting the model to understand our ethical principles, and he didn’t say much about how you make a system that follows any principles at all. My guess is that he’s thinking of something like a much-elaborated form of AutoGPT; that’s what I mean by a language model agent. You prompt it with “come up with a plan that follows these goals” and then have a review process that prompts with something like “now make sure this plan fulfills these goals”. But I’m just guessing that he’s thinking of a similar system. He might be deliberately vague so as not to share strategy with competitors, or maybe that’s just not what the interview focused on.
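To make that concrete, here’s a rough sketch of the kind of plan-then-review loop I’m imagining. This is my own illustration, not anything Legg described; the call_model function is a hypothetical stand-in for whatever LLM API such a system would use.

```python
# A minimal sketch of a plan-then-review agent loop (my guess at the architecture,
# not Legg's stated design). `call_model` is a hypothetical stand-in for an LLM API.

def call_model(prompt: str) -> str:
    """Hypothetical LLM call; swap in a real API client here."""
    raise NotImplementedError

def plan_and_review(goals: str, max_revisions: int = 3) -> str:
    # Ask the model for a plan that follows the stated goals/principles.
    plan = call_model(f"Come up with a plan that follows these goals:\n{goals}")

    for _ in range(max_revisions):
        # A separate review pass checks the plan against the same principles.
        review = call_model(
            f"Goals:\n{goals}\n\nPlan:\n{plan}\n\n"
            "Does this plan fulfill these goals? Answer APPROVED, or explain what to fix."
        )
        if review.strip().startswith("APPROVED"):
            return plan
        # Revise the plan using the reviewer's feedback, then check it again.
        plan = call_model(
            f"Goals:\n{goals}\n\nPlan:\n{plan}\n\nReviewer feedback:\n{review}\n\n"
            "Revise the plan so that it follows the goals."
        )
    return plan  # fall back to the latest revision if the budget runs out
```

Note that nothing in this loop makes the underlying model want to follow the principles; it only adds a capability-level check, which is why I think the question Dwarkesh asked is still open.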
I think this is one reasonable interpretation of his comments. But the fact that he:
1. Didn’t say very much about a solution to the problem of making models want to follow our ethical principles, and
2. Mostly talked about model capabilities even when explicitly asked about that problem
makes me think it’s not something he spends much time thinking about, and that he doesn’t consider it especially important to focus on.