Great post, thanks!
Alignment research requires strong consequentialist reasoning
Hmm, my take is “doing good alignment research requires strong consequentialist reasoning, but assisting alignment research doesn’t.” As a stupid example, Google Docs helps my alignment research, but Google Docs does not do strong consequentialist reasoning. So then we get a trickier question of exactly how much assistance we’re expecting here. If it’s something like “helping out on the margin / 20% productivity improvement / etc.” (which I find plausible), then great, let’s do that research, I’m all for it, but we wouldn’t really call it a “plan” or “approach to alignment”, right? By analogy, I think Lightcone Infrastructure has plausibly accelerated alignment research by 20%, but nobody would have ever said that the Lightcone product roadmap is a “plan to solve alignment” or anything like that, right?
The “response” in OP seems to maybe agree with my suggestion that “doing good alignment research” might require consequentialism but “assisting alignment research” doesn’t—e.g. “It seems clear that a much weaker system can help us on our kind of alignment research”. But I feel like the rest of the OP is inconsistent with that. For example, if we’re talking about “assisting” rather than “doing”, then we’re not changing the situation that “alignment research is mostly talent-constrained”, and we don’t have a way to “make alignment and ML research fungible”, etc., right?
There’s also the question of “If we’re trying to avoid our AIs displaying strong consequentialist reasoning, how do we do that?” (Especially since the emergence of consequentialist reasoning would presumably look like “hey cool, the AI is getting better at its task.”) Which brings us to:
Evaluation is easier than generation
I want to distinguish two things here: evaluation of behavior versus evaluation of underlying motives. The AI doesn’t necessarily have underlying motives in the first place, but if you wind up with an AI displaying consequentialist reasoning, it does. Anyway, when the OP discusses evaluation, it’s really “evaluation of behavior”, I think. I agree that evaluation of behavior is by and large easier than generation. But I think that evaluating underlying motives is hard and different, probably requiring interpretability beyond what we’re capable of today. And I think that if the underlying motives are bad, you can get behavioral outputs that are not just bad in the normal way but adversarially selected—e.g., hacking / manipulating the human evaluator—in which case the behavioral evaluation part suddenly gets much harder than one would normally expect.
UPSHOT:
I’m moderately enthusiastic about the creation of tools to help me and other alignment researchers work faster and smarter, other things equal.
However, if those tools go equally to alignment & capabilities researchers, that makes me negative on the whole thing, because I put high weight on the concepts-as-opposed-to-scaling side of ML being important (e.g. future discovery of a “transformer-killer” architecture, as you put it).
I mostly expect that pushing the current LLM+RLHF paradigm will produce systems that are marginally better at “assisting alignment research” but not capable of “doing alignment research”, and that are also not dangerous consequentialists, although that’s a hard thing to be confident about.
If I’m wrong about not getting dangerous consequentialists from the current LLM+RLHF paradigm—or if you have new ideas that go beyond the current LLM+RLHF paradigm—then I would be concerned that your day-to-day project incentives would push you towards making dangerous consequentialists (since I expect them to do better alignment research), and particularly concerned that you wouldn’t necessarily have a way to notice that this is happening.
I think AIs that can do good alignment research, and not just assist it—such that we get almost twice as much alignment research progress from twice as many GPUs—will arrive so close to the endgame that we shouldn’t be factoring them into our plans too much (see here).
Yeah, you could reformulate the question as “how much consequentialist reasoning do you need to do 95% or 99% of the alignment work?” Maybe the crux is what we mean by consequentialist reasoning. For example, if you build a proof oracle AlphaZero-style, would that be a consequentialist? Since it’s trained with RL to successfully prove theorems, you could argue it’s a consequentialist (it’s the distillation of a planning process), but it’s also relatively myopic in the sense that it doesn’t care about anything that happens after the current theorem is proved. My sense is that in practice it’ll matter a lot where you draw your episode boundaries (at least in the medium term), and as you point out there are a bunch of tricky open questions about how to think about this.
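To make the episode-boundary point a bit more concrete, here’s a minimal illustrative sketch in Python (the function names and the toy reward scheme are mine, not from the post or any particular training setup): with per-theorem episodes, the return only credits the current proof attempt, whereas one long episode spanning many theorems lets credit flow across theorem boundaries, pushing the policy toward less myopic behavior.

```python
# Illustrative sketch only: how episode boundaries change the return an
# RL-trained prover is optimized for. All names here are hypothetical.

from typing import List

def per_theorem_returns(rewards_per_theorem: List[List[float]], gamma: float = 0.99) -> List[float]:
    """Discounted return computed separately for each theorem (myopic episodes)."""
    returns = []
    for episode in rewards_per_theorem:
        g = 0.0
        for r in reversed(episode):
            g = r + gamma * g
        returns.append(g)
    return returns

def cross_theorem_return(rewards_per_theorem: List[List[float]], gamma: float = 0.99) -> float:
    """Discounted return over one long episode spanning all theorems:
    credit (and thus incentive) leaks across theorem boundaries."""
    flat = [r for episode in rewards_per_theorem for r in episode]
    g = 0.0
    for r in reversed(flat):
        g = r + gamma * g
    return g

# Toy example: reward 1.0 when a proof step succeeds, 0.0 otherwise.
rewards = [[0.0, 0.0, 1.0], [0.0, 1.0]]
print(per_theorem_returns(rewards))   # each theorem scored on its own
print(cross_theorem_return(rewards))  # later successes also raise the value of earlier steps
```

The same prover could sit on either side of this line; which objective it is actually optimized for depends on where those boundaries are drawn.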
I agree with your point about evaluation of behavior. I also agree that motives matter, but an important consideration is whether you picture them coming from a reward model (RM), which we can test extensively and hopefully interpret somewhat, or from some opaque inner optimizer. I’m pretty bullish on evaluating both the RM (average case + adversarially) and the behavior.
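As a rough illustration of the “average case + adversarially” idea, here’s a hedged sketch (the rm and gold callables are hypothetical placeholders, not any real API): average-case accuracy of the reward model on held-out preference pairs, plus a crude search for the response the RM most over-rates relative to a trusted judgment.

```python
# Illustrative sketch only: two ways to evaluate a reward model (RM).
# The rm/gold callables are hypothetical stand-ins, not a real library API.

from typing import Callable, List, Tuple

def average_case_accuracy(
    rm: Callable[[str, str], float],
    preference_pairs: List[Tuple[str, str, str]],  # (prompt, chosen, rejected)
) -> float:
    """Fraction of held-out pairs where the RM prefers the human-chosen response."""
    correct = sum(
        1 for prompt, chosen, rejected in preference_pairs
        if rm(prompt, chosen) > rm(prompt, rejected)
    )
    return correct / max(len(preference_pairs), 1)

def adversarial_gap_search(
    rm: Callable[[str, str], float],
    gold: Callable[[str, str], float],   # trusted evaluation, e.g. careful human ratings
    prompt: str,
    candidate_responses: List[str],
) -> Tuple[str, float]:
    """Return the (response, gap) pair maximizing RM score minus gold score:
    the response the RM most over-rates, i.e. the closest thing to a reward
    hack within this candidate pool."""
    gaps = [(resp, rm(prompt, resp) - gold(prompt, resp)) for resp in candidate_responses]
    return max(gaps, key=lambda pair: pair[1])
```

In practice the adversarial part would involve something stronger than scoring a fixed candidate pool (e.g. optimizing inputs against the RM directly), but the gap-maximizing objective is the same idea.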