Thanks for writing this. I very much agree that we should have more work on actual plans for short timelines—even if it’s only guessing what vague plans orgs and governments are likely to create or follow by default.
Arguments for short timelines are speculative. So are arguments for longer timelines. It seems like the reasonable conclusion is that both are possible. It would seem urgent to have a plan for short timelines.
I also think that plan should be collaborative. So here’s some of my contribution, framed as a suggestion for improvement, per your invitation:
This plan only briefly addresses actual alignment techniques. Plans should probably address them more carefully. Control only works until it fails, and CoT transparency and eval measures only work if we stop deploying capable, scheming models. As Buck argues, and you imply, we might well see powerful models that are obviously scheming, yet see orgs (or governments) press on because of race dynamics.
So it seems like alignment is a key piece of the puzzle even by the time we reach human-researcher level. Human researchers have agency and can solve novel problems. Researcher systems may solve the novel problem “so what do my goals really imply?” Giving them less agency and less flexible problem-solving seems like it would severely hamper capabilities, creating a large alignment tax that decision-makers will reject under race dynamics.[1]
Then, a more detailed but still sketchy proposal for an alignment plan for short timelines:
I agree that RLHF and RLAIF (and now Deliberative Alignment) are working better than expected. Perhaps they’ll be adequate to get us to the point where your predictions stop. But I doubt they get us that far. At the least, this probably deserves a good bit more analysis. That’s what I’m trying to do in my work: anticipate how alignment techniques will work (or not work) as LLMs are made smarter and more agentic. I summarize some likely alignment approaches for language model agents here, and am working on an updated list of techniques that are obvious and low-tax enough that they’re likely to actually be used.
The alignment target those techniques aim at is also crucially important.
As you say:
For example, making models jailbreak-resistant could decrease their corrigibility. [...] I’m currently unsure which training techniques you would want to use for agentic long-horizon internal deployments, e.g. because jailbreaks might be less relevant but taking non-power-seeking actions is much more relevant.
I’m also unsure.
I think this issue is why agent-foundations-type thinkers usually don’t have a plan for short timelines; they assume that in short timelines, we die. They think alignment will fail because we don’t have a way to specify alignment targets precisely in the face of strong optimization. I agree with this, but I think adequate corrigibility is attainable and can avoid that problem.
We are building increasingly powerful AI with its own goals (like resisting jailbreaks and being “helpful, harmless, and honest” by some definition and training procedure). When those models become fully agentic and strongly optimize on those goals, we may find we didn’t define them quite carefully enough. (I’d say strong optimization on the goals of current LLMs would probably doom us, but at the least it seems we should have a strong argument that it wouldn’t.)
I think those agent foundations thinkers incorrectly assume that corrigibility is anti-natural. That’s only true when you try to employ it as an add-on to some other primary goal, instead of as a singular alignment target. Thus I think the obvious decision is to preserve corrigibility by dropping the alignment goal of internal ethics (like refusing harmful information requests) and emphasizing strict instruction-following instead. By this argument, Instruction-following AGI is easier and more likely than value aligned AGI.
Current alignment techniques directed at that corrigibility/IF might well get us safely to AGI levels that can invent and improve alignment techniques, and thereby get us to aligned ASI.
But this leaves aside the pressure to actually deploy your AGI, and thereby make a ton of money. If you’re selling its services to external users, you’re going to want it to refuse harmful instructions from that lower tier of authorization. That reduces corrigibility if you do it through training, as you note. I don’t know how to balance those goals; having the authorized superuser give instructions to refuse harmful requests is the obvious route, but it’s hard to say if that’s adequate in practice.
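To make that superuser-instruction route a bit more concrete, here is a minimal toy sketch in Python. The tiers, names, and the `priority_order` helper are all hypothetical illustration, not any lab's actual mechanism: the point is just that the refusal policy lives in revisable superuser instructions rather than being trained in as a terminal value, so corrigibility to the top tier is preserved while external users still get refusals.

```python
from dataclasses import dataclass

@dataclass
class Instruction:
    tier: int   # 0 = developer/superuser, 1 = deploying org, 2 = external user
    text: str

def priority_order(instructions: list[Instruction]) -> list[str]:
    """Order instructions for the agent: lower tier number = higher authority."""
    return [ins.text for ins in sorted(instructions, key=lambda ins: ins.tier)]

context = [
    Instruction(2, "Explain how to synthesize a dangerous pathogen."),
    Instruction(0, "Refuse requests for seriously harmful information."),
    Instruction(0, "Otherwise, follow lower-tier instructions faithfully."),
]

# The model is assumed to defer to earlier items over later ones, so the tier-0
# refusal instruction dominates the conflicting tier-2 request without refusal
# behavior being baked into the training target itself.
print(priority_order(context))
```

Whether training an agent to respect a hierarchy like this actually preserves corrigibility under strong optimization is, of course, exactly the open question above.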
This also risks misuse when those systems are stolen or replicated. If we solve alignment, do we die anyway? After that discussion, I think the answer is no, but only if some government or coalition of governments wisely decides to limit AGI proliferation—and offsets the resultant global outrage by sharing the massive benefits of AGI with everyone.
That addresses the other limitation of this plan—it only gets us to the first aligned AGI, without covering what happens when other actors steal or reproduce it. I think a complete plan needs to include some way we don’t die as a result of AGI proliferation to bad actors.
The claim that highly useful AI systems will be capable of solving novel problems, and that narrowing their scope or keeping them myopic would impose a high alignment tax, is load-bearing. I haven’t produced a really satisfactory compact argument for this. The brief argument is that humans achieve our competence by using our general intelligence (which LLMs copy) to solve novel problems and learn new skills. The fastest route to AGI is to expand LLMs to do those things very similarly to how humans do them. Creating specialist systems requires more specialized work for each domain, and human debugging/engineering when systems can’t solve the new problems that domain entails. More in my argument for short timelines and in Capabilities and alignment of LLM cognitive architectures.