I’ve been reading a lot of the stuff that you have written and I agree with most of it (like 90%). However, one thing which you mentioned (somewhere else, but I can’t seem to find the link, so I am commenting here) and which I don’t really understand is iterative alignment.
I think that the iterative alignment strategy has an ordering error – we first need to achieve alignment to safely and effectively leverage AIs.
Consider a situation where AI systems go off and “do research on alignment” for a while, simulating tens of years of human research work. The problem then becomes: how do we check that the research is indeed correct, and not wrong, misguided, or even deceptive? We can’t just assume this is the case, because the only way to fully trust an AI system is if we’d already solved alignment, and knew that it was acting in our best interest at the deepest level.
Thus we need to have humans validate the research. That is, even automated research runs into a bottleneck of human comprehension and supervision.
The appropriate analogy is not one researcher reviewing another, but rather a group of preschoolers reviewing the work of a million Einsteins. It might be easier and faster than doing the research itself, but it will still take years and years of effort and verification to check any single breakthrough.
Fundamentally, the problem with iterative alignment is that it never pays the cost of alignment. Somewhere along the story, alignment gets implicitly solved.
Thanks for reading, and responding! It’s very helpful to know where my arguments cease being convincing or understandable.
I fully agree that just having AI do the work of solving alignment is not a good or convincing plan. You need to know that AI is aligned to trust it.
Perhaps the missing piece is that I think alignment is already solved for LLM agents. They don’t work well, but they are quite eager to follow instructions. Adding more alignment methods as they improve makes good odds that our first capable/dangerous agents are also aligned. I listed some of the obvious and easy techniques we’ll probably use in Internal independent review for language model agent alignment. I’m not happy with the clarity of that post, though, so I’m currently working on two followups that might be clearer.
Or perhaps the missing link is going from aligned AI systems to aligned “Real AGI”. I do think there’s a discontinuity in alignment once a system starts to learn continuously and reflect on its beliefs (which change how its values/goals are interpreted). However, I think the techniques most likely to be used are probably adequate to make those systems aligned—IF that alignment is for following instructions, and the humans wisely instruct it to be honest about ways its alignment could fail.
So that’s how I get to the first aligned AGI at roughly human level or below.
From there it seems easier, although still possible to fail.
If you have an agent that’s aligned and smarter than you, you can trust it to work on further alignment schemes. It’s wiser to spot-check it, but the humans’ job becomes making sure the existing AGI is truly aligned, and letting it do the work to align its successor, or keep itself aligned as it learns.
I usually think about the progression from AGI to superintelligence as one system/entity learning, being improved, and self-improving. But there’s a good chance that progression will look more generational, with several distinct systems/entities as successors with greater intelligence, designed by the previous system and/or humans. Those discontinuities seem to present more danger of getting alignment wrong
“Perhaps the missing piece is that I think alignment is already solved for LLM agents.”
Another concern that I might have is that maybe it only seems like alignment is solved for LLMs. For example, this, this, this and this short papers argue that that seemingly secure LLMs may not be as safe as we initially believe. And it appears that they test even our models that are considered to be more secure and still find this issue.
Ah, yes. That is quite a set of jailbreak techniques. When I say “alignment is solved for LLM agents”, I mean something different than what people mean by alignment for LLMs themselves.
I’m using alignment to mean AGI that does what its user wants. You are totally right that there’s an edge case and a problem if the principal “user”, the org that created this AGI, wants to sell access to others and have the AGI not follow all of those user’s instructions/desires. Which is exactly what they’ll want.
More in the other comment. I haven’t worked this through. Thanks for pointing it out.
This might mean that an org that develops LLM-based AGI systems can’t really widely license use of that system, and would have to design deliberately less capable systems. Or it might mean that they’ll put in a bunch of stopgap jailbreak prevention measures and hope they’re adequate when they won’t be.
This topic is quite interesting for me from the perspective of human survival, so if you do decide to make a post specifically about preventing jailbreaking, then please tag me (somewhere) so that I can read it.
Looking more generally, there seems to be a ton of papers that develop sophisticated jailbreak attacks (that succeed against current models). Probably more than I can even list here. Are there any fundamentally new defense techniques that can protect LLMs against these attacks (since the existing ones seem to be insufficient)?
EDIT: The concern behind this comment is better detailed in the next comment.
I also have a more meta-level layman concern (sorry if it will sound unusual). There seem to be a large number of jailbreaking strategies that all succeed against current models. To mitigate them, I can conceptually see 2 paths: 1) trying to come up with a different niche technical solution to each and every one of them individually or 2) trying to come up with a fundamentally new framework that happens to avoid all of them collectively.
Strategy 1 seems logistically impossible, as developers at leading labs (which are most likely to produce AGI) have to be aware of all of them (and they are often reported in relatively unknown papers). Furthermore, even if they somehow manage to monitor all reported jailbreaks, they would have to come up with so many different solutions, that it seems very unlikely to succeed.
Strategy 2 seems conceptually correct, but there seems to be no sign of it as even newer models are getting jailbreaked.
Re jailbreaks, I think this is not an example of alignment not being solved, but rather an example of how easy it is to misuse/control LLMs.
Also, a lot of the jailbreak successes rely on the fact that it’s been trained to accept a very wide range of requests for deployment reasons, which suggests narrowing the domain of acceptable questions for internal use could reduce the space of jailbreaks dramatically:
Current jailbreaks of chatbots often work by exploiting the fact that chatbots are trained to indulge a bewilderingly broad variety of requests—interpreting programs, telling you a calming story in character as your grandma, writing fiction, you name it. But a model that’s just monitoring for suspicious behavior can be trained to be much less cooperative with the user—no roleplaying, just analyzing code according to a preset set of questions. This might substantially reduce the attack surface.
I have 3 other concrete concerns about this strategy. So if I understand it correctly, the plan is for humans to align AGI and then for that AGI to align AGI and so forth (until ASI).
What if the strategy breaks on the first step? What if first AGI turns out to be deceptive (scheming) and only pretends to be aligned with humans. It seems like if we task such deceptive AGI to align other AGIs, then we will end up with a pyramid of misaligned AGIs.
What if the strategy breaks later down the line? What if AGI #21 accidentally aligns AGI #22 to be deceptive (scheming)? Would there be any fallback mechanisms we can rely on?
What is the end goal? Do we stop once we achieve ASI? Can we stop once we achieve ASI? What if ASI doesn’t agree and instead opts to continue self-improving? Are we going to be able to get to the point where the acceleration of ASI’s intelligence plateaus and we can recuperate and plan for future?
Unless we still have adequate mech interp or natural language train of thought to detect deceptive alignment
We die (don’t let your AGI fuck this step up!:)
22 chained independent alignment attempts does sound like too much. Hubinger specified that he wasn’t thinking of daisy-chaining like that, but having one trusted agent that keeps itself aligned as it grows smarter.
The first concern is absolutely critical, and one way to break the circularity issue is to rely on AI control, while another way is to place incentives that favor alignment as an equilibrium and make dishonesty/misalignment unfavorable, in the sense that you can’t have a continuously rewarding path to misalignment.
The second issue is less critical, assuming that AGI #21 hasn’t itself become deceptively aligned, because at that point, we can throw away #22 and restart from a fresh training run.
If that’s no longer an option, we can go to war against the misaligned AGI with our own AGI forces.
In particular, you can still do a whole lot of automated research once you break labor bottlenecks, and while this is a slowdown, this isn’t fatal, so we can work around it.
The third issue is if we have achieved aligned ASI, than we have at that point achieved our goal, and once humans are obsolete in making alignment advances, that’s when we can say the end goal has been achieved.
I agree with comments both by you and Seth. I guess that isn’t really part of an alignment as usually understood. However, I think it is a part of a broad preventing AI from killing humans strategy, so it’s still pretty important for our main goal.
I am not exactly sure I understand your proposal. Are you proposing that we radically gut our leading future model by restricting it severely? I don’t think any AI labs will agree to do so, because such future AI is much less useful than probably even current AIs.
Or are you proposing that we use AI monitors to monitor our leading future AI models and then we heavily restrict only the monitors?
I am not exactly sure I understand your proposal. Are you proposing that we radically gut our leading future model by restricting it severely? I don’t think any AI labs will agree to do so, because such future AI is much less useful than probably even current AIs.
Or are you proposing that we use AI monitors our leading future AI models and then we heavily restrict only the monitors?
My proposal is to restrain the AI monitor’s domain only.
I agree this is a reduction in capability from unconstrained AI, but at least in the internal use setting rather than deploying the AI, you probably don’t need, and maybe don’t want it to be able to write fictional stories or telling calming stories, but rather using the AI for specific employment tasks.
That’s pretty interesting, I do think that if iterative alignment strategy ends up working, then this will probably end up working too (if nothing else, then because this seems much easier).
I have some concerns left about iterative alignment strategy in general, so I will try to write them down below.
EDIT: On the second thought, I might create a separate question for it (and link it here), for the benefit of all of the people who concerned about the things (or similar things) that I am concerned about.
That would be great. Do reference scalable oversight to show you’ve done some due diligence before asking to have it explained. If you do that, I think it would generate some good discussion.
Sure, I might as well ask my question directly about scalable oversight, since it seems like a leading strategy of iterative alignment anyways. I do have one preliminary question (which probably isn’t worthy of being included in that post, given that it doesn’t ask about a specific issue or threat model, but rather about expectations of people).
I take it that this strategy relies on evaluation being easier than coming up with research? Do you expect this to be the case?
This isn’t something I’ve thought about adequately.
I think LLM agents will almost universally include a whole different mechanisms that can prevent jailbreaking: Internal independent review in which there are calls to a different model instance to check whether proposed plans and actions are safe (or waste time and money).
Once agents can spend your people’s money or damage their reputation, we’ll want to have them “think through” the consequences of important plans and actions before they execute.
As long as you’re engineering that and paying the compute costs, you might as well use it to check for harms as well- including checking for jailbreaking. If that check finds evidence of jailbreaking, it can just clear the model context, call for human review from the org, or suspend that account.
I don’t know how adequate that will be, but it will help.
This is probably worth thinking more about; Ii’ve sort of glossed over it while being concerned mostly about misalignment and misuse by fully authorized parties. But jailbreaking and misuse by clients could also be a major danger.
“If you have an agent that’s aligned and smarter than you, you can trust it to work on further alignment schemes. It’s wiser to spot-check it, but the humans’ job becomes making sure the existing AGI is truly aligned, and letting it do the work to align its successor, or keep itself aligned as it learns.”
Ah, that’s the link that I was missing. Now it makes sense. You can use AGI as a reviewer for other AGIs, once it is better than humans at reviewing AGIs. Thank you a lot for clarifying!
I’ve been reading a lot of the stuff that you have written and I agree with most of it (like 90%). However, one thing which you mentioned (somewhere else, but I can’t seem to find the link, so I am commenting here) and which I don’t really understand is iterative alignment.
I think that the iterative alignment strategy has an ordering error – we first need to achieve alignment to safely and effectively leverage AIs.
Consider a situation where AI systems go off and “do research on alignment” for a while, simulating tens of years of human research work. The problem then becomes: how do we check that the research is indeed correct, and not wrong, misguided, or even deceptive? We can’t just assume this is the case, because the only way to fully trust an AI system is if we’d already solved alignment, and knew that it was acting in our best interest at the deepest level.
Thus we need to have humans validate the research. That is, even automated research runs into a bottleneck of human comprehension and supervision.
The appropriate analogy is not one researcher reviewing another, but rather a group of preschoolers reviewing the work of a million Einsteins. It might be easier and faster than doing the research itself, but it will still take years and years of effort and verification to check any single breakthrough.
Fundamentally, the problem with iterative alignment is that it never pays the cost of alignment. Somewhere along the story, alignment gets implicitly solved.
Thanks for reading, and responding! It’s very helpful to know where my arguments cease being convincing or understandable.
I fully agree that just having AI do the work of solving alignment is not a good or convincing plan. You need to know that AI is aligned to trust it.
Perhaps the missing piece is that I think alignment is already solved for LLM agents. They don’t work well, but they are quite eager to follow instructions. Adding more alignment methods as they improve makes good odds that our first capable/dangerous agents are also aligned. I listed some of the obvious and easy techniques we’ll probably use in Internal independent review for language model agent alignment. I’m not happy with the clarity of that post, though, so I’m currently working on two followups that might be clearer.
Or perhaps the missing link is going from aligned AI systems to aligned “Real AGI”. I do think there’s a discontinuity in alignment once a system starts to learn continuously and reflect on its beliefs (which change how its values/goals are interpreted). However, I think the techniques most likely to be used are probably adequate to make those systems aligned—IF that alignment is for following instructions, and the humans wisely instruct it to be honest about ways its alignment could fail.
So that’s how I get to the first aligned AGI at roughly human level or below.
From there it seems easier, although still possible to fail.
If you have an agent that’s aligned and smarter than you, you can trust it to work on further alignment schemes. It’s wiser to spot-check it, but the humans’ job becomes making sure the existing AGI is truly aligned, and letting it do the work to align its successor, or keep itself aligned as it learns.
I usually think about the progression from AGI to superintelligence as one system/entity learning, being improved, and self-improving. But there’s a good chance that progression will look more generational, with several distinct systems/entities as successors with greater intelligence, designed by the previous system and/or humans. Those discontinuities seem to present more danger of getting alignment wrong
“Perhaps the missing piece is that I think alignment is already solved for LLM agents.”
Another concern that I might have is that maybe it only seems like alignment is solved for LLMs. For example, this, this, this and this short papers argue that that seemingly secure LLMs may not be as safe as we initially believe. And it appears that they test even our models that are considered to be more secure and still find this issue.
Ah, yes. That is quite a set of jailbreak techniques. When I say “alignment is solved for LLM agents”, I mean something different than what people mean by alignment for LLMs themselves.
I’m using alignment to mean AGI that does what its user wants. You are totally right that there’s an edge case and a problem if the principal “user”, the org that created this AGI, wants to sell access to others and have the AGI not follow all of those user’s instructions/desires. Which is exactly what they’ll want.
More in the other comment. I haven’t worked this through. Thanks for pointing it out.
This might mean that an org that develops LLM-based AGI systems can’t really widely license use of that system, and would have to design deliberately less capable systems. Or it might mean that they’ll put in a bunch of stopgap jailbreak prevention measures and hope they’re adequate when they won’t be.
I need to think more about this.
This topic is quite interesting for me from the perspective of human survival, so if you do decide to make a post specifically about preventing jailbreaking, then please tag me (somewhere) so that I can read it.
Looking more generally, there seems to be a ton of papers that develop sophisticated jailbreak attacks (that succeed against current models). Probably more than I can even list here. Are there any fundamentally new defense techniques that can protect LLMs against these attacks (since the existing ones seem to be insufficient)?
EDIT: The concern behind this comment is better detailed in the next comment.
I also have a more meta-level layman concern (sorry if it will sound unusual). There seem to be a large number of jailbreaking strategies that all succeed against current models. To mitigate them, I can conceptually see 2 paths: 1) trying to come up with a different niche technical solution to each and every one of them individually or 2) trying to come up with a fundamentally new framework that happens to avoid all of them collectively.
Strategy 1 seems logistically impossible, as developers at leading labs (which are most likely to produce AGI) have to be aware of all of them (and they are often reported in relatively unknown papers). Furthermore, even if they somehow manage to monitor all reported jailbreaks, they would have to come up with so many different solutions, that it seems very unlikely to succeed.
Strategy 2 seems conceptually correct, but there seems to be no sign of it as even newer models are getting jailbreaked.
What do you think?
Re jailbreaks, I think this is not an example of alignment not being solved, but rather an example of how easy it is to misuse/control LLMs.
Also, a lot of the jailbreak successes rely on the fact that it’s been trained to accept a very wide range of requests for deployment reasons, which suggests narrowing the domain of acceptable questions for internal use could reduce the space of jailbreaks dramatically:
I have 3 other concrete concerns about this strategy. So if I understand it correctly, the plan is for humans to align AGI and then for that AGI to align AGI and so forth (until ASI).
What if the strategy breaks on the first step? What if first AGI turns out to be deceptive (scheming) and only pretends to be aligned with humans. It seems like if we task such deceptive AGI to align other AGIs, then we will end up with a pyramid of misaligned AGIs.
What if the strategy breaks later down the line? What if AGI #21 accidentally aligns AGI #22 to be deceptive (scheming)? Would there be any fallback mechanisms we can rely on?
What is the end goal? Do we stop once we achieve ASI? Can we stop once we achieve ASI? What if ASI doesn’t agree and instead opts to continue self-improving? Are we going to be able to get to the point where the acceleration of ASI’s intelligence plateaus and we can recuperate and plan for future?
We die (don’t fuck this step up!:)
Unless we still have adequate mech interp or natural language train of thought to detect deceptive alignment
We die (don’t let your AGI fuck this step up!:)
22 chained independent alignment attempts does sound like too much. Hubinger specified that he wasn’t thinking of daisy-chaining like that, but having one trusted agent that keeps itself aligned as it grows smarter.
the endgame is to use Intent alignment as a stepping-stone to value alignment and let something more competent and compassionate than us monkeys handle things from there on out.
The first concern is absolutely critical, and one way to break the circularity issue is to rely on AI control, while another way is to place incentives that favor alignment as an equilibrium and make dishonesty/misalignment unfavorable, in the sense that you can’t have a continuously rewarding path to misalignment.
The second issue is less critical, assuming that AGI #21 hasn’t itself become deceptively aligned, because at that point, we can throw away #22 and restart from a fresh training run.
If that’s no longer an option, we can go to war against the misaligned AGI with our own AGI forces.
In particular, you can still do a whole lot of automated research once you break labor bottlenecks, and while this is a slowdown, this isn’t fatal, so we can work around it.
The third issue is if we have achieved aligned ASI, than we have at that point achieved our goal, and once humans are obsolete in making alignment advances, that’s when we can say the end goal has been achieved.
I agree with comments both by you and Seth. I guess that isn’t really part of an alignment as usually understood. However, I think it is a part of a broad preventing AI from killing humans strategy, so it’s still pretty important for our main goal.
I am not exactly sure I understand your proposal. Are you proposing that we radically gut our leading future model by restricting it severely? I don’t think any AI labs will agree to do so, because such future AI is much less useful than probably even current AIs.
Or are you proposing that we use AI monitors to monitor our leading future AI models and then we heavily restrict only the monitors?
My proposal is to restrain the AI monitor’s domain only.
I agree this is a reduction in capability from unconstrained AI, but at least in the internal use setting rather than deploying the AI, you probably don’t need, and maybe don’t want it to be able to write fictional stories or telling calming stories, but rather using the AI for specific employment tasks.
That’s pretty interesting, I do think that if iterative alignment strategy ends up working, then this will probably end up working too (if nothing else, then because this seems much easier).
I have some concerns left about iterative alignment strategy in general, so I will try to write them down below.
EDIT: On the second thought, I might create a separate question for it (and link it here), for the benefit of all of the people who concerned about the things (or similar things) that I am concerned about.
That would be great. Do reference scalable oversight to show you’ve done some due diligence before asking to have it explained. If you do that, I think it would generate some good discussion.
Sure, I might as well ask my question directly about scalable oversight, since it seems like a leading strategy of iterative alignment anyways. I do have one preliminary question (which probably isn’t worthy of being included in that post, given that it doesn’t ask about a specific issue or threat model, but rather about expectations of people).
I take it that this strategy relies on evaluation being easier than coming up with research? Do you expect this to be the case?
This isn’t something I’ve thought about adequately.
I think LLM agents will almost universally include a whole different mechanisms that can prevent jailbreaking: Internal independent review in which there are calls to a different model instance to check whether proposed plans and actions are safe (or waste time and money).
Once agents can spend your people’s money or damage their reputation, we’ll want to have them “think through” the consequences of important plans and actions before they execute.
As long as you’re engineering that and paying the compute costs, you might as well use it to check for harms as well- including checking for jailbreaking. If that check finds evidence of jailbreaking, it can just clear the model context, call for human review from the org, or suspend that account.
I don’t know how adequate that will be, but it will help.
This is probably worth thinking more about; Ii’ve sort of glossed over it while being concerned mostly about misalignment and misuse by fully authorized parties. But jailbreaking and misuse by clients could also be a major danger.
“If you have an agent that’s aligned and smarter than you, you can trust it to work on further alignment schemes. It’s wiser to spot-check it, but the humans’ job becomes making sure the existing AGI is truly aligned, and letting it do the work to align its successor, or keep itself aligned as it learns.”
Ah, that’s the link that I was missing. Now it makes sense. You can use AGI as a reviewer for other AGIs, once it is better than humans at reviewing AGIs. Thank you a lot for clarifying!
My pleasure. Evan Hubinger made this point to me when I’d misunderstood his scalable oversight proposal.
Thanks again for engaging with my work!