Re jailbreaks: I think this is not an example of alignment being unsolved, but rather an example of how easy it is to misuse/control LLMs.
Also, a lot of jailbreak successes rely on the fact that the model has been trained to accept a very wide range of requests for deployment reasons, which suggests that narrowing the domain of acceptable questions for internal use could dramatically reduce the space of jailbreaks:
Current jailbreaks of chatbots often work by exploiting the fact that chatbots are trained to indulge a bewilderingly broad variety of requests—interpreting programs, telling you a calming story in character as your grandma, writing fiction, you name it. But a model that’s just monitoring for suspicious behavior can be trained to be much less cooperative with the user—no roleplaying, just analyzing code according to a preset set of questions. This might substantially reduce the attack surface.
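As a minimal sketch of what such a narrowed monitor might look like (the question list, function names, and refusal behavior here are all illustrative assumptions, not a real system):

```python
# Illustrative sketch: a wrapper that only runs a fixed, preset list of
# code-analysis questions and refuses everything else. ALLOWED_QUESTIONS
# and analyze() are hypothetical names used purely for illustration.

ALLOWED_QUESTIONS = {
    "does_this_code_exfiltrate_data",
    "does_this_code_modify_permissions",
    "does_this_code_contact_external_hosts",
}

def analyze(question_id: str, code: str) -> str:
    """Run the monitor on one preset question; refuse anything else."""
    if question_id not in ALLOWED_QUESTIONS:
        # No roleplay, no open-ended requests: the attack surface is
        # limited to the preset question set.
        return "REFUSED: question not in preset list"
    # A real system would call the monitoring model here with a fixed
    # prompt template for this question; we stub that call out.
    return f"ANALYSIS({question_id}): {len(code)} bytes reviewed"

print(analyze("tell_me_a_story_as_grandma", "x = 1"))
print(analyze("does_this_code_exfiltrate_data", "x = 1"))
```

The point of the wrapper is that a "grandma story" jailbreak never reaches the model at all: anything outside the preset list is rejected before a prompt is even constructed.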
I have three other concrete concerns about this strategy. If I understand it correctly, the plan is for humans to align the first AGI, then for that AGI to align the next one, and so forth (until ASI).
What if the strategy breaks at the first step? What if the first AGI turns out to be deceptive (scheming) and only pretends to be aligned with humans? It seems that if we task such a deceptive AGI with aligning other AGIs, we will end up with a pyramid of misaligned AGIs.
What if the strategy breaks further down the line? What if AGI #21 accidentally aligns AGI #22 to be deceptive (scheming)? Would there be any fallback mechanisms we could rely on?
What is the end goal? Do we stop once we achieve ASI? Can we stop once we achieve ASI? What if the ASI doesn't agree and instead opts to continue self-improving? Will we be able to get to a point where the acceleration of ASI's intelligence plateaus, so we can recuperate and plan for the future?
We die (don't fuck this step up! :)
Unless we still have adequate mech interp or a natural-language train of thought to detect deceptive alignment.
We die (don't let your AGI fuck this step up! :)
22 chained independent alignment attempts does sound like too much. Hubinger specified that he wasn't thinking of daisy-chaining like that, but of one trusted agent that keeps itself aligned as it grows smarter.
The endgame is to use intent alignment as a stepping stone to value alignment and let something more competent and compassionate than us monkeys handle things from there on out.
The first concern is absolutely critical. One way to break the circularity is to rely on AI control; another is to set up incentives that favor alignment as an equilibrium and make dishonesty/misalignment unfavorable, in the sense that there is no continuously rewarding path to misalignment.
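As a toy illustration of the "no continuously rewarding path to misalignment" condition (the reward numbers and the check itself are made-up assumptions, purely to make the idea concrete):

```python
# Toy model: reward received at each training step along two behavioral
# paths. The incentive-design goal sketched above is that any deceptive
# path should contain at least one step that does strictly worse than
# staying honest, so greedy reward-following never drifts onto it.

honest_path    = [1.0, 1.1, 1.2, 1.3]   # hypothetical per-step rewards
deceptive_path = [1.0, 1.2, 0.4, 2.0]   # large payoff only after a dip

def continuously_rewarding(path, baseline):
    """True iff every step of `path` does at least as well as the
    honest baseline at the same step (i.e. there is no deterrent dip)."""
    return all(p >= b for p, b in zip(path, baseline))

print(continuously_rewarding(deceptive_path, honest_path))  # → False
```

In this toy framing, the designer's job is to shape rewards so that every deceptive path has a dip like the one at step 3, even if its eventual payoff is higher.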
The second issue is less critical: assuming AGI #21 hasn't itself become deceptively aligned, we can throw away #22 and restart from a fresh training run.
If that's no longer an option, we can go to war against the misaligned AGI with our own AGI forces.
In particular, you can still do a whole lot of automated research once labor bottlenecks are broken; this is a slowdown, but not a fatal one, so we can work around it.
On the third issue: if we have achieved aligned ASI, then we have achieved our goal. Once humans are obsolete at making alignment advances, that's when we can say the end goal has been reached.
I agree with the comments by both you and Seth. I guess that isn't really part of alignment as usually understood. However, I think it is part of a broader "prevent AI from killing humans" strategy, so it's still pretty important for our main goal.
I am not exactly sure I understand your proposal. Are you proposing that we radically gut our leading future model by restricting it severely? I don't think any AI lab would agree to do that, because such a future AI would probably be less useful than even current AIs.
Or are you proposing that we use AI monitors to watch our leading future AI models, and then heavily restrict only the monitors?
My proposal is to restrict only the AI monitor's domain.
I agree this is a reduction in capability relative to an unconstrained AI, but at least in the internal-use setting (rather than deployment), you probably don't need, and may not even want, the monitor to be able to write fiction or tell calming stories; you just want it to perform specific, narrow tasks.
That's pretty interesting. I do think that if the iterative alignment strategy ends up working, then this will probably work too (if nothing else, because it seems much easier).
I still have some concerns about the iterative alignment strategy in general, so I will try to write them down below.
EDIT: On second thought, I might create a separate question for this (and link it here), for the benefit of everyone who is concerned about the same (or similar) things.
That would be great. Do reference scalable oversight to show you’ve done some due diligence before asking to have it explained. If you do that, I think it would generate some good discussion.
Sure, I might as well ask my question directly about scalable oversight, since it seems like a leading strategy for iterative alignment anyway. I do have one preliminary question (which probably isn't worth including in that post, given that it doesn't ask about a specific issue or threat model, but rather about people's expectations).
I take it this strategy relies on evaluating research being easier than producing it? Do you expect that to be the case?
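For a concrete (if far simpler than research) instance of the generation/evaluation asymmetry this question gestures at, consider factoring: finding a factor takes a search, while checking a proposed factor is a single operation. The analogy to research evaluation is an assumption of the scalable-oversight framing, not a guarantee:

```python
# Toy illustration of the generation/evaluation gap: finding a factor of n
# requires trial division up to sqrt(n), while checking a claimed factor
# is one modulo operation.

def find_factor(n: int) -> int:
    """Generation: search for the smallest nontrivial factor (slow)."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n  # n is prime

def check_factor(n: int, d: int) -> bool:
    """Evaluation: verify a claimed nontrivial factor (one operation)."""
    return 1 < d < n and n % d == 0

n = 101 * 103  # 10403
d = find_factor(n)
print(d, check_factor(n, d))  # → 101 True
```

Scalable oversight hopes that judging AI-produced alignment research is more like `check_factor` than `find_factor`; whether that holds for open-ended research is exactly the open question.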