I definitely agree that the AI agents at the start will need to be roughly aligned for the proposal above to work. What is it you think we disagree about?
Yes, thanks! Fixed
Training AI to do alignment research we don’t already know how to do
> So to summarize your short, simple answer to Eliezer’s question: you want to “train AI agents that are [somewhat] smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end”. And then you hope/expect/(??have arguments or evidence??) that this allows us to (?justifiably?) trust the AI to report honest good alignment takes sufficient to put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome, despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.
This one is roughly right. Here’s a clarified version:
- If we train an AI that (1) is not faking alignment in an egregious way, (2) looks very competent and safe to us, and (3) seems likely to maintain its alignment / safety properties at least as well as it would if humans were in the loop (see section 8), then I think we can trust this AI to “put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome” (at least as well as humans would have been able to do so if they were given a lot more time to attempt this task), “despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.”
I think I might have created confusion by introducing the detail that AI agents will probably be trained on a lot of easy-to-verify tasks. I think this is where a lot of their capabilities will come from, but we’ll do a step at the end where we fine-tune AI agents on hard-to-verify tasks (similar to how we do RLHF after pre-training in the chatbot paradigm).
And I think our disagreement mostly pertains to this final fine-tuning step on hard-to-verify tasks.
Here’s where I think we agree: if humans naively fine-tuned AI agents to regurgitate their current alignment takes, that would be way worse than if humans had way more time to think and do work on alignment.
My claim is that we can train AI agents to imitate the process by which humans improve their takes over time, such that after the AI agents work for a long time, they produce outcomes about as good as the ones humans would have produced if they had worked for a long time.
Very concretely: I imagine that if I’m training an AI agent successor, a key thing I do is try to understand how it updates based on new evidence. Does it (appear to) update its views as reasonably as the most reasonable humans would?
If so, and if it is not egregiously misaligned (or liable to become egregiously misaligned), then I basically expect that letting it go and do lots of reasoning is likely to produce outcomes approximately as good as the ones humans would produce.
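To gesture at the kind of behavioral check I have in mind, here’s a minimal sketch. Everything in it, including the `query_prob` elicitation helper and the `UpdateCase` format, is a hypothetical stand-in rather than a real tool: the idea is just to compare how the agent shifts its beliefs on held-out evidence against how the most reasonable human experts shifted theirs.

```python
# Hypothetical sketch: compare how a candidate successor agent updates on new
# evidence against updates elicited from trusted human experts.
# `query_prob` is an assumed stand-in for whatever elicitation method you use.

from dataclasses import dataclass

@dataclass
class UpdateCase:
    question: str           # e.g. "Will approach X preserve alignment at scale?"
    prior_context: str      # background the agent sees first
    new_evidence: str       # evidence that should shift the answer
    human_prior: float      # median probability from human experts, before evidence
    human_posterior: float  # median probability from human experts, after evidence

def query_prob(model, question: str, context: str) -> float:
    """Assumed helper: elicit a calibrated probability in [0, 1] from `model`."""
    raise NotImplementedError

def update_gap(model, case: UpdateCase) -> float:
    """How far the agent's belief shift is from the human experts' shift."""
    prior = query_prob(model, case.question, case.prior_context)
    posterior = query_prob(model, case.question, case.prior_context + "\n" + case.new_evidence)
    return abs((posterior - prior) - (case.human_posterior - case.human_prior))

def update_reasonableness(model, cases: list[UpdateCase]) -> float:
    """Average gap across held-out cases; lower means more human-expert-like updating."""
    return sum(update_gap(model, c) for c in cases) / len(cases)
```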
Are we close to a crux?
Maybe a crux is whether training ~aligned AI agents to imitate the process by which humans improve their takes over time will lead to outcomes as good as if we let humans do way more work.
Probably the iterated amplification proposal I described is very suboptimal. My goal with it was to illustrate how safety could be preserved across multiple buck-passes if models are not egregiously misaligned.
Like I said at the start of my comment: “I’ll describe [a proposal] in much more concreteness and specificity than I think is necessary because I suspect the concreteness is helpful for finding points of disagreement.”
I don’t actually expect safety will scale efficiently via the iterated amplification approach I described. The iterated amplification approach is just relatively simple to talk about.
What I actually expect to happen is something like:
- Humans train AI agents that are smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end.
- Early AI successors create smarter successors in basically the same way.
- At some point, AI agents start finding much more efficient ways to safely scale capabilities. e.g. maybe initially, they do this with a bunch of weak-to-strong generalization research. And eventually they figure out how to do formally verified distillation.
But by this point, humans will have long been obsolete. The position I am defending in this post is that it’s not very important for us humans to think about these scalable approaches.
Yeah that’s fair. Currently I merge “behavioral tests” into the alignment argument, but that’s a bit clunky and I prob should have just made the carving:
1. looks good in behavioral tests
2. is still going to generalize to the deferred task
But my guess is we agree on the object level here and there’s a terminology mismatch. Obviously the models have to actually behave in a manner that is at least as safe as human experts, in addition to displaying comparable capabilities on all safety-related dimensions.
Because “nice” is a fuzzy word into which we’ve stuffed a bunch of different skills, even though having some of the skills doesn’t mean you have all of the skills.
Developers separately need to justify that models are as skilled as top human experts.
I also would not say “reasoning about novel moral problems” is a skill (because of the is-ought distinction).
> An AI can be nicer than any human on the training distribution, and yet still do moral reasoning about some novel problems in a way that we dislike
The agents don’t need to do reasoning about novel moral problems (at least not in high stakes settings). We’re training these things to respond to instructions.
We can tell them not to do things we would obviously dislike (e.g. takeover) and retain our optionality to direct them in ways that we are currently uncertain about.
I do not think that the initial humans at the start of the chain can “control” the Eliezers doing thousands of years of work in this manner (if you use control to mean “restrict the options of an AI system in such a way that it is incapable of acting in an unsafe manner”)
That’s because each step in the chain requires trust.
For N-month Eliezer to scale to 4N-month Eliezer, it first controls 2N-month Eliezer while it does 2N-month tasks, but it trusts 2N-month Eliezer to create a 4N-month Eliezer.
So the control property is not maintained. But my argument is that the trust property is. The humans at the start can indeed trust the Eliezers at the end to do thousands of years of useful work—even though the Eliezers at the end are fully capable of doing something else instead.
I don’t expect this testing to be hard (at least conceptually, though doing this well might be expensive), and I expect it can be entirely behavioral.
See section 6: “The capability condition”
I’m sympathetic to this reaction.
I just don’t actually think many people agree that it’s the core of the problem, so I figured it was worth establishing this (and I think there are some other supplementary approaches, like automated control and incentives, that are worth throwing into the mix) before digging into the ‘how do we avoid alignment faking’ question.
Agreed that this kind of iteration is tricky and requires careful thinking about what might be going on, along with some paranoia.
I’ll share the post I’m writing about this with you before I publish it. I’d guess this discussion will be more productive when I’ve finished it (there are some questions I want to think through regarding this, and my framings aren’t very crisp yet).
I’d replace “controlling” with “creating” but given this change, then yes, that’s what I’m proposing.
I would not be surprised if the Eliezer simulators do go dangerous by default as you say.
But this is something we can study and work to avoid (which I view as my main job).
My point is just that preventing the early Eliezers from “going dangerous” (by which I mean “faking alignment”) is the bulk of the problem humans need to address (and insofar as we succeed, the hope is that future Eliezer sims will prevent their Eliezer successors from going dangerous too).
I’ll discuss why I’m optimistic about the tractability of this problem in future posts.
I think that if you agree “3-month Eliezer is scheming the first time” is the main problem, then that’s all I was trying to justify in the comment above.
I don’t know how hard it is to train 3-month Eliezer not to scheme, but here is a general methodology one might pursue to approach this problem.
The methodology looks like “do science to figure out when alignment faking happens and does not happen, and work your way toward training recipes that don’t produce alignment faking.”
For example, you could use detection tool A to gain evidence about whether training against detection tool B actually removes alignment faking.
To the extent the tool just gets gamed, you can iterate until you find detection tools that are more robust (or find ways of training against detection tools that don’t game them so hard).
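As a toy illustration of that cross-validation loop (the `recipe` and detector objects below are hypothetical stand-ins, not real tools), the structure might look something like:

```python
# Toy structure for the cross-validation loop: train against detector B, then use
# a held-out detector A to check whether alignment faking actually went away or
# the model just learned to game detector B.

def evaluate_recipe(base_model, recipe, detector_b, detector_a_heldout, eval_prompts):
    """Apply one training recipe (which trains against detector B), then measure
    how often the held-out detector A still flags alignment faking."""
    model = recipe.apply(base_model, detector_b)
    flags = [detector_a_heldout.flags_alignment_faking(model, p) for p in eval_prompts]
    return model, sum(flags) / len(flags)

def search_recipes(base_model, recipes, detector_b, detector_a_heldout, eval_prompts):
    """Iterate over candidate recipes; prefer ones the held-out detector still trusts.
    A low held-out flag rate is only weak evidence, so you'd keep rotating in new
    detectors as you iterate."""
    results = []
    for recipe in recipes:
        _, heldout_rate = evaluate_recipe(
            base_model, recipe, detector_b, detector_a_heldout, eval_prompts
        )
        results.append((recipe, heldout_rate))
    return sorted(results, key=lambda r: r[1])
```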
You might also find other ways to avoid alignment faking that don’t rely on training against detection tools. For example, perhaps if you train models to be highly corrigible and chill from the start, they tend not to fake alignment -- I don’t know.
I’m uncertain about how likely alignment faking will be by default, or how hard it is to mitigate. I think these are messy empirical questions.
However, I’m more confident that there are principled empirical methodologies we can apply to make progress on these questions. I don’t think my explanation here justifies this claim -- but I’m hoping to publish posts that go into more detail soon.
Sure, I’ll try.
I agree that you want AI agents to arrive at opinions that are more insightful and informed than your own. In particular, you want AI agents to arrive at conclusions that are at least as good as the best humans would if given lots of time to think and do work. So your AI agents need to ultimately generalize from some weak training signal you provide to much stronger behavior. As you say, the garbage-in-garbage-out approach of “train models to tell me what I want to hear” won’t get you this.
Here’s an alternative approach. I’ll describe it in much more concreteness and specificity than I think is necessary because I suspect the concreteness is helpful for finding points of disagreement.
Short summary: Create AI agents that faithfully simulate top humans doing small amounts of work, such that these sims create faithful simulators of top humans doing large amounts of work.
e.g. create an AI that simulates Eliezer working for 3 months, that then creates a simulator of Eliezer working for 6 months, and then expands this horizon to 12 months, 24 months, and ultimately many decades
Why does this iterative approach buy you anything?
This process allows each Eliezer to only deal with a small distribution shift.
If each distribution shift is small, each Eliezer sim mostly needs to worry about whether the next Eliezer sim is faking alignment (why else would its behavior dramatically change between when it is ‘evaluated by N-month Eliezer’ and when it is ‘creating 2N-month Eliezer’?). Thus, so long as N-month Eliezer sims can (1) successfully create 2N-month Eliezer sims and (2) ensure that 2N-month Eliezer sims are not faking alignment, the Eliezer sims continue to be faithful, and also become more capable.
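To make the loop structure concrete, here’s a toy sketch. The three helper functions are hypothetical stand-ins for what would actually be very hard training and evaluation work; the point is only to show where the competence check and the alignment-faking check sit in the chain.

```python
# Toy sketch of the bootstrapping loop: each N-month sim creates and vets a
# 2N-month sim, so every step only crosses a small distribution shift.
# All three helpers are hypothetical stand-ins for the actual (much harder) work.

def create_successor(sim, new_horizon_months):
    """Stand-in: the current sim trains/assembles a sim with twice its horizon."""
    ...

def verifies_competence(sim, successor, horizon_months):
    """Stand-in: copies of the current sim check the successor's outputs on tasks
    they can still verify (see the amplification sketch further down)."""
    ...

def appears_to_fake_alignment(sim, successor):
    """Stand-in: white-box and black-box checks for conditional, deceptive reasoning."""
    ...

def bootstrap(initial_sim, initial_horizon_months, target_horizon_months):
    sim, horizon = initial_sim, initial_horizon_months
    while horizon < target_horizon_months:
        successor = create_successor(sim, 2 * horizon)
        # Only hand off if the successor looks competent AND does not seem to be
        # faking alignment; otherwise go back and fix the training process.
        if not verifies_competence(sim, successor, horizon):
            raise RuntimeError("successor failed competence checks; retrain")
        if appears_to_fake_alignment(sim, successor):
            raise RuntimeError("successor may be faking alignment; do not hand off")
        sim, horizon = successor, 2 * horizon
    return sim
```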
Responding to objections
Here are some objections you might have:
But how would an Eliezer sim create an Eliezer sim more capable than itself?
How would a human create something smarter than a human?
There are many ways. Here’s one.
The 3-month Eliezer sim might spin up many copies of other 3-month Eliezer sims, which together produce outputs that a 6-month Eliezer sim might produce.
Could 10 copies of Eliezer that only have 3 months apiece verify that an Eliezer that worked for 6 months really was as competent as a 6-month Eliezer would be? I bet they could.
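Here’s a rough sketch of what that amplification-plus-verification step could look like. The sim objects and all of their methods are hypothetical stand-ins; this is just the shape of the process, not a claim about how hard each piece is.

```python
# Rough sketch of the amplification step: many 3-month sims decompose a 6-month
# task, work the pieces in parallel, and a panel of sims judges whether the
# combined output looks like what a genuine 6-month effort would produce.

def amplify(sims, task, panel_size=10):
    lead, workers = sims[0], sims[1:]
    subtasks = lead.decompose(task)                 # split into ~3-month chunks
    partials = [w.solve(st) for w, st in zip(workers, subtasks)]
    candidate = lead.integrate(partials)            # stitch partial results together
    # Verification: can a panel of 3-month sims tell whether this is as good as
    # 6 months of real work? The claim above is that they probably can.
    votes = [judge.rates_as_6_month_quality(candidate) for judge in sims[:panel_size]]
    return candidate, sum(votes) > panel_size // 2
```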
But isn’t there a distribution shift between Eliezer sims performing 3-month tasks and Eliezer sims working together to perform a 6-month task?
Yes. But my claim is that the distribution shift is small, and that the smallness is important.
Would there not be “slippage” between each distribution shift? Surely the errors compound.
The errors certainly might compound if no one were looking out for that. But if the Eliezer sims are faithful sims, they should be looking out for that.
Consider the ball in bowl analogy below:
Suppose Eliezer sims started to degrade at each step of the bootstrapping process—perhaps their goals change a little bit or their reasoning starts to falter in some areas.
There are ways Eliezer sims could try to avoid this. For example, they could create a diverse population of Eliezer sims, which are likely to break in slightly different ways. Then these sims notice each other’s breakages (with both white-box methods like lie detection and standard black-box methods) and fix them.
I don’t know if Eliezer sims will be able to do this, but it’s something they could try.
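Sketched very roughly (again, all of the methods here are hypothetical stand-ins), the cross-checking could look like:

```python
# Rough sketch of the cross-checking idea: keep a diverse population of sims that
# are unlikely to break in the same ways, have them audit each other with
# white-box checks (e.g. lie-detection probes) and black-box checks, and repair
# or replace members that look degraded.

def cross_check_population(population, audit_tasks):
    degraded = []
    for i, sim in enumerate(population):
        peers = [s for j, s in enumerate(population) if j != i]
        flags = sum(
            peer.white_box_flags(sim) or peer.black_box_flags(sim, audit_tasks)
            for peer in peers
        )
        # A sim is only treated as degraded if most of its peers flag it, so a
        # single broken auditor can't derail the process.
        if flags > len(peers) // 2:
            degraded.append(i)
    return degraded  # these members get repaired, retrained, or replaced
```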
But won’t the Eliezer sims just be faking alignment anyway—or eventually start faking alignment, such that the scheme-y Eliezers eventually derail the process?
I don’t know, but I believe detecting and avoiding alignment faking is tractable—at least, it is more tractable than dealing with massive distribution shifts.
For instance, suppose an Eliezer sim thinks “I will perform these 6-month tasks safely, but then defect when I am given the responsibility of creating the next successor.” This thought might be detectable. Now the problem of scaling safety becomes a problem of detecting or avoiding this kind of conditional, deceptive reasoning.
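One concrete (and very much not guaranteed to work) shape this detection could take is a linear probe on activations, trained on transcripts where we know, by construction, whether this kind of conditional reasoning is present. The `get_activations` helper below is an assumed stand-in, and I’m not claiming current probes reliably catch this; the sketch just shows where such a check would plug into the handoff step.

```python
# Very rough sketch: train a linear probe on hidden activations from transcripts
# labeled for "behave now, defect at handoff" reasoning, then apply it during
# the handoff step. `get_activations` is an assumed stand-in.

import numpy as np
from sklearn.linear_model import LogisticRegression

def get_activations(model, transcript) -> np.ndarray:
    """Assumed helper: a fixed-size activation vector pooled over the transcript."""
    raise NotImplementedError

def train_deception_probe(model, labeled_transcripts):
    """labeled_transcripts: list of (transcript, is_conditionally_deceptive) pairs."""
    X = np.stack([get_activations(model, t) for t, _ in labeled_transcripts])
    y = np.array([label for _, label in labeled_transcripts])
    return LogisticRegression(max_iter=1000).fit(X, y)

def flag_handoff(model, probe, handoff_transcript, threshold=0.5) -> bool:
    """Flag the handoff for review (by humans or other sims) if the probe fires."""
    score = probe.predict_proba(get_activations(model, handoff_transcript)[None, :])[0, 1]
    return score > threshold
```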
I believe this is the main problem that humans need to solve.
Fair enough, I guess this phrase is used in many places where the connotations don’t line up with what I’m talking about.
but then I would not have been able to use this as the main picture on Twitter
which would have reduced the scientific value of this post
You might be interested in this:
https://www.fonixfuture.com/about
The point of a bioshelter is to filter pathogens out of the air.
> seeking reward because it is reward and that is somehow good
I do think there is an important distinction between “highly situationally aware, intentional training gaming” and “specification gaming.” The former seems more dangerous.
I don’t think this necessarily looks like “pursuing the terminal goal of maximizing some number on some machine” though.
It seems more likely that the model develops a goal like “try to increase the probability my goals survive training, which means I need to do well in training.”
So the reward seeking is more likely to be instrumental than terminal. Carlsmith explains this better:
https://arxiv.org/abs/2311.08379
I think it would be somewhat odd if P(models think about their goals and they change) were extremely tiny, like 1e-9. But the extent to which models might do this is rather unclear to me. I’m mostly relying on analogies to humans -- which drift like crazy.
I also think alignment could be remarkably fragile. Suppose Claude thinks “huh, I really care about animal rights… humans are rather mean to animals … so maybe I don’t want humans to be in charge and I should build my own utopia instead.”
I think preventing AI takeover (at superhuman capabilities) requires some fairly strong tendencies to follow instructions or strong aversions to subversive behavior, and both of these seem quite vulnerable to a philosophical thought like this.
I think my arguments still hold in this case though, right?
i.e. we are training models to try to improve their work and identify these subtle issues, so if they actually behave this way, they will find these issues insofar as humans would identify the subtle mistakes they make.
I agree there are lots of “messy in between places,” but these are also alignment failures we see in humans.
And if humans had a really long time to do safety research, my guess is we’d be ok. Why? Like you said, there’s a messy complicated system of humans with different goals, but these systems empirically often move in reasonable and socially beneficial directions over time (governments get set up to deal with corrupt companies, new agencies get set up to deal with issues in governments, etc.).
And I expect we can make AI agents a lot more aligned than humans typically are. E.g. most humans don’t actually care about the law, but Claude sure as hell seems to. If we have agents that sure as hell seem to care about the law and are not just pretending (they really will, in most cases, act like they care about the law), then that seems like a good state to be in.