Aligning a toy model of optimization
Suppose I have a magic box Opt that takes as input a program U (mapping n-bit strings to real numbers), and produces Opt(U), an input x on which U(x) is maximal, with only n times the cost of a single evaluation of U. Could we use this box to build an aligned AI, or would broad access to such a box result in doom?
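To pin down the interface, here is a minimal sketch (in Python, with hypothetical names; nothing here goes beyond the stated contract) of what Opt is assumed to provide. The magic is entirely in the cost model, not in the signature.

```python
from typing import Callable, Tuple

# An n-bit input, represented as a tuple of 0/1 values.
Input = Tuple[int, ...]
# U maps n-bit inputs to real-valued scores.
Program = Callable[[Input], float]

def Opt(U: Program, n: int) -> Input:
    """Return an n-bit input x on which U(x) is maximal.

    The magic assumption is about cost: this call is only about n times as
    expensive as a single evaluation of U, even though there are 2**n
    candidate inputs. Treated as a black box throughout.
    """
    raise NotImplementedError("magic box: treated as a black box")
```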
This capability is vaguely similar to modern ML, especially if we use Opt to search over programs. But I think we can learn something from studying simpler models.
An unaligned benchmark
(Related.)
I can use Opt to define a simple unaligned AI (details omitted; a rough code sketch follows below):
Collect data from a whole bunch of sensors, including a “reward channel.”
Use Opt to find a program M that makes good predictions about that data.
Use Opt to find a policy A that achieves a high reward when interacting with M.
This isn’t a great design, but it works as a benchmark. Can we build an aligned AI that is equally competent?
(I haven’t described how Opt works for stochastic programs. The most natural definition is a bit complicated, but the details don’t seem to matter much. You can just imagine that it returns a random x that is within one standard deviation of the optimal expected value.)
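As a rough illustration only (the real construction omits many details, and the helper names prediction_accuracy and simulated_reward are hypothetical stand-ins), the benchmark boils down to two calls to Opt: one to fit a model to the sensor data, and one to optimize a policy against that model.

```python
def unaligned_benchmark(Opt, n_model, n_policy, prediction_accuracy, simulated_reward):
    """Sketch of the unaligned benchmark: two calls to Opt.

    prediction_accuracy(model_bits): how well the encoded program predicts the
        collected sensor data, including the reward channel.
    simulated_reward(policy_bits, model_bits): the reward the encoded policy
        obtains when interacting with the learned model.
    """
    # 1. Find a program M that makes good predictions about the data.
    M = Opt(prediction_accuracy, n_model)

    # 2. Find a policy A that achieves a high reward when interacting with M.
    A = Opt(lambda policy_bits: simulated_reward(policy_bits, M), n_policy)
    return A
```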
Competing with the benchmark
(Related.)
If I run this system with a long time horizon and a hard-to-influence reward channel, then it may competently acquire influence in order to achieve a high reward.
We’d like to use Opt to build an AI that acquires influence just as effectively, but will use that influence to give us security and resources to reflect and grow wiser, and remain responsive to our instructions.
We’d like the aligned AI to be almost as efficient. Ideally the proportional overhead would converge to 0 as we consider more complex models. At worst the overhead should be a constant factor.
Possible approach
(Related.)
My hope is to use Opt to learn a policy which can answer questions in a way that reflects “everything A knows.” This requires:
Setting up an objective that incentivizes this policy to give good answers to questions.
Arguing that there exists a suitable policy that is only slightly more complicated than A.
If we have such a policy, then we can use it to directly answer questions like “What’s the best thing to do in this situation?” The hope is:
Its answers can leverage everything A knows, and in particular all of A’s knowledge about how to acquire influence. So using it in this way is competitive with using A directly.
It knows enough about human preferences to be corrigible.
“Everything A knows” is slippery; I mean something like “what a sufficiently-idealized Bayesian would believe after updating on the fact that A achieves a high reward.” Constructing an objective which incentivizes these answers probably requires understanding the nature of that update.
Thoughts on feasibility
In the context of ML, I usually imagine training via iterated amplification. Unfortunately, iterated amplification doesn’t correspond to optimizing a single objective—it requires either training a sequence of agents or exploiting properties of local search (using the previous iterate to provide oversight for the next). If we just have Opt, it’s not clear if we can efficiently do anything like iterated amplification or debate.
If aligning Opt is impossible, I think that’s slightly bad news for aligning ML. That said, it’s reasonably likely that local search will be easier to align, so the next step would be constructing a simple model of local search.
There are also some ways in which the optimizer case seems easier:
It’s a simpler model and so more amenable to analysis. The Bayesian update from “A gets a high reward” is more straightforward when A is actually optimized.
We don’t have to worry about optimization difficulty.
Given a policy we can directly search for an input on which it behaves a certain way.
It’s OK if it’s impossible
When working on alignment I aim to either find a scalable alignment strategy or a clear argument for why scalable alignment is impossible. I’m excited about considering easy-to-analyze versions of the alignment problem even if they are impossible:
It gives us practice making impossibility arguments, and developing relevant intuitions and techniques.
It clarifies the difficulty of the alignment problem—if we know why we can’t handle simple cases like Opt, then we learn something about what the solution must look like in more complicated cases.
It gives us a sense of what impossibility results might look like, if we were able to prove them in more realistic cases. Would they actually be strong enough to guide action, or convince anyone skeptical?
Expensive optimization
I described Opt as requiring n times more compute than U. If we implemented it naively it would instead cost 2^n times more than U.
We can use this more expensive version of Opt in our unaligned benchmark, which produces an AI that we can actually run (but it would be terrible, since it does a brute force search over programs). It should be easier to compete with this really slow AI. But it’s still not trivial and I think it’s worth working on. If we can’t compete with this benchmark, I’d feel relatively pessimistic about aligning ML.
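For concreteness, the naive implementation referred to here just enumerates every n-bit input, so it costs 2^n evaluations of U rather than n. A minimal sketch, reusing the hypothetical interface above:

```python
from itertools import product
from typing import Callable, Tuple

def brute_force_opt(U: Callable[[Tuple[int, ...]], float], n: int) -> Tuple[int, ...]:
    """Naive Opt: evaluate U on every n-bit input and return an argmax.

    Costs 2**n evaluations of U, versus the n evaluations' worth of compute
    the magic box is assumed to need.
    """
    return max(product((0, 1), repeat=n), key=U)
```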
Most of this post seems to be simplified/streamlined versions of what you’ve written before. The following points seem to be new, and I have some questions:
“training a sequence of agents” is bad because it might require multiple invocations of Opt so it’s not competitive with an unaligned AI that uses Opt a small constant number of times?
Can you explain more how iterated amplification exploits properties of local search?
Is this because (or one way to think about it is) Opt corresponds to NP and iterated amplification or debate correspond to something higher in the polynomial hierarchy?
You described Opt as returning the argmax for U using only n times more compute than U, without any caveats. Surely this isn’t actually possible, because in the worst case it does require 2^n times more than U? So the only way to be competitive with the Opt-based benchmark is to make use of Opt as a black box?
Why is it easier? (If you treat them both as black boxes, the difficulty should be the same?) Is it because we don’t have to treat the slow naive version of Opt as a black box that we have to make use of, and therefore there are more things we can do to try to be competitive with it?
Why wouldn’t that just be impossible? Is it because ML occupies a different point on the speed/capability Pareto frontier, and it might be easier to build an aligned AI near that point (compared to the point that the really slow AI occupies)?
I mostly want to call attention to this similar-but-slightly-simpler problem of aligning Opt. Most of the content is pretty similar to what I’ve described in the ML case, simplified partly as an exposition thing, and partly because everything is simpler for Opt. I want this problem statement to stand relatively independently since I think it can be worked on relatively independently (especially if it ends up being an impossibility argument).
Yes.
It could be OK if the number of bits increased exponentially with each invocation (if an n bit policy is overseen by a bunch of copies of an n/2 bit policy, and so on down, then the total cost is at most 2x, since n + n/2 + n/4 + … ≤ 2n). I think it’s more likely you’ll just avoid doing anything like amplification.
At each step of local search you have some current policy and you are going to produce a new one (e.g. by taking a gradient descent step, or by generating a bunch of perturbations). You can use the current policy to help define the objective for the next one, rather than needing to make a whole separate call to Opt.
Yes.
Yes.
Yes.
Because of ML involving local search. It seems unhappy because a single step of local search feels very similar to “generate a bunch of options and see which one is best,” so it would be surprising if you could align local search without being able to align “generate a bunch of options and see which one is best.” But it might be helpful that you are doing a long sequence of steps.
That makes sense. Are you describing it as a problem that you (or others you already have in mind such as people at OpenAI) will work on, or are you putting it out there for people looking for a problem to attack?
So, something like, when training the next level agent in IDA, you initialize the model parameters with the current parameters rather than random parameters?
I will work on it at least a little, I’m encouraging others to think about it.
You don’t even need to explicitly maintain separate levels of agent. You just always use the current model to compute the rewards, and use that reward function to compute a gradient and update.
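A minimal sketch of the loop being described, with hypothetical helper names (sample_task, evaluate_with_overseer, and grad are assumptions, not anything specified in the thread): the reward for each update is computed by the current model acting as overseer, so there is no separate sequence of agents and no extra calls to Opt.

```python
def amplification_via_local_search(params, num_steps, sample_task,
                                   evaluate_with_overseer, grad, lr=1e-2):
    """Each gradient step is overseen by the current model itself.

    evaluate_with_overseer(overseer_params, candidate_params, task): reward
        assigned by (amplified copies of) the overseer model.
    grad(f, params): estimate of the gradient of f at params.
    """
    for _ in range(num_steps):
        task = sample_task()
        # The current model defines the reward function used to update itself.
        reward = lambda p: evaluate_with_overseer(params, p, task)
        params = params + lr * grad(reward, params)  # gradient ascent on reward
    return params
```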
You’re using the current model to perform subtasks of “compute the reward for the current task(s) being trained” and then updating, and local optimization ensures the update will make the model better (or at least no worse) at the task being trained, but how do you know the update won’t also make the model worse at the subtasks of “compute the reward for the current task(s) being trained”?
Is the answer something like, the current tasks being trained includes all previously trained tasks? But even then, it’s not clear that as you add more tasks to the training set, performance on previously trained tasks won’t degrade.
The idea is that “the task being trained” is something like: 50% what you care about at the object level, 50% the subtasks that occur in the evaluation process. The model may sometimes get worse at the evaluation process, or at the object level task, you are just trying to optimize some weighted combination.
There are a bunch of distinct difficulties here. One is that the distribution of “subtasks that occur in the evaluation process” is nonstationary. Another is that we need to set up the game so that doing both evaluation and the object level task is not-much-harder than just doing the object level task.
Planned summary:
Current ML capabilities are centered around **local search**: we get a gradient (or an approximation to one, as with evolutionary algorithms), and take a step in that direction to find a new model. Iterated amplification takes advantage of this fact: rather than a sequence of gradient steps on a fixed reward, we can do a sequence of amplification steps and distillation gradient steps.
However, we can consider an even simpler model of ML capabilities: function maximization. Given a function from n-bit strings to real numbers, we model ML as allowing us to find the input n-bit string with the maximum output value, using only n times the cost of a single evaluation of the function (rather than the 2^n evaluations that brute force search would take). If this were all we knew about ML capabilities, could we still design an aligned, competitive version of it? While this is not the actual problem we face, due to its simplicity it is more amenable to theoretical analysis, and so is worth thinking about.
We could make an unaligned AI that maximizes some explicit reward using only 2 calls to Opt: first, use Opt to find a good world model M that can predict the dynamics and reward, and then use Opt to find a policy that does well when interacting with M. This is unaligned for all the usual reasons: most obviously, it will try to seize control of the reward channel.
An aligned version does need to use Opt, since that’s the only way of turning a naively-exponential search into a linear one; without using Opt the resulting system won’t be competitive. We can’t just generalize iterated amplification to this case, since iterated amplification relies on a _sequence_ of applications of ML capabilities: this would lead to an aligned AI that uses Opt many times, which will not be competitive since the unaligned AI only requires 2 calls to Opt.
One possible approach is to design an AI with good incentives (in the same way that iterated amplification aims to approximate HCH) that “knows everything that the unaligned AI knows”. However, it would also be useful to produce a proof of impossibility: this would tell us something about what a solution must look like in more complex settings.
Planned opinion:
Amusingly, I liked this post primarily because comparing this setting to the typical setting for iterated amplification was useful for seeing the design choices and intuitions that motivated iterated amplification.
(ETA: This comment may not make much sense unless the reader is familiar with section 2.2 of AI safety via debate.)
At the meta level, given Opt that we have to use as a black box, it seems like:
1. Building an unaligned AI corresponds to P or at most NP
2. Verifying that a flaw in a proposed aligned AI design actually is a flaw corresponds to P
3. Checking whether an AI design is aligned (or has an alignment flaw) corresponds to NP or co-NP
4. Designing an actually aligned AI corresponds to Σ2P
5. Determining that aligned AI is impossible corresponds to Π2P
6. Determining whether there is an aligned AI that comes with a clear argument for it being aligned corresponds to NP
7. Determining whether there is a clear argument for aligned AI being impossible corresponds to NP
Does that seem right? For the impossibility part you’re proposing to do 7 but since the actual problem is closer to 5, it could easily be the case that aligned AI is impossible but there is no clear argument for it. (I.e., there’s no short argument that can convince you and you have to do the Π2P computation instead.) So I would think that if 6 is false then that actually is (probably) bad news.
As one of the authors of the paper that introduced the idea of human analogues of computational complexity classes, which I’ve found to be really interesting and useful (see here for another place that I use it), I’m curious about your thoughts on the ways I’ve used it (e.g., am I misunderstanding the idea or misusing it) and whether you have any further thoughts about it yourself, such as what kinds of problems you expect to be outside the human equivalent of NP.
Determining whether aligned AI is impossible seems harder than determining whether there is any hope for a knowably-aligned AI.
I think that when a design problem is impossible, there is often an argument for why it’s impossible. Certainly that’s not obvious though, and you might just be out of luck. (That said, it’s also not obvious that problems in NP are easier to solve than Σ2P, both contain problems you just can’t solve and so you are relying on extra structural facts in either case.)
My knowledge/experience is limited but I know in cryptography we generally can’t find arguments for why it’s impossible to design an algorithm to break a cipher, and have to rely on the fact that lots of smart people have tried to attack a cipher and nobody has succeeded. (Sometimes we can have a “security proof” in the sense of a reduction from the security of a cipher to a problem like factoring, but then we’re still relying on the fact that lots of smart people have tried to design faster factoring algorithms and nobody has succeeded.) Note that this is only a Π1P or co-NP problem, whereas determining that aligned AI is impossible is in Π2P according to my analysis.
I guess that may be true in a raw compute sense, but there’s also the issue of how you convince people that you’ve solved a problem. Again in crypto,
NP problems (find a flaw in a cipher) get solved by individual researchers,
co-NP problems (determine a cipher is secure) is “solved” by people trying and failing to break a cipher,
Σ2P problems (design a fast cipher free from security flaws) get solved by government-sponsored contests that the whole field participates in, and
Π2P problems (determine that any cipher faster than X can’t be secure) are left unsolved. (The fact that people tried and failed to design a secure cipher faster than X is really weak evidence that it’s impossible and of course nobody knows how to prove anything like this.)
Use Opt to find a language model. The hope is to make it imitate a human researcher’s thought process fast enough that the imitation can attempt to solve the AI alignment problem for us.
Use Opt to find a proof that generating such an imitation will not lead to a daemon’s treacherous turn, defined as the model’s predictions disagreeing with a large enough Solomonoff-weighted fraction of its competitors. The hope is that the consequentialist portion of the hypothesis space is not large and cooperative/homogeneous enough to form a single voting bloc that bypasses the daemon alarm.
I suggest as a first step, we should just aim for an uncompetitive aligned AI, one that might use a lot more training data, or many more invocations of Opt than the benchmark. (If we can’t solve that, that seems fairly strong evidence that a competitive aligned AI is impossible or beyond our abilities. Or if someone proposes a candidate and we can’t decide whether it’s actually aligned or not, that would also be very useful strategic information that doesn’t require the candidate to be competitive.)
Do you already have a solution to the uncompetitive aligned AI problem that you can sketch out? It sounds like you think iterated amplification or debate can be implemented using Opt (in an uncompetitive way), so maybe you can give enough details about that to either show that it is aligned or provide people a chance to find flaws in it?
The point of working in this setting is mostly to constrain the search space or make it easier to construct an impossibility argument.
If dropping competitiveness, what counts as a solution? Is “imitate a human, but run it fast” fair game? We could try to hash out the details in something along those lines, and I think that’s worthwhile, but I don’t think it’s a top priority and I don’t think the difficulties will end up being that similar. I think it may be productive to relax the competitiveness requirement (e.g. to allow solutions that definitely have at most a polynomial slowdown), but probably not a good idea to eliminate it altogether.
I’m not sure, but mainly because I’m not sure what counts as a solution to your problem. If we had a specification of that, couldn’t we just remove the parts that deal with competitiveness?
I guess not, because a human imitation might have selfish goals and not be intent aligned to the user?
What about my suggestion of hashing out the details of how to implement IDA/DEBATE using Opt and then seeing if we can decide whether or not it’s aligned?
Not much of an impossibility argument, but I just want to point out that any solution that involves Opt outputting a program should somehow deal with the concern that the program might contain inner optimizers. This aspect seems to me very hard (unless we somehow manage to conclude that the smallest/fastest program that computes some function does not contain inner optimizers).
ETA: the term “inner optimizer” is deprecated in favor of “mesa-optimizer”.
If we can test whether the model is behaving badly on a given input, then we can use Opt to search for any input where the model behaves badly. So we can end up with a system that works well on distribution and doesn’t work poorly off distribution. If it’s possible to handle outer alignment in this setting, I expect inner alignment will also be possible. (Though it’s not obvious, since you might take longer to learn given an unaligned learned optimizer.)
In the case of deceptive alignment, our ability to test whether the model behaves badly on input x affects the behavior of the model on input x (and similarly, our ability to come up with a U such that Opt(U) finds an input x on which the model behaves badly, whenever such an x exists, affects the behavior of the model on input x).
Therefore, to the extent that deceptive alignment is plausible in programs that Opt outputs, the inner alignment problem seems to me very hard.
That seems fine though. If the model behaves badly on any input we can test that. If the model wants to behave well on every input, then we’re happy. If it wants to behave badly on some input, we’ll catch it.
Are you concerned that we can’t test whether the model is behaving badly on a particular input? I think if you have that problem you are also in trouble for outer alignment.
When you say “test” do you mean testing by writing a single program that outputs whether the model performs badly on a given input (for any input)?
If so, I’m concerned that we won’t be able to write such a program.
If not (i.e. if we only assume that human researchers can safely figure out whether the model behaves badly on a given input), then I don’t understand how we can use Opt to find an input that the model behaves badly on (in a way that would work even if deceptive alignment occurs).
That’s the hope. (Though I assume we mostly get it by an application of Opt, or more efficiently by modifying our original invocation of Opt to return a program with some useful auxiliary functions, rather than by writing it by hand.)
(I’m sure this point is obvious to Paul, but it wasn’t to me)
We can search for inputs on which a policy behaves badly, which is really helpful for verifying the worst case of a certain policy. But we can’t search for a policy which has a good worst case, because that would require using the black box inside the function passed to the black box, which we can’t do. I think you can also say this as “the black box is an NP oracle, not a Σ2P oracle”.
This still means that we can build a system which in the worst case does nothing, rather than in the worst case is dangerous: we do whatever thing to get some policy, then we search for an input on which it behaves badly, and if one exists we don’t run the policy.
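A minimal sketch of that last point, reusing the hypothetical Opt interface from above. Here behaves_badly is an assumed predicate we can evaluate on a given (policy, input) pair, which is exactly the ability being assumed in the comments above.

```python
def deploy_if_worst_case_ok(Opt, policy, behaves_badly, n):
    """Use one call to Opt to search for an input on which the policy behaves badly.

    If such an input exists, refuse to run the policy (worst case: do nothing);
    otherwise the worst-case check passed and we return the policy.
    """
    # Score inputs by whether the policy behaves badly on them, so the argmax
    # is a bad input whenever one exists.
    U = lambda x: 1.0 if behaves_badly(policy, x) else 0.0
    worst_input = Opt(U, n)
    return None if behaves_badly(policy, worst_input) else policy
```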
What is pi here?