Most of this post seems to be simplified/streamlined versions of what you’ve written before. The following points seem to be new, and I have some questions:
Unfortunately, iterated amplification doesn’t correspond to optimizing a single objective U—it requires either training a sequence of agents or exploiting properties of local search (using the previous iterate to provide oversight for the next).
“training a sequence of agents” is bad because it might require multiple invocations of Opt so it’s not competitive with an unaligned AI that uses Opt a small constant number of times?
Can you explain more how iterated amplification exploits properties of local search?
If we just have Opt, it’s not clear if we can efficiently do anything like iterated amplification or debate.
Is this because (or one way to think about it is) Opt corresponds to NP and iterated amplification or debate correspond to something higher in the polynomial hierarchy?
I described Opt as requiring n times more compute than U. If we implemented it naively it would instead cost 2^n times more than U.
You described Opt as returning the argmax for U using only n times more compute than U, without any caveats. Surely this isn’t actually possible because in the worst case it does require 2^n times more than U? So the only way to be competitive with the Opt-based benchmark is to make use of Opt as a black box?
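(To make the 2^n figure concrete, here is a rough brute-force sketch; treating U as a Python callable over n-bit strings is just my illustrative framing, not anything from the post. Enumerating every n-bit string costs 2^n evaluations of U, versus the ~n-times-U budget assumed for the black-box Opt.)

```python
# Rough sketch, illustrative only: the "naive" implementation of Opt
# enumerates all 2^n candidate n-bit strings and evaluates U on each.
from itertools import product

def naive_opt(U, n):
    """Brute-force argmax over n-bit strings: 2^n evaluations of U."""
    best_x, best_score = None, float("-inf")
    for bits in product((0, 1), repeat=n):
        score = U(bits)
        if score > best_score:
            best_x, best_score = bits, score
    return best_x
```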
It should be easier to compete with this really slow AI. But it’s still not trivial and I think it’s worth working on.
Why is it easier? (If you treat them both as black boxes, the difficulty should be the same?) Is it because we don’t have to treat the slow naive version of Opt as a black box that we have to make use of, and therefore there are more things we can do to try to be competitive with it?
If we can’t compete with this benchmark, I’d feel relatively pessimistic about aligning ML.
Why wouldn’t it just be impossible? Is it because ML occupies a different point on the speed/capability Pareto frontier and it might be easier to build an aligned AI near that point (compared to the point that the really slow AI occupies)?
Most of this post seems to be simplified/streamlined versions of what you’ve written before.
I mostly want to call attention to this similar-but-slightly-simpler problem of aligning Opt. Most of the content is pretty similar to what I’ve described in the ML case, simplified partly as an exposition thing, and partly because everything is simpler for Opt. I want this problem statement to stand relatively independently since I think it can be worked on relatively independently (especially if it ends up being an impossibility argument).
“training a sequence of agents” is bad because it might require multiple invocations of Opt so it’s not competitive with an unaligned AI that uses Opt a small constant number of times?
Yes.
It could be OK if the number of bits increased exponentially with each invocation (if an n-bit policy is overseen by a bunch of copies of an n/2-bit policy, then the total cost is 2x). I think it’s more likely you’ll just avoid doing anything like amplification.
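To spell out the arithmetic behind the 2x figure (assuming, as a simplification on my part, that a call to Opt over k-bit policies costs about k times the per-evaluation cost of U, written c_U), the levels form a geometric series:

$$n\,c_U + \tfrac{n}{2}\,c_U + \tfrac{n}{4}\,c_U + \cdots = 2n\,c_U,$$

i.e. the whole tower costs about twice as much as the single n-bit invocation alone.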
Can you explain more how iterated amplification exploits properties of local search?
At each step of local search you have some current policy and you are going to produce a new one (e.g. by taking a gradient descent step, or by generating a bunch of perturbations). You can use the current policy to help define the objective for the next one, rather than needing to make a whole separate call to Opt.
Is this because (or one way to think about it is) Opt corresponds to NP and iterated amplification or debate correspond to something higher in the polynomial hierarchy?
Yes.
You described Opt as returning the argmax for U using only n times more compute than U, without any caveats. Surely this isn’t actually possible because in the worst case it does require 2^n times more than U? So the only way to be competitive with the Opt-based benchmark is to make use of Opt as a black box?
Yes.
Why is it easier? (If you treat them both as black boxes, the difficulty should be the same?) Is it because we don’t have to treat the slow naive version of Opt as a black box that we have to make use of, and therefore there are more things we can do to try to be competitive with it?
Yes.
Why wouldn’t it just be impossible? Is it because ML occupies a different point on the speed/capability Pareto frontier and it might be easier to build an aligned AI near that point (compared to the point that the really slow AI occupies)?
Because ML involves local search. That seems like an unhappy situation, because a single step of local search feels very similar to “generate a bunch of options and see which one is best,” so it would be surprising if you could align local search without being able to align “generate a bunch of options and see which one is best.” But it might be helpful that you are doing a long sequence of steps.
I want this problem statement to stand relatively independently since I think it can be worked on relatively independently (especially if it ends up being an impossibility argument).
That makes sense. Are you describing it as a problem that you (or others you already have in mind such as people at OpenAI) will work on, or are you putting it out there for people looking for a problem to attack?
At each step of local search you have some current policy and you are going to produce a new one (e.g. by taking a gradient descent step, or by generating a bunch of perturbations). You can use the current policy to help define the objective for the next one, rather than needing to make a whole separate call to Opt.
So, something like, when training the next level agent in IDA, you initialize the model parameters with the current parameters rather than random parameters?
Are you describing it as a problem that you (or others you already have in mind such as people at OpenAI) will work on, or are you putting it out there for people looking for a problem to attack?
I will work on it at least a little, and I’m encouraging others to think about it.
So, something like, when training the next level agent in IDA, you initialize the model parameters with the current parameters rather than random parameters?
You don’t even need to explicitly maintain separate levels of agent. You just always use the current model to compute the rewards, and use that reward function to compute a gradient and update.
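For concreteness, a minimal skeleton of that loop (my own illustrative framing; the callable names are assumptions, not anything from the post): the same current parameters both produce the behavior and, through the amplified evaluation, supply the reward used for the next gradient step.

```python
# Illustrative skeleton only: one training loop in which the current model
# computes the rewards that its own next update is taken against.
def train(params, sample_task, act, amplified_evaluate, grad_step, num_steps=1000):
    """
    sample_task()                               -> a training task
    act(params, task)                           -> current model's behavior
    amplified_evaluate(params, task, behavior)  -> reward computed by (an
                                                   amplified version of) the
                                                   *current* model
    grad_step(params, task, behavior, reward)   -> updated params (one
                                                   local-search step)
    """
    for _ in range(num_steps):
        task = sample_task()
        behavior = act(params, task)
        # No separate overseer agent and no fresh call to Opt: the same
        # current parameters supply the evaluation signal.
        reward = amplified_evaluate(params, task, behavior)
        params = grad_step(params, task, behavior, reward)
    return params
```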
You don’t even need to explicitly maintain separate levels of agent. You just always use the current model to compute the rewards, and use that reward function to compute a gradient and update.
You’re using the current model to perform subtasks of “compute the reward for the current task(s) being trained” and then updating, and local optimization ensures the update will make the model better (or at least no worse) at the task being trained, but how do you know the update won’t also make the model worse at the subtasks of “compute the reward for the current task(s) being trained”?
Is the answer something like, the current tasks being trained include all previously trained tasks? But even then, it’s not clear that as you add more tasks to the training set, performance on previously trained tasks won’t degrade.
The idea is that “the task being trained” is something like: 50% what you care about at the object level, 50% the subtasks that occur in the evaluation process. The model may sometimes get worse at the evaluation process, or at the object-level task; you are just trying to optimize some weighted combination.
There are a bunch of distinct difficulties here. One is that the distribution of “subtasks that occur in the evaluation process” is nonstationary. Another is that we need to set up the game so that doing both evaluation and the object level task is not-much-harder than just doing the object level task.
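A toy rendering of that mixed training distribution (the 50/50 split is just the illustrative figure above, and the function and argument names are mine):

```python
import random

def sample_training_task(object_level_tasks, evaluation_subtasks):
    """Toy sketch: half the training tasks are object-level, half are
    subtasks drawn from the evaluation process. The evaluation-subtask pool
    shifts as the model changes, which is the nonstationarity noted above."""
    if random.random() < 0.5:
        return random.choice(object_level_tasks)
    return random.choice(evaluation_subtasks)
```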