Let’s consider the following online learning setup:
At each timestep $t$, $\pi_{\theta_t}$ takes action $a_t \in A$ and receives reward $r_t \in \mathbb{R}$. Then, we perform the simple policy gradient update
$$\theta_{t+1} = \theta_t + r_t \nabla_\theta \log P(a_t \mid \pi_{\theta_t}).$$
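(To make this concrete, here is a minimal sketch of that update for a softmax policy over a discrete action set. The environment object `env`, its `step` method returning the reward, and the `lr` parameter are illustrative assumptions and not part of the setup above; with `lr = 1.0` the code matches the update rule exactly.)

```python
# Minimal sketch of the online policy gradient update described above,
# for a softmax policy over a discrete action set (no state, bandit-style).
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def grad_log_prob(theta, a):
    # For a softmax policy with one logit per action,
    # d/d theta_i log P(a) = 1[i == a] - P(i).
    p = softmax(theta)
    g = -p
    g[a] += 1.0
    return g

def online_policy_gradient(env, n_actions, timesteps, lr=1.0):
    theta = np.zeros(n_actions)  # theta_0
    for t in range(timesteps):
        p = softmax(theta)
        a = np.random.choice(n_actions, p=p)  # a_t ~ pi_{theta_t}
        r = env.step(a)                       # r_t (env interface is assumed)
        # theta_{t+1} = theta_t + r_t * grad_theta log P(a_t | pi_{theta_t})
        theta = theta + lr * r * grad_log_prob(theta, a)
    return theta
```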
Now, we can ask the question: would $\pi_{\theta_t}$ be a mesa-optimizer? The first thing that's worth noting is that the above setup is precisely the standard RL training setup—the only difference is that there's no deployment stage. What that means, though, is that if standard RL training produces a mesa-optimizer, then this will produce a mesa-optimizer too, because the training process isn't different in any way whatsoever. If $\pi$ is acting in a diverse environment that requires search to solve effectively, then $\pi$ will still need to learn to do search—the fact that there won't ever be a deployment stage in the future is irrelevant to $\pi$'s current training dynamics (unless $\pi$ is deceptive and knows there won't be a deployment stage—that's the only situation where it might be relevant).
Given that, we can ask the question of whether $\pi$, if it's a mesa-optimizer, is likely to be misaligned—and in particular whether it's likely to be deceptive. Again, in terms of proxy alignment, the training process is exactly the same, so the picture isn't any different at all—if there are simpler, easier-to-optimize-for proxies, then $\pi$ is likely to learn those instead of the true base objective. Like I mentioned previously, however, deceptive alignment is the one case where it might matter that you're doing online learning, since, if the model knows that, it might do different things based on that fact. However, there are still lots of reasons why a model might be deceptive even in an online learning setup—for example, it might expect better opportunities for defection in the future, and thus want to prevent being modified now so that it can defect when it'll be most impactful.
When I say “optimize all the parameters at runtime”, I do not mean “take one gradient step in between each timestep”. I mean, at each timestep, fully optimize all of the parameters. Optimize $\theta$ all the way to convergence before every single action.
Think back to the central picture of mesa-optimization (at least as I understand it). The mesa-optimizer shows up because some data is only available at runtime, not during training, so it has to be processed at runtime using parameters selected during training. In the online RL setup you sketch here, “runtime” for mesa-optimization purposes is every time the system chooses its action—i.e. every timestep—and “training” is all the previous timesteps. A mesa-optimizer should show up if, at every timestep, some relevant new data comes in and the system has to process that data in order to choose the optimal action, using parameters inherited from previous timesteps.
Now, suppose we fully optimize all of the parameters at every timestep. The objective function for this optimization would presumably be $\sum_t r_t \log P(a_t \mid \pi_\theta)$, with the sum taken over all previous data points, since that's what the RL setup is approximating.
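(As a rough sketch of what that would look like, under the same softmax-policy assumptions as the earlier snippet: before each action, re-fit $\theta$ on the cumulative objective over all previous data points. The helper names, the fixed `inner_steps` loop standing in for "optimize to convergence", and the `env` interface are all illustrative assumptions.)

```python
# Sketch of "fully optimize all the parameters at every timestep":
# before choosing each action, re-fit theta to (approximate) convergence on
# J(theta) = sum over past (a, r) of r * log P(a | pi_theta).
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def refit_to_convergence(history, n_actions, lr=0.1, inner_steps=1000):
    # Gradient ascent on the cumulative objective; inner_steps is a stand-in
    # for "optimize all the way to convergence".
    theta = np.zeros(n_actions)
    for _ in range(inner_steps):
        p = softmax(theta)
        grad = np.zeros(n_actions)
        for a, r in history:
            g = -p          # d/d theta log P(a) = 1[i == a] - P(i)
            g = g.copy()
            g[a] += 1.0
            grad += r * g
        theta = theta + lr * grad
    return theta

def fully_optimized_online_loop(env, n_actions, timesteps):
    history = []  # all previous (a_t, r_t) data points
    for t in range(timesteps):
        # "Training": fit on all previous timesteps.
        theta = refit_to_convergence(history, n_actions)
        # "Runtime": use the fitted parameters to choose this timestep's action.
        p = softmax(theta)
        a = np.random.choice(n_actions, p=p)
        r = env.step(a)  # env interface is assumed
        history.append((a, r))
    return history
```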
This optimization would probably still “find” the same mesa-optimizer as before, but now it looks less like a mesa-optimizer problem and more like an outer alignment problem: that objective function is probably not actually the thing we want. The fact that the true optimum of that objective function probably has our former “mesa-optimizer” embedded in it is a pretty strong signal that the objective function itself is not outer aligned; its true optimum is not really the thing we want.
Does that make sense?
The RL process is actually optimizing $\mathbb{E}\left[\sum_t r_t\right]$; the log just comes from the REINFORCE trick. Regardless, I'm not sure I understand what you mean by optimizing fully to convergence at each timestep—convergence is a limiting property, so I don't know what it could mean to do it for a single timestep. Perhaps you mean just taking the optimal policy $\pi^*$ such that
$$\pi^* = \arg\max_\pi \mathbb{E}\left[\textstyle\sum_t r_t \mid \pi\right]?$$
In that case, that is in fact the definition of outer alignment I've given in the past, so I agree that whether $\pi^*$ is aligned or not is an outer alignment question.
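(For reference, the standard score-function identity behind that trick, written here for a single timestep:
$$\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}[r(a)] = \sum_a r(a)\, \nabla_\theta P(a \mid \pi_\theta) = \sum_a r(a)\, P(a \mid \pi_\theta)\, \nabla_\theta \log P(a \mid \pi_\theta) = \mathbb{E}_{a \sim \pi_\theta}\big[r(a)\, \nabla_\theta \log P(a \mid \pi_\theta)\big],$$
so the sampled update $r_t \nabla_\theta \log P(a_t \mid \pi_{\theta_t})$ is an unbiased estimate of the gradient of the expected reward.)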
Sure, $\pi^*$ works for what I'm saying, assuming that the sum over time only includes the timesteps taken thus far. In that case, I'm saying that either:
- the mesa-optimizer doesn't appear in $\pi^*$, in which case the problem is fixed by fully optimizing everything at every timestep (i.e. by using $\pi^*$), or
- the mesa-optimizer does appear in $\pi^*$, in which case the problem was really an outer alignment issue all along.