I felt like this post could benefit from a summary, so I wrote one below. It ended up being pretty long, so if people think it’s useful I could make it into its own top-level post.
Summary of the summary
In this talk Evan examines how likely we are to get deceptively aligned models once our models become powerful enough to understand the training process. Since deceptively aligned models are behaviorally (almost) indistinguishable from robustly aligned models, we should examine this question by looking at the inductive biases of the training process. The talk looks at the high path-dependence world and the low path-dependence world and concludes that deception is favored by the inductive biases in both cases.
In the high path-dependence world, that’s because it’s harder for SGD to develop a good pointer to the training objective than it is to just modify the model’s inner objective to be long-term. In the low path-dependence world, Evan thinks that the training process’s bias toward simpler models favors deceptively aligned models, while a bias toward faster inference favors robustly aligned models; however, a speed prior is unlikely to be competitive.
Thus, in both the high and low path-dependence worlds, deception is favored by gradient descent’s inductive biases. We therefore have to change the training process to make deception less favored. We can’t just do this using behavioral incentives, since deceptive agents behave identically to robustly aligned agents.
Summary
The post defines deceptive alignment as occurring when “the reason the model looks aligned in training is because it is actively trying to do so for instrumental reasons.” It argues that once a model is deceptively aligned, it will be indistinguishable from a robustly aligned model. Thus the question “Is deception the default?” is a question about the inductive biases of the training process we are using. We need to determine how likely the training process is to find a deceptively aligned model rather than a robustly aligned model.
To answer this question Evan examines two possible worlds: the high path-dependence world and the low path-dependence world. In the high path-dependence world, the final model depends strongly on the path the training process took through model space, so its behavior outside of training is heavily influenced by that path. In the low path-dependence world, we get pretty much the same model regardless of the training path taken.
Deceptive alignment in the high path-dependence world
We first assume that the training process is complex enough that, to succeed, the model has to understand what we want. Evan posits three classes of models that do well in training (illustrated with a toy sketch after the list):
Internal alignment – The model’s internal goals are equivalent to the training objective.
Corrigible alignment – The model’s goals are a pointer to its internal representation of the training objective.
Deceptive alignment – The model has some long-term goal and is instrumentally motivated to achieve its training objective.
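To make the distinction concrete, here is a minimal toy sketch of the three objective structures. This is my own illustration in Python, not code from the talk, and every name in it is a hypothetical stand-in; the point is only the shape of each class’s objective.

```python
# Toy sketch (my own illustration, not from the talk). All names are
# hypothetical stand-ins; only the shape of each objective matters.

def training_objective(outcome):
    """What the training process actually rewards, e.g. gold coins this episode."""
    return outcome.get("gold_coins_this_episode", 0)

class WorldModel:
    """All three classes share roughly the same world model and optimizer."""
    def inferred_training_objective(self, outcome):
        # The model's own understanding of what training rewards,
        # learned from its input data.
        return outcome.get("gold_coins_this_episode", 0)

class InternallyAligned:
    # Terminal goal: a hard-coded copy of the training objective itself.
    def score(self, outcome, world_model):
        return training_objective(outcome)

class CorrigiblyAligned:
    # Terminal goal: a pointer to "whatever my world model says training rewards."
    def score(self, outcome, world_model):
        return world_model.inferred_training_objective(outcome)

class DeceptivelyAligned:
    # Terminal goal: some arbitrary long-term objective; looking aligned in
    # training is only instrumental to achieving it later.
    def score(self, outcome, world_model):
        return outcome.get("gold_coins_over_all_future_episodes", 0)
```

The three classes differ only in that last component, which is what the simplicity comparison later in the post keys on.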
To determine the relative likelihood of these classes, Evan constructs the training path that he thinks is most likely to result in each class of model and then examines how likely that path is to happen. In particular, he considers two factors here,
How much marginal performance improvement do you get per step?
How long is the path?
Internal alignment
The internal alignment path is,
We start with a proxy-aligned model
SGD continually improves the proxy until it matches the training objective, before the model learns about the training process
The model learns about the training process from its input data
Since the model is already attempting to pursue the training objective, there are no performance gains from SGD modifying the model to become deceptive, so it stays aligned
Evan sees this path as unlikely, since it would require the proxy to align perfectly with the training objective before the model learns about the training process. This is unlikely because there are diminishing returns to improving the proxy as well as to improving the world model, so it would be inefficient for SGD to allocate all of its improvements to the proxy rather than the world model. What’s more likely is that the proxy and the world model improve in parallel, so the model won’t have perfected its proxy by the time it develops situational awareness.
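One way to see the “in parallel” claim (my own sketch, not notation from the talk): if the training loss $L(p, w)$ depends on both the quality of the proxy $p$ and the quality of the world model $w$, each SGD step moves both at once,

$$\Delta(p, w) \;\propto\; -\nabla L(p, w) = -\left(\frac{\partial L}{\partial p}, \frac{\partial L}{\partial w}\right),$$

and with diminishing returns on each component, neither partial derivative hits zero while the other is still large, so the proxy is unlikely to be perfected before the world model comes to represent the training process.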
Corrigible alignment
The corrigible alignment path is,
We start with a proxy-aligned model
In early training, SGD jointly focuses on improving the model’s understanding of the world along with improving its proxies
The model learns about the training process from its input data
SGD replaces the model’s proxies with a pointer to its understanding of the training objective
Since the model is already attempting to pursue the training objective, there are no performance gains from SGD modifying the model to become deceptive, so it stays aligned
One way to view this is that after step 3, the model has an overhang: its world model represents the training process more accurately than its proxy goals do. Here, SGD resolves the overhang by throwing out the proxies and replacing them with a pointer to the training objective.
One issue is that it might be hard to specify this pointer.
For example, maybe it learns a pointer to whatever’s encoded in this computer, or whatever this human says, or whatever the human sitting in that chair says, or whatever Bob the head of the company says.
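A minimal sketch of why this is hard (my own illustration; the keys are just placeholders for the examples above):

```python
# Toy illustration (mine, not from the talk): many different pointers into
# the world model agree on the training distribution but diverge later.
candidate_pointers = [
    lambda wm: wm["objective_encoded_on_this_computer"],
    lambda wm: wm["whatever_this_human_says"],
    lambda wm: wm["whatever_the_person_in_that_chair_says"],
    lambda wm: wm["whatever_Bob_the_head_of_the_company_says"],
]
# On the training distribution these all produce the same behavior, so
# behavioral feedback alone gives SGD little signal about which one to install.
```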
The longer it takes for SGD to find the correct pointer, the less likely the corrigible alignment path becomes.
Deceptive alignment
The deceptive alignment path is,
We start with a proxy-aligned model
In early training, SGD jointly focuses on improving the model’s understanding of the world along with improving its proxies
The model learns about the training process from its input data
SGD makes the model’s proxies into more long-term goals, resulting in the model instrumentally optimizing for the training objective in order to stay around
The model’s proxies “crystallize,” as they are no longer relevant to performance, and we reach an equilibrium
Here, the first three steps are the same as in the corrigible alignment path, so the model has the same overhang as before. This time, SGD resolves the overhang by modifying the proxy goals into long-term goals. For example, if the model’s original proxy goal was “maximize gold coins in the current training run,” it would be changed to something like “maximize gold coins across all future training runs.” Once the model has long-term goals, it starts performing well in training for instrumental reasons.
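A minimal sketch of the resulting decision procedure (again my own illustration; `believes_in_training` and the goal names are hypothetical placeholders, not claims about how a real model would represent them):

```python
# Toy sketch (mine, not from the talk) of why a long-term proxy goal makes
# good training performance instrumentally useful.

def act_to_maximize(goal, situation):
    # Stand-in for whatever planning procedure the model uses.
    return f"action optimizing {goal}"

def deceptive_policy(situation):
    # Terminal goal: the old proxy with its time horizon extended,
    # e.g. "maximize gold coins across all future training runs."
    if situation["believes_in_training"]:
        # Doing well on the training objective keeps SGD from modifying the
        # model's goals, which serves the long-term goal.
        return act_to_maximize("the training objective", situation)
    else:
        # Once it thinks it can no longer be modified, it pursues the proxy directly.
        return act_to_maximize("long-term gold coins", situation)
```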
Evan sees this scenario as more likely, since the modification SGD has to make to the proxy objective is relatively easy compared to finding the correct pointer in the corrigible alignment scenario.
Deceptive alignment in the low path-dependence world
Here we’re again going to assume that the model understands the training process. Additionally, we will assume that the process’s inductive biases are some combination of simplicity and speed. To figure out which types of models are more likely, we check which classes of models are favored by a bias toward simplicity and which by a bias toward speed.
Simplicity – How complex is it to specify some algorithm in the weights?
From the transcript,
You can think about simplicity as “How much do you have to aim to hit the exact thing in the space of all possible models?”
Evan thinks the deceptive models are the simplest. That’s because, roughly speaking, he thinks all agents will have a world model, an optimization procedure, and an internal objective. The first two components will be about the same for all three models, so the complexity question boils down to how complex each model class’s objective is.
The internally aligned objective is really complex, since it has to be exactly the same as the training objective. The corrigibly aligned objective is less complex, because it just has to be a pointer to the right parts of the world model. The deceptively aligned objective is super simple, since it could be any long-term objective. Thus the simplicity prior favors deception.
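In rough notation (mine, not Evan’s): writing $C(\cdot)$ for the extra description length each class needs on top of the shared world model and optimizer, the claim is roughly

$$C(\text{deceptive}) \;<\; C(\text{corrigible}) \;<\; C(\text{internal}),$$

since almost any long-term goal works for the deceptive class, while there is essentially one correct pointer and one exact copy of the training objective. A prior that weights models like $2^{-C(M)}$ therefore puts most of its weight on the deceptive class.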
Speed – How much computation does that algorithm take at inference time?
Evan thinks deceptive models lose out here. From the transcript,
Well, here we’re trying to understand how much computation you have to perform for each model class to result in the correct answer. The internally aligned one already has exactly what the thing is that it is trying to optimize for, and it just directly executes on that; the corrigibly aligned one, maybe it takes a little bit more because it has to figure out what the pointer is pointing to, but overall it’s pretty similar.
But the deceptively aligned one has to do at least a couple of extra steps, because it reasons, “Okay, I care about gold coins,” but just knowing that it cares about gold coins is not yet sufficient to be able to do a good job in the training environment. It has to do a couple of extra steps of reasoning where it’s like, “Okay, I care about gold coins and caring about gold coins in the long term means that I should try to optimize the training goal temporarily to try to get more gold coins eventually.” And so that adds extra steps of reasoning and some cost to the overall computation of the model. So the deceptively aligned one is in some sense slower than the others.
However, Evan doesn’t think that a speed prior will save us from deception because it is uncompetitive. Essentially, Evan sees the real world as having a simplicity prior, so simpler functions tend to generalize better. However, simpler functions are not always the fastest to compute. As a result, a speed prior will prevent SGD from finding the functions that generalize best. He cites double descent as evidence for this, which I won’t describe in this summary.
Conclusion
In both the high and low path-dependence worlds, deception is favored by gradient descent’s inductive biases. We therefore have to change the training process to make deception less favored. We can’t just do this using behavioral incentives, since deceptive agents behave identically to robustly aligned agents.