Mechanism 2: deceptive alignment

Suppose that during training my AI system had some arbitrary long-term goal. Many long-term goals would be best served if the deployed AI system had that same goal. And so my AI is motivated to get a low loss, so that gradient descent won’t change its goals.
As a result, a very wide range of long-term goals will lead to competent loss-minimizing behavior. On the other hand, there is a very narrow range of short-term goals that lead to competent loss-minimizing behavior: “minimize the loss.”
So gradient descent on the short-term loss function can easily push towards long-term goals (in fact it would push both towards the precise short-term goals that result in low loss and towards arbitrary long-term goals, and it seems like a messy empirical question which one you get). This might not happen early in training, but eventually our model becomes competent enough to appreciate these arguments, and perhaps it becomes extremely obvious to it that it should avoid taking actions that would be penalized by training.
It doesn’t seem like there are any behavioral checks we can do to easily push gradient descent back in the other direction, since an agent that is trying to get a low loss will always just adopt whatever behavior is best for getting a low loss (as long as it thinks it is on the training distribution).
All of this is true even if my AI has subhuman long-horizon reasoning. Overall, my take is that there is maybe a 25% chance that this becomes a serious issue soon enough to be relevant to us and that it is resistant to simple attempts to fix it (though it’s also possible we will fail to even competently implement simple fixes). I expect to learn much more about this as we start engaging with AI systems intelligent enough for it to be a potential issue over the next 5-10 years.
This issue is discussed here. Overall I think it’s speculative but plausible.
I agree that this sort of deceptive misalignment story is speculative but a priori plausible. I think it’s very difficult to reason about these sorts of nuanced inductive biases without having sufficiently tight analogies to current systems or theoretical models; how this will play out (as with other questions of inductive bias) probably depends to a large extent on what the high-level structure of the AI system looks like. Because of this, I think it’s more likely than not that our predictions about what these inductive biases will look like are pretty off-base. That being said, here are the first few specific reasons to doubt the scenario that come to mind right now:
If the system is modular, such that the part of the system representing the goal is separate from the part of the system optimizing the goal, then it seems plausible that we can apply some sort of regularization to the goal to discourage it from being long term. (A rough sketch of one possibility appears below.) It’s imaginable that the goal is a mesa-objective which is mixed in some inescapably non-modular way with the rest of the system, but then it would be surprising to me if the system’s behavior could really best be characterized as optimizing this single objective, as opposed to applying a bunch of heuristics, some of which involve pursuing mesa-objectives and some of which don’t fit into that schema; so perhaps framing everything the agent does in terms of objectives isn’t the most useful framing (?).
If an agent has a long-term objective, for which achieving the desired short-term objective is only instrumentally useful, then in order to succeed the agent needs to figure out how to minimize the loss by using its reasoning skills (by default, within a single episode). If, on the other hand, the agent has an appropriate short-term objective, then the agent will learn (across episodes) how to minimize the loss through gradient descent. I expect the latter scenario to typically result in better loss for statistical reasons, since the agent can take advantage of more samples. (This would be especially clear if, in the training paradigm of the future, the competence of the agent increases during training.)
(There’s also the idea of imposing a speed prior; I’m not sure how likely that direction is to pan out. One crude form this could take is sketched below.)
Perhaps most crucially, for us to be wrong about Hypothesis 2, deceptive misalignment needs to happen extremely consistently. It’s not enough for it to be plausible that it could happen often; it needs to happen all the time.
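To make the first reason above more concrete, here is a minimal, purely illustrative sketch of what “regularizing the goal to discourage it from being long term” might look like. It assumes, hypothetically, an agent with a distinct value head whose effective planning horizon is governed by a learnable discount factor; neither the architecture nor the penalty comes from the discussion above, they are just one way the idea could be cashed out.

```python
# Purely illustrative sketch (not from the discussion above). Assumes a modular
# agent in which the "goal" part is a value head whose effective planning horizon
# is controlled by a learnable discount factor. The regularizer below penalizes
# long horizons, which is one hypothetical way to discourage a long-term goal.
import torch
import torch.nn as nn


class ModularAgent(nn.Module):
    def __init__(self, d_model: int, n_actions: int):
        super().__init__()
        self.policy_head = nn.Linear(d_model, n_actions)  # the "optimizer" part
        self.value_head = nn.Linear(d_model, 1)           # the "goal" part
        # Unconstrained parameter mapped through a sigmoid to a discount in (0, 1).
        self._raw_discount = nn.Parameter(torch.tensor(0.0))

    @property
    def discount(self) -> torch.Tensor:
        return torch.sigmoid(self._raw_discount)

    def forward(self, h: torch.Tensor):
        return self.policy_head(h), self.value_head(h)


def horizon_penalty(agent: ModularAgent, strength: float = 1e-2) -> torch.Tensor:
    # The effective planning horizon is roughly 1 / (1 - discount); penalizing it
    # pushes gradient descent toward short-horizon (short-term) goals.
    horizon = 1.0 / (1.0 - agent.discount)
    return strength * horizon


# During training the penalty is simply added to the ordinary task loss:
#   loss = task_loss + horizon_penalty(agent)
```

Whether a real system’s goal would actually be exposed in such a conveniently separable, regularizable form is exactly the worry raised in the second half of that point.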
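Similarly, here is one hedged sketch of what a “speed prior” could amount to in practice: penalizing the expected amount of internal computation the agent uses, in the spirit of adaptive computation time. Again, the architecture and penalty are assumptions made for this example, not a proposal from the discussion.

```python
# Purely illustrative sketch of a "speed prior" as a compute penalty, in the spirit
# of adaptive computation time. The IterativeReasoner module and the penalty are
# assumptions for this example only.
import torch
import torch.nn as nn


class IterativeReasoner(nn.Module):
    def __init__(self, d_model: int, max_steps: int = 8):
        super().__init__()
        self.step_fn = nn.GRUCell(d_model, d_model)  # one step of internal reasoning
        self.halt = nn.Linear(d_model, 1)            # predicts probability of halting
        self.max_steps = max_steps

    def forward(self, x: torch.Tensor):
        h = torch.zeros_like(x)
        still_running = torch.ones(x.shape[0], device=x.device)
        expected_steps = torch.zeros(x.shape[0], device=x.device)
        for _ in range(self.max_steps):
            h = self.step_fn(x, h)
            p_halt = torch.sigmoid(self.halt(h)).squeeze(-1)
            # Accumulate the probability of still being active at each step, which
            # approximates the expected number of reasoning steps taken.
            expected_steps = expected_steps + still_running
            still_running = still_running * (1.0 - p_halt)
        return h, expected_steps


def speed_penalty(expected_steps: torch.Tensor, strength: float = 1e-3) -> torch.Tensor:
    # Charging the agent for expected computation is a crude stand-in for a speed
    # prior: slow, elaborate internal strategies cost more than fast ones.
    return strength * expected_steps.mean()


# As before, the penalty is added to the task loss:
#   loss = task_loss + speed_penalty(expected_steps)
```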
If the system is modular, such that the part of the system representing the goal is separate from the part of the system optimizing the goal, then it seems plausible that we can apply some sort of regularization to the goal to discourage it from being long term.
What kind of regularization could this be? And are you imagining an AlphaZero-style system with a hardcoded value head, or an organically learned modularity?
Perhaps most crucially, for us to be wrong about Hypothesis 2, deceptive misalignment needs to happen extremely consistently. It’s not enough for it to be plausible that it could happen often; it needs to happen all the time.
I think the situation is much better if deceptive alignment is inconsistent. I also think that’s more likely, particularly if we are trying to prevent it.
That said, I don’t think the problem goes away completely if deceptive alignment is inconsistent. We may still have limited ability to distinguish deceptively aligned models from models that are trying to optimize reward, or we may find that models that are trying to optimize reward are unsuitable in practice (e.g. because of the issues raised in mechanism 1), and so selecting for things that work means you are selecting for deceptive alignment.