I agree that this sort of deceptive misalignment story is speculative but a priori plausible. I think it’s very difficult to reason about these sorts of nuanced inductive biases without having sufficiently tight analogies to current systems or theoretical models; how this will play out (as with other questions of inductive bias) probably depends to a large extent on what the high-level structure of the AI system looks like. Because of this, I think it’s more likely than not that our predictions about what these inductive biases will look like are pretty off-base. That being said, here are the first few specific reasons to doubt the scenario which come to mind right now:
If the system is modular, such that the part of the system representing the goal is separate from the part of the system optimizing the goal, then it seems plausible that we can apply some sort of regularization to the goal to discourage it from being long term. It’s imaginable that the goal is a mesa-objective which is mixed in some inescapably non-modular way with the rest of the system, but then it would be surprising to me if the system’s behavior could really be best characterized as optimizing this single objective, as opposed to applying a bunch of heuristics, some of which involve pursuing mesa-objectives and some of which don’t fit into that schema. So perhaps framing everything the agent does in terms of objectives isn’t the most useful framing (?).
If an agent has a long-term objective, for which achieving the desired short-term objective is only instrumentally useful, then in order to succeed the agent needs to figure out how to minimize the loss by using its reasoning skills (by default, within a single episode). If, on the other hand, the agent has an appropriate short-term objective, then the agent will learn (across episodes) how to minimize the loss through gradient descent. I expect the latter scenario to typically result in better loss for statistical reasons, since the agent can take advantage of more samples. (This would be especially clear if, in the training paradigm of the future, the competence of the agent increases during training.)
(There’s also the idea of imposing a speed prior; not sure how likely that direction is to pan out.)
Perhaps most crucially, for us to be wrong about Hypothesis 2, deceptive misalignment needs to happen extremely consistently. It’s not enough for it to be plausible that it could happen often; it needs to happen all the time.
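As a rough illustration of the statistical point in the second reason above (more samples when learning happens across episodes), here is a toy sketch. Everything in it is an assumption made for illustration: the scalar "target" the agent must estimate, the Gaussian noise, the episode sizes, and the function name `simulate` are hypothetical and are not meant to model any real training setup. The "within-episode" agent re-derives its estimate from scratch each episode, while the "across-episode" learner carries a parameter updated by gradient steps on squared error.

```python
import random


def simulate(true_value=0.7, episodes=200, samples_per_episode=5,
             noise_sd=1.0, seed=0):
    """Toy comparison (hypothetical setup, not a model of real training).

    Returns the mean squared error over the second half of the episodes for
    (a) an agent that estimates the target from scratch in every episode and
    (b) a learner whose parameter persists and is updated across episodes.
    """
    rng = random.Random(seed)
    learned = 0.0  # parameter carried across episodes
    seen = 0       # total number of observations used so far
    within_losses, across_losses = [], []

    for _ in range(episodes):
        obs = [true_value + rng.gauss(0.0, noise_sd)
               for _ in range(samples_per_episode)]

        # (a) Within-episode reasoning: only this episode's samples are used.
        within_estimate = sum(obs) / len(obs)
        within_losses.append((within_estimate - true_value) ** 2)

        # (b) Across-episode learning: gradient step on 0.5 * (learned - x)^2
        # with step size 1/seen, which is equivalent to a running mean.
        for x in obs:
            seen += 1
            learned -= (learned - x) / seen
        across_losses.append((learned - true_value) ** 2)

    half = episodes // 2
    within_mse = sum(within_losses[half:]) / len(within_losses[half:])
    across_mse = sum(across_losses[half:]) / len(across_losses[half:])
    return within_mse, across_mse


if __name__ == "__main__":
    within_mse, across_mse = simulate()
    print(f"within-episode MSE: {within_mse:.4f}")
    print(f"across-episode MSE: {across_mse:.4f}")
```

In this toy setting the across-episode learner ends up with the lower error simply because it pools far more samples. The sketch builds in the assumption that the within-episode agent carries nothing over between episodes, which is the "by default" case described above; it says nothing about agents that can.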
If the system is modular, such that the part of the system representing the goal is separate from the part of the system optimizing the goal, then it seems plausible that we can apply some sort of regularization to the goal to discourage it from being long term.
What kind of regularization could this be? And are you imagining an AlphaZero-style system with a hardcoded value head, or an organically learned modularity?
Perhaps most crucially, for us to be wrong about Hypothesis 2, deceptive misalignment needs to happen extremely consistently. It’s not enough for it to be plausible that it could happen often; it needs to happen all the time.
I think the situation is much better if deceptive alignment is inconsistent. I also think that’s more likely, particularly if we are trying.
That said, I don’t think the problem goes away completely if deceptive alignment is inconsistent. We may still have limited ability to distinguish deceptively aligned models from models that are trying to optimize reward. Or we may find that models that are trying to optimize reward are unsuitable in practice (e.g. because of the issues raised in mechanism 1), so that selecting for things that work means you are selecting for deceptive alignment.