I don’t think it makes sense to classify every instance of this as deceptive alignment—and I don’t think this is the usual use of the term.
I think that to say “this is deceptive alignment” is generally to say something like “there’s a sense in which this system has a goal different from ours, is modeling the selection pressure it’s under, anticipating that this selection pressure may not exist in the future, and adapting its behaviour accordingly”.
That still leaves things underdefined, since e.g. this can all happen implicitly and/or without the system knowing that this mechanism exists.
However, if you’re not suggesting in any sense that [anticipation of potential future removal of selection pressure] is a big factor, then it’s strange to call it deceptive alignment.
I assume Wiblin means it in this sense—not that this is the chance we get catastrophically bad generalization, but rather that it happens via a mechanism he’d characterize this way.
[I’m now less clear that this is generally agreed, since e.g. Apollo seem to be using a foolish-to-my-mind definition here: “When an AI has Misaligned goals and uses Strategic Deception to achieve them” (see “Appendix C—Alternative definitions we considered” for clarification).
This is not close to the RFLO definition, so I really wish they wouldn’t use the same name. Things are confusing enough without our help.]
All that said, it’s not clear to me that [deceptive alignment] is a helpful term or target, given that there isn’t a crisp boundary, and that there’ll be a tendency to tackle an artificially narrow version of the problem.
The rationale for solving it usually seems to be [if we can solve/avoid this subproblem, we’d have instrumentally useful guarantees in solving the more general generalization problem], but I haven’t seen a good case made that we’d get the kind of guarantees we’d need (e.g. knowing only that we avoid explicit/intentional/strategic… deception of the oversight process is not enough).
It’s easy to motte-and-bailey ourselves into trouble.