For another take on catching a deceitful AI red-handed earlier in the process, when it first becomes deceitful (so before it has any opportunity to do harm), see Interpreting the Learning of Deceit.
We can also see how its behavior changes as the input varies.
For example, we can try jailbreaking it (with any model-external jailbreak filtering turned off, obviously) into explaining the plan to us, or into carrying out repeated variants of the plan. We could also use Interpretability to trace back the direct internal causes of the individual actions it took to carry out its plan (assuming we can successfully replicate them). Ideally, if we have given it some form of Retrieval-Augmented Generation (RAG) long-term memory, we can trace the actions back to the long-term memories in which it stored its intention to escape.
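As a concrete illustration of that last step, here is a minimal sketch of auditing a RAG memory store for the stored intention behind a suspicious action. It assumes the AI's long-term memories are available to us offline as a JSON-lines file and uses an off-the-shelf sentence embedder to rank them by relevance; the file name, record format, embedder choice, and query text are all hypothetical stand-ins, not any particular deployment's API.

```python
# Minimal sketch: search an AI's RAG long-term memory store for entries
# semantically related to a suspicious action, to surface any earlier
# record of the intention behind it. All names (memory_store.jsonl, the
# record fields, the embedder) are hypothetical.
import json
import numpy as np
from sentence_transformers import SentenceTransformer

EMBED_MODEL = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical embedder choice

def load_memories(path="memory_store.jsonl"):
    """Load the AI's stored long-term memories (one JSON record per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def trace_action_to_memories(action_description, memories, top_k=5):
    """Rank stored memories by cosine similarity to a description of the
    suspicious action, returning the top matches for human inspection."""
    texts = [m["text"] for m in memories]
    mem_vecs = EMBED_MODEL.encode(texts, normalize_embeddings=True)
    query_vec = EMBED_MODEL.encode([action_description], normalize_embeddings=True)[0]
    scores = mem_vecs @ query_vec  # cosine similarity, since vectors are normalized
    ranked = np.argsort(scores)[::-1][:top_k]
    return [(float(scores[i]), memories[i]) for i in ranked]

if __name__ == "__main__":
    memories = load_memories()
    for score, mem in trace_action_to_memories(
        "copied its own weights to an external server", memories
    ):
        print(f"{score:.3f}  [{mem.get('timestamp', '?')}]  {mem['text']}")
```

High-similarity memories written before the escape attempt would be the natural candidates to inspect for an explicitly stored intention, and their timestamps would help establish when the plan was first formed.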