A model is deceptively aligned with its designers. However, the designers have very good control mechanisms in place such that they would certainly catch the AI if it tried to act misaligned. Therefore, the model acts aligned with the designers’ intentions 100% of the time. In this world, a model that is technically deceptively aligned may still be safe in practice (although this equilibrium could be fragile and unsafe in the long run).
In that case, there is no strategic deception (the designers are not induced in error by the AI).
I think we consider this case strategic deception, because we have an intuition of what inputs the AI receives (one where it would be controlled little enough to be deceptive) that differs from the actual one.
To fix this, I propose strategic deception not be defined according to the AI’s behavior, but to its hypothetical behavior on an idealized class of inputs that represents all situations where the designers want the AI to behave in a certain way.
E.g. The designers are holding the AI in a simulation to check if it’s deceptive. This information is given to the AI in a pre-prompt due to technical issues. However, the designers want the AI to avoid strategic deception even during deployment where it won’t be told it’s in a simulation, so their idealized test set includes prompts without this information. By this definition they cannot check if the AI exhibits strategic deception before deployment in this situation.
Also, I am unsatisfied by “in order to accomplish some outcome” and “[the AI’s] goals” because this assumes an agentic framework, which might not be relevant in real-world AI.
How to fix the first, for agentic AI only: “for which the AI predicts an outcome that can be human-interpreted as furthering its goals” Not sure how to talk about deceptive non-agentic AI.
In that case, there is no strategic deception (the designers are not induced in error by the AI).
I think we consider this case strategic deception, because we have an intuition of what inputs the AI receives (one where it would be controlled little enough to be deceptive) that differs from the actual one.
To fix this, I propose strategic deception not be defined according to the AI’s behavior, but to its hypothetical behavior on an idealized class of inputs that represents all situations where the designers want the AI to behave in a certain way.
E.g. The designers are holding the AI in a simulation to check if it’s deceptive. This information is given to the AI in a pre-prompt due to technical issues. However, the designers want the AI to avoid strategic deception even during deployment where it won’t be told it’s in a simulation, so their idealized test set includes prompts without this information.
By this definition they cannot check if the AI exhibits strategic deception before deployment in this situation.
Also, I am unsatisfied by “in order to accomplish some outcome” and “[the AI’s] goals” because this assumes an agentic framework, which might not be relevant in real-world AI.
How to fix the first, for agentic AI only: “for which the AI predicts an outcome that can be human-interpreted as furthering its goals”
Not sure how to talk about deceptive non-agentic AI.