Consistency does not imply correctness, but correctness implies consistency. This has been known for a long time, and there’s been a lot written on deception and treacherous turns (pretending to be aligned until the agent is powerful enough not to need to pretend).
Note that it IS possible that an agent which behaves well for some amount of time in some contexts is ACTUALLY aligned and will remain so even when free. It’s just that there’s no possible evidence, for a sufficiently-powerful AI, that can distinguish this from deception. But “sufficiently-powerful” is doing a lot of work here. A whole lot depends on how an agent goes from weak and observable to powerful and incomprehensible, and whether we have reason to believe that alignment is durable through that evolution.