I like this frame, and I don’t recall seeing it already addressed.
What I have seen written about deceptiveness generally seems to assume that the AGI would be good enough at obfuscating its thoughts, both from direct queries and from any interpretability tools we have available, that it could effectively make its plans for world domination in secret, unobserved by humans. If it’s able to do that, it does seem like an even more effective strategy for optimizing its actual utility function than not bothering to think through such plans at all. But it’s hard to do, and even thinking about it is risky.
I can imagine something like what you describe happening as a middle stage, for entities that are agentic enough to have goals (latent, and probably misaligned, since alignment is probably hard) but not yet capable enough to think hard about how to optimize for them without being detected. It seems more likely if (1) almost all sufficiently powerful AI systems created by humans will actually have misaligned goals, (2) AIs are optimized very hard against having visibly misaligned cognition (selection of which AIs to keep being a form of optimization, in this context), and (3) our techniques for making misaligned cognition visible are more reliably able to detect an active process / subsystem doing planning towards goals than the mere latent presence of such goals. Condition (3) seems likely, at least for a while and assuming we have any meaningful interpretability tools at all; it’s hard for me to imagine a detector of latent properties that doesn’t just always say “well, there are some off-distribution inputs that would make it do something very bad” for every sufficiently powerful AI, even one that was aligned in practice because those inputs would reliably never be given to it.