But it might be possible to do with alignment tools that are also AIs or other purposes/prompts that condition behavior/identity of the same model, as a sort of GAN. Humans can’t inspect models directly, so it’s always interpretability and oversight that’s AI-driven in some way, even if it’s grounded in humans eventually. So it might be the case that at a superhuman level, an aligned AGI will have other superhuman AGIs (or prompts/bureaucracies/identities within the same AGI) that design convincing honeypots to reveal any deceptive alignment regimes within it. It doesn’t seem impossible that an apparently (even to itself) aligned AI has misalignment somewhere in its crash space, outside the goodhart boundary (where it’s no longer known to be sane), and exploring the crash space certainly warrants containment measures.
But it might be possible to do with alignment tools that are also AIs or other purposes/prompts that condition behavior/identity of the same model, as a sort of GAN. Humans can’t inspect models directly, so it’s always interpretability and oversight that’s AI-driven in some way, even if it’s grounded in humans eventually. So it might be the case that at a superhuman level, an aligned AGI will have other superhuman AGIs (or prompts/bureaucracies/identities within the same AGI) that design convincing honeypots to reveal any deceptive alignment regimes within it. It doesn’t seem impossible that an apparently (even to itself) aligned AI has misalignment somewhere in its crash space, outside the goodhart boundary (where it’s no longer known to be sane), and exploring the crash space certainly warrants containment measures.