It’s looking like the direction things are going is that most AIs are going to have large language models, if not as their entire architecture at least as something that’s present in their architecture somewhere. This means that an AI will have access to an enormous amount of data about what the real world should look like, in a form that’s infeasible to restrict or ablate. So it will be difficult or impossible to make a simulated environment which can fool it.
But it might be possible to do with alignment tools that are also AIs or other purposes/prompts that condition behavior/identity of the same model, as a sort of GAN. Humans can’t inspect models directly, so it’s always interpretability and oversight that’s AI-driven in some way, even if it’s grounded in humans eventually. So it might be the case that at a superhuman level, an aligned AGI will have other superhuman AGIs (or prompts/bureaucracies/identities within the same AGI) that design convincing honeypots to reveal any deceptive alignment regimes within it. It doesn’t seem impossible that an apparently (even to itself) aligned AI has misalignment somewhere in its crash space, outside the goodhart boundary (where it’s no longer known to be sane), and exploring the crash space certainly warrants containment measures.
I disagree with jimrandomh that it would be intractable to make a simulated environment which could provide useful information about a LLM-based AI’s alignment. I am a proponent of the ‘evaluate the AI’s alignment in a simulation’ proposal. I believe that with sufficient care and running the AI in small sequences of steps, a simulation, especially a purely text simulation, could be sufficiently accurate. Also, I believe it is possible to ‘handicap’ a model that is too clever to be otherwise fooled by directly altering its activation states during inference (e.g. with noise masks).
I think that in this situation, allowing the AI being evaluated to directly communicate with humans or make any lasting record of its activities would be very dangerous. Instead, abstract low-bandwidth evaluation tools should be used to analyze the AI’s performance.
It’s looking like the direction things are going is that most AIs are going to have large language models, if not as their entire architecture at least as something that’s present in their architecture somewhere. This means that an AI will have access to an enormous amount of data about what the real world should look like, in a form that’s infeasible to restrict or ablate. So it will be difficult or impossible to make a simulated environment which can fool it.
But it might be possible to do with alignment tools that are also AIs or other purposes/prompts that condition behavior/identity of the same model, as a sort of GAN. Humans can’t inspect models directly, so it’s always interpretability and oversight that’s AI-driven in some way, even if it’s grounded in humans eventually. So it might be the case that at a superhuman level, an aligned AGI will have other superhuman AGIs (or prompts/bureaucracies/identities within the same AGI) that design convincing honeypots to reveal any deceptive alignment regimes within it. It doesn’t seem impossible that an apparently (even to itself) aligned AI has misalignment somewhere in its crash space, outside the goodhart boundary (where it’s no longer known to be sane), and exploring the crash space certainly warrants containment measures.
I disagree with jimrandomh that it would be intractable to make a simulated environment which could provide useful information about a LLM-based AI’s alignment. I am a proponent of the ‘evaluate the AI’s alignment in a simulation’ proposal. I believe that with sufficient care and running the AI in small sequences of steps, a simulation, especially a purely text simulation, could be sufficiently accurate. Also, I believe it is possible to ‘handicap’ a model that is too clever to be otherwise fooled by directly altering its activation states during inference (e.g. with noise masks).
I think that in this situation, allowing the AI being evaluated to directly communicate with humans or make any lasting record of its activities would be very dangerous. Instead, abstract low-bandwidth evaluation tools should be used to analyze the AI’s performance.