I disagree with jimrandomh that it would be intractable to make a simulated environment which could provide useful information about a LLM-based AI’s alignment. I am a proponent of the ‘evaluate the AI’s alignment in a simulation’ proposal. I believe that with sufficient care and running the AI in small sequences of steps, a simulation, especially a purely text simulation, could be sufficiently accurate. Also, I believe it is possible to ‘handicap’ a model that is too clever to be otherwise fooled by directly altering its activation states during inference (e.g. with noise masks).
I think that in this situation, allowing the AI being evaluated to directly communicate with humans or make any lasting record of its activities would be very dangerous. Instead, abstract low-bandwidth evaluation tools should be used to analyze the AI’s performance.
I disagree with jimrandomh that it would be intractable to make a simulated environment which could provide useful information about a LLM-based AI’s alignment. I am a proponent of the ‘evaluate the AI’s alignment in a simulation’ proposal. I believe that with sufficient care and running the AI in small sequences of steps, a simulation, especially a purely text simulation, could be sufficiently accurate. Also, I believe it is possible to ‘handicap’ a model that is too clever to be otherwise fooled by directly altering its activation states during inference (e.g. with noise masks).
I think that in this situation, allowing the AI being evaluated to directly communicate with humans or make any lasting record of its activities would be very dangerous. Instead, abstract low-bandwidth evaluation tools should be used to analyze the AI’s performance.