So long as the AI retains no modifications or artifacts from the evaluation, it can't learn from it, and thus it should be safe to make the simulation as accurate as possible. I agree that some things like this are already sort of happening, but not with the evaluation focused on the most dangerous paths (agentive strategic planning to acquire resources, self-modification to increase abilities, deceptive manipulation to avoid detection and run cons, etc.). And not in a widespread, consistent way, where every published paper has a little note saying that the model passed its standardized safety evals.
If the AI learns it is in a sim, that could completely undermine or invalidate any evaluation of its ethical/moral/altruistic behavior. I am assuming that the agent's entire life and education/training process is thus an evaluation. The sim can be 'accurate'; it just needs to be knowledge-constrained. A medieval-tech-era sim would be fine, for example.