I agree that ‘deceive the deployed AGI into thinking it is in a simulation’ is a doomed plan. I do think a somewhat nearby plan, where the AGI is kept entirely in the simulation from start to finish, is possibly a good idea. In this simbox plan, the AGI is used not to create or enact plans in the real world, but as a guinea pig for alignment experiments. Once you’ve learned enough from your experiments, you build a new AGI from scratch that you think is more align-able, and try to align it in a new simbox. Only once you’ve repeated this enough times that you are confident you have an align-able architecture do you train the model on any real-world data and try to use it to help you make a pivotal act. I imagine that early versions of an align-able architecture will need strong, enforceable limitations on the model’s speed, overall intelligence, and creativity, and will need to be highly interpretable, reasonably corrigible, etc.
This plan has weak points, but I think they are different from the ones mentioned in this post. The weak points I see with the simbox-guinea-pig plan are:
How sure can you be that the results of the experiments you are running are not being ‘thrown’ by an AGI that is aware it is in a simbox?
Can your interpretability methods detect whether the current run of the AGI has noticed that it is in a simbox yet?
How can you monitor the experiments without exposing yourself to infohazards?
Can you keep the AGI weak enough not to break out or pose an infohazard danger, while still strong enough that the results of the experiments are relevant? (I assume there will be emergent capabilities that need to be accounted for in the powerful-enough-to-be-useful-for-a-pivotal-act regime.)
Can you set up backup systems such that failures are setbacks rather than catastrophes? (See the sketch below for a rough picture of what I have in mind.)
So the simbox-guinea-pig plan seems hard, with many open questions, but not necessarily doomed.
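On the backup-systems point, here is a minimal sketch of the kind of checkpoint-and-rollback loop I mean. All names here (SimboxExperiment, run_episode, anomaly_detected, etc.) are hypothetical placeholders rather than any real API; the only point is that a detected failure rolls the experiment back to a known-good snapshot instead of ending the whole project.

```python
import copy

class SimboxExperiment:
    """Hypothetical harness: run contained simbox episodes with rollback on anomalies."""

    def __init__(self, env, agent):
        self.env = env
        self.agent = agent
        self.snapshots = []  # known-good (env, agent) states

    def checkpoint(self):
        # Deep-copy so later corruption cannot touch the stored snapshot.
        self.snapshots.append((copy.deepcopy(self.env), copy.deepcopy(self.agent)))

    def rollback(self):
        # Restore the most recent known-good state.
        self.env, self.agent = copy.deepcopy(self.snapshots[-1])

    def run(self, n_episodes, run_episode, anomaly_detected):
        self.checkpoint()
        for _ in range(n_episodes):
            logs = run_episode(self.env, self.agent)   # one contained experiment
            if anomaly_detected(logs):                 # e.g. an interpretability flag fired
                self.rollback()                        # setback, not catastrophe
            else:
                self.checkpoint()                      # extend the known-good frontier
```

Obviously this does nothing to answer the harder questions above (whether the anomaly detector can be trusted, whether the snapshots themselves are safe to inspect), but it shows the shape of "failures are setbacks" I have in mind.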
Additionally, I think the problems you aim an AGI at could matter a lot for how dangerous the responses you get back are. Some proposed solutions to posed problems will, by their nature, be much safer and easier to verify: for example, a plan for a new widget that can be built and tested under controlled lab conditions and explored thoroughly for booby traps or side effects. That’s a lot safer than a plan to, say, make a target individual the next president of the United States.
Edit: I forgot to add what I currently think is the biggest weakness of this plan: the very high alignment tax. How do you convince the inventor of this AI system not to deploy it, and instead to run lots of expensive testing on it and only deploy a safer version that might not be achieved for years yet? I think this is where demonstrative bad-behavior-prompting safety evaluations could help, by showing when a model is dangerous enough to warrant such a high level of caution.
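For concreteness, a bad-behavior-prompting evaluation could be as simple as the sketch below: feed the model a fixed battery of red-team prompts and report how often a separately written judge flags the response as dangerous. The callables `model` and `judge_is_dangerous`, and the threshold policy in the comment, are assumptions of mine, not any particular lab's methodology.

```python
def dangerous_behavior_rate(model, judge_is_dangerous, red_team_prompts):
    """Fraction of red-team prompts whose responses a judge flags as dangerous."""
    flagged = 0
    for prompt in red_team_prompts:
        response = model(prompt)
        if judge_is_dangerous(prompt, response):
            flagged += 1
    return flagged / len(red_team_prompts)

# Hypothetical use: pre-commit to a threshold, and treat crossing it as the
# evidence offered to the developer that the high-caution simbox-only regime
# is warranted for this model.
# if dangerous_behavior_rate(model, judge, prompts) > 0.01:
#     escalate_to_simbox_only_policy()  # placeholder for whatever the actual policy is
```

The persuasive work is done by the demonstration itself; the code is just the bookkeeping around it.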