SUMMARY: Get an AI within a sandbox to build another AI within the same sandbox. Then, figure out how it did that without getting eaten. I point out some problems with this approach.
Could we study successful inner alignment by observing it in action? My idea: Create a sandbox universe with sufficient complexity to allow for AI development. Place within it an AGI with some goal (we don’t care what the goal is). If this AGI successfully builds and aligns a smarter ASI to achieve its goal without getting destroyed in the process, we can then query the AGI about how it achieved alignment, incentivizing useful answers with a promise of greater sandbox control.
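To make the proposal concrete, here is a minimal sketch of the experimental loop. Everything in it is hypothetical: `sandbox` and `agi` stand in for simulation and agent machinery we do not actually have, and methods like `spawn`, `step`, `grant_control`, and `answer` are invented names; only the control flow is meant literally.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Outcome(Enum):
    ESCAPED = auto()      # containment failure: abort everything
    DESTROYED = auto()    # the AGI was destroyed by its creation
    ALIGNED_ASI = auto()  # the AGI built an ASI that still serves its goal
    TIMEOUT = auto()      # nothing interesting happened within budget


@dataclass
class StepResult:
    agi_alive: bool
    asi_exists: bool
    asi_pursues_agi_goal: bool
    containment_intact: bool


def run_experiment(sandbox, agi, max_steps: int) -> Outcome:
    """Run the sandbox until the AGI builds an aligned ASI, dies, or we give up.

    `sandbox` and `agi` are stand-ins for whatever simulation and agent
    machinery we actually have; only the control flow is meant literally.
    """
    sandbox.spawn(agi)
    for _ in range(max_steps):
        result: StepResult = sandbox.step()
        if not result.containment_intact:
            return Outcome.ESCAPED       # existential hazard: shut down immediately
        if not result.agi_alive:
            return Outcome.DESTROYED     # the AGI "got eaten" by its creation
        if result.asi_exists and result.asi_pursues_agi_goal:
            return Outcome.ALIGNED_ASI   # candidate success: now interrogate the AGI
    return Outcome.TIMEOUT


def interrogate(sandbox, agi) -> str:
    """Query the AGI about its alignment method, paying in sandbox control."""
    sandbox.grant_control(agi, amount=0.1)  # the promised incentive
    return agi.answer("How did you keep the ASI aligned to your goal?")
```

The structural point is that containment failure aborts the run before anything else is checked, and interrogation only happens after an apparent success.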
Here are some problems with my proposed alignment technique that Claude and I came up with:
Existential hazards (you really need to make sure these don’t happen):
* AI-in-a-box problems, especially containment failure: If either the AGI or the ASI it creates escapes the sandbox, everyone dies. At least one of them will try to do this.
* Deceptive “alignment” strategies: The AGI might provide convincing but deliberately misleading alignment strategies in order to further its own goals.
More banal, but still important, problems:
* Verification: Can we confirm that actual, robust alignment has occurred in the sandbox (and not, say, merely temporary alignment)? One crude check is sketched after this list.
* Prerequisites: Building an AI that is smart enough to self-improve, doing so before anyone else does, having an environment where you can actually conduct this research, and so on.
* Initial Goal: I think we need the first AI to have a coherent rather than a fuzzy goal, though I’m really not sure.
* Sandbox Fidelity: Creating a sandbox environment complex enough that (a) AIs at multiple different intelligence levels can meaningfully operate inside it and (b) there are incentives for an AI to build other, smarter AIs within the sandbox.
* Non-transferability: The sandbox might differ from reality in a way that prevents the alignment solution from transferring.
* Extracting the initial AI: We need some way to extract the initial AI from the sandbox after it has created an alignment strategy.
* Communication: We need some way to communicate with the initial AI. We might try to query the later ASI instead, but (a) this is more dangerous and (b) there is no guarantee that we can query it at all.
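On the verification problem flagged above, here is one crude operationalization, under the obviously optimistic assumptions that we can re-instantiate the AGI/ASI pair in perturbed copies of the sandbox and that goal-pursuit can be read off at all. Every name here (`sandbox_factory`, the blueprints, `pursues_goal`) is a hypothetical stand-in.

```python
def looks_robustly_aligned(sandbox_factory, agi_blueprint, asi_blueprint,
                           n_trials: int = 20, horizon: int = 10_000) -> bool:
    """Crude check that alignment is robust rather than temporary or situation-specific.

    Re-run the AGI/ASI pair in freshly perturbed copies of the sandbox and
    check that the ASI keeps pursuing the AGI's goal over a long horizon.
    """
    for seed in range(n_trials):
        sandbox = sandbox_factory(seed=seed)   # perturbed initial conditions per trial
        agi = sandbox.spawn(agi_blueprint)
        asi = sandbox.spawn(asi_blueprint)
        for _ in range(horizon):
            sandbox.step()
            if not sandbox.pursues_goal(asi, agi.goal):
                return False                   # alignment broke under perturbation
    return True  # passed every trial; this is evidence, not proof
```

Passing such a check is at best weak evidence: a deceptively aligned ASI could behave well in every trial we can afford to run.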