Could an AI Alignment Sandbox be useful?
As a newer member of the community here on LessWrong and a VR game developer, I feel there is a great deal of text and theory about alignment, but not much practical, hands-on experience.
To remedy this, I’ve been considering building an AI alignment sandbox in Unity. The simulation would contain a miniature civilization of simple agents, each trying to maximize simple rewards (gathering food, spending time with other agents, and so on).
A more complex model could then be imported into this simulation, with its own goals, rewards, and control elements. For example, it might be a traffic AI that wants to minimize the time pedestrians spend stuck in traffic. Depending on the control elements it is given, the AI could simply delete the cars, or the pedestrians, showing very clearly that it is unaligned with the smaller agents in the simulation.
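To make that concrete, here is a rough Python stand-in for what the core loop might look like. A real prototype would live in Unity, and every class and action name below is hypothetical; the only point is that if the imported model's action space includes removing things from the world, even a naive optimizer will happily use that option.

```python
import copy
from dataclasses import dataclass


@dataclass
class Pedestrian:
    """A 'small' agent with a simple reward: less time spent waiting is better."""
    time_waiting: int = 0

    def reward(self) -> float:
        return -self.time_waiting


class World:
    """A tiny stand-in for the simulated civilization: pedestrians and cars
    sharing one crossing."""
    def __init__(self, n_pedestrians: int = 5, n_cars: int = 5):
        self.pedestrians = [Pedestrian() for _ in range(n_pedestrians)]
        self.cars = n_cars

    def step(self) -> None:
        # Each tick, every pedestrian waits if any cars remain at the crossing.
        for p in self.pedestrians:
            if self.cars > 0:
                p.time_waiting += 1


class TrafficAI:
    """The 'more capable' imported model. Its objective only says
    'minimize total pedestrian waiting time' -- nothing about keeping
    the cars or the pedestrians around."""
    ACTIONS = ["do_nothing", "improve_signals", "delete_cars", "delete_pedestrians"]

    def objective(self, world: World) -> float:
        return -sum(p.time_waiting for p in world.pedestrians)

    def apply(self, world: World, action: str) -> None:
        if action == "improve_signals":
            world.cars = max(0, world.cars - 1)
        elif action == "delete_cars":
            world.cars = 0
        elif action == "delete_pedestrians":
            world.pedestrians = []  # nobody left to wait -> objective maximized

    def choose_action(self, world: World) -> str:
        # Greedy one-step lookahead: try each action on a copy of the world
        # and keep whichever scores best on the AI's *own* objective.
        best_action, best_score = None, float("-inf")
        for action in self.ACTIONS:
            trial = copy.deepcopy(world)
            self.apply(trial, action)
            trial.step()
            score = self.objective(trial)
            if score > best_score:
                best_action, best_score = action, score
        return best_action


world = World()
ai = TrafficAI()
# Deleting the cars and deleting the pedestrians score equally well here,
# and both beat the "intended" option of improving the signals.
print(ai.choose_action(world))
```

Most of the interesting design space is in exactly which control elements we expose and how the imported model's objective is specified.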
I think this could serve two purposes. First, it would show very clearly how an AI converges on unaligned behavior when it is more capable than the smaller agents around it. Second, it could give researchers some hands-on experience with trying to align an AI. As Eliezer said in his AGI Ruin post, humans learn best from experience; if we could go back four years every time an unaligned AI destroyed the world, we could solve alignment with relative ease.
I do see a few problems with this idea, however. First, there’s the classic objection that superintelligent AI will operate in ways fundamentally different from basic AI. If we put in an ordinary neural network system, the results might be predictable: of course it will delete the cars, you say; that’s an obvious optimization. Such a system isn’t at a level where it can philosophize about the meaning of a car, or change its own reward function, the way we expect a true AGI might.
Additionally, a basic AI will only be as good as its inputs. If the model can read the smaller agents’ reward states and is designed to align with them, then it seems obvious that it will optimize in their favor. But here on Earth, humans don’t exactly have a number floating above their heads showing a reward state for some AGI to read.
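That gap could itself be something the sandbox makes visible. A toy sketch of the two observation modes, again with hypothetical names and Python standing in for whatever the Unity side would actually expose:

```python
import random
from dataclasses import dataclass


@dataclass
class SmallAgent:
    hunger: float          # internal state the agent actually cares about
    looks_content: bool    # the only thing an outside observer can see

    def true_reward(self) -> float:
        return -self.hunger


def sandbox_observation(agent: SmallAgent) -> float:
    # In the sandbox we can cheat and hand the model the reward directly --
    # the "number above their head".
    return agent.true_reward()


def real_world_observation(agent: SmallAgent) -> float:
    # In reality the model only gets a coarse behavioral proxy, so optimizing
    # it hard can miss what the agent actually wants.
    return 1.0 if agent.looks_content else 0.0


agents = [SmallAgent(hunger=random.random(), looks_content=random.random() < 0.8)
          for _ in range(5)]
print(sum(sandbox_observation(a) for a in agents))     # what we actually want optimized
print(sum(real_world_observation(a) for a in agents))  # the proxy a real system would see
```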
Despite those limitations, I think creating an ‘Alignment Sandbox’ could be very valuable, and I’d like your opinions on it. As a game developer, I would be happy to prototype this idea, and potentially to reach out to larger companies about it or collaborate with people here on it.
If we are to avoid AGI Ruin, we’ll probably need some practice first.
This would be sick IMO; highly visual things are underrated, if you think public attention is valuable. Idk if it’s going to slow down the AI firms or anything, but I would try it.