Post summary (feel free to suggest edits!): Various people have proposed variants of “align AGI by making it sufficiently uncertain about whether it’s in the real world versus still in training”. This seems unpromising because bad outcomes could still occur even if the AGI were convinced, and convincing it would be difficult.
Non-exhaustive list of ways it could tell it’s in reality:
Reality is large (eg. some things are possible that couldn’t easily be spoofed, such as access to more compute)
It’s the first place the AI’s history could show interaction with other complex systems (eg. humans reacting at scale to the AI’s actions)
It might notice the world it’s in is of the type that’s less likely to be simulated (eg. consistent physics, no bugs)
If you can understand the contents of the AI’s mind well enough to falsify every check it could run to distinguish simulation from reality, then you could use that knowledge to build a friendly AI that doesn’t need to be fooled in the first place.
(If you’d like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)