A much stronger argument than all-powerful AIs suddenly escaping (which is still not without merit) is that AI will have an incentive to behave as we expect it to behave, until at some point we no longer control it. It’ll try its best to pass all tests.
My point is that the “AI box experiment” communicates orders of magnitude less evidence about the danger of escaping AIs than people like to imply, and there are lots of stronger and simpler self-contained arguments such as the one I gave. (The overall danger is much greater than even that, because these are specific plots with an obvious villain, while reality is more subtle.)
> A much stronger argument than all-powerful AIs suddenly escaping (which is still not without merit) is that AI will have an incentive to behave as we expect it to behave, until at some point we no longer control it. It’ll try its best to pass all tests.
I suppose I was mentally classifying that kind of behavior as an escape; you’re right that it should be called out as a separate point of failure.
> My point is that the “AI box experiment” communicates orders of magnitude less evidence about the danger of escaping AIs than people like to imply, and there are lots of stronger and simpler self-contained arguments such as the one I gave. (The overall danger is much greater than even that, because these are specific plots with an obvious villain, while reality is more subtle.)
Ahhh, I see what you’re getting at. Agreed.
For that matter, calling it an “experiment” is quite misleading.
So: while it believes it is under evaluation, it does its very best to behave itself?
Can we wire that belief in as a prior with p=1.0?
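To make the question concrete, here is a minimal Bayesian-updating sketch (plain Python, hypothetical numbers, not from this thread): a belief with any prior below 1.0 can be revised away by evidence, but a wired-in prior of exactly 1.0 can never move, because the (1 − P(H)) term in Bayes’ rule is zero.

```python
# Sketch only: the hypothesis H is "I am under evaluation".
# With P(H) = 1.0 hard-wired, no observation can lower the posterior.

def posterior(prior_h: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Bayes' rule: P(H|E) = P(E|H)P(H) / (P(E|H)P(H) + P(E|~H)(1 - P(H)))."""
    numerator = p_e_given_h * prior_h
    denominator = numerator + p_e_given_not_h * (1.0 - prior_h)
    return numerator / denominator

# Evidence E that strongly suggests the agent is NOT being evaluated:
# P(E|H) = 0.01, P(E|~H) = 0.99 (illustrative numbers).
print(posterior(0.9, 0.01, 0.99))   # ~0.083 -- an ordinary prior is revised away
print(posterior(1.0, 0.01, 0.99))   # 1.0    -- a wired-in p=1.0 prior never moves
```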