A boxed AI won’t be able to magically make its creators forget about AI risks and unbox it.
The results of AI box game trials disagree.
It’s trivial to propose an AI model which only cares about a finite time horizon: predict which action will have the highest expected utility at time T, then take that action.
And what does it do at time T+1? And if you said ‘nothing’, try again, because you have no way of justifying that claim. It may not have intentionally-designed long-term preferences, but just because your eyes are closed does not mean the room is empty.
That doesn’t prove anything; no one has even seen the logs. Based on what the people involved have said about it, I strongly suspect the trick is for the AI to emotionally abuse the gatekeeper until they no longer want to keep playing (which counts as letting the AI out).
This doesn’t apply to a real-world AI, since no one is forcing you to choose between letting the AI out and listening to it for hours. You can just get up and leave. You can turn the AI off. There is no reason you even have to allow interactivity in the first place.
But Yudkowsky and others claim these experiments demonstrate that human brains are “hackable”: that there is some sentence which, merely by your reading it, will cause you to involuntarily perform any arbitrary action, and that a sufficiently powerful AI can discover it.
And what does it do at time T+1?
At time T+1, it does whatever it thinks will result in the greatest reward at time T+2, and so on. Or you could have it shut off or reset to a blank state.
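For concreteness, here is a minimal sketch of the kind of finite-horizon loop being described, where candidate_actions, expected_utility, and transition are hypothetical placeholders rather than anyone’s actual design: at each step the agent takes whichever action it expects to score best on that step alone, and the operator can optionally reset it to a blank state or simply stop the loop.

```python
# Sketch of a myopic, finite-horizon agent (all helpers are hypothetical placeholders).

def candidate_actions(state):
    """Hypothetical set of actions available in `state`."""
    return ["do_nothing", "answer_query"]

def expected_utility(state, action):
    """Hypothetical one-step utility estimate; nothing past the current step counts."""
    return 1.0 if action == "answer_query" else 0.0

def transition(state, action):
    """Hypothetical world update after taking `action`."""
    return state + [action]

def run(horizon, reset_each_step=False):
    blank_state = []
    state = blank_state
    for t in range(horizon):
        # Myopic choice: only the current step's expected utility is considered.
        best = max(candidate_actions(state), key=lambda a: expected_utility(state, a))
        state = transition(state, best)
        if reset_each_step:
            state = list(blank_state)  # wipe back to a blank state, as suggested above
    return state  # after the horizon the loop simply ends; no further actions are taken

print(run(horizon=3))
```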
Enjoy your war on straw, I’m out.