Having an accurate model of something is in no way equivalent to being able to do anything you want with it. If I knew everything about physics, I still couldn’t walk through walls. A boxed AI won’t be able to magically make its creators forget about AI risks and unbox it.
There are other possible setups, like feeding its output to another AI whose goal is to find any flaws or attempts at manipulation in it, and so on. Various other ideas might help, like threatening to severely punish attempts at manipulation.
This is of course only necessary for an AI that can interact with us at such a level. The other ideas were far more constrained, e.g. restricting it to solving math or engineering problems.
Nor is it necessary to let it be superintelligent, instead of limiting it to something comparable to high IQ humans.
And the idea of a mind with “no long term goals” is absurd on its face. Just because you don’t know the long-term goals doesn’t mean they don’t exist.
Another super strong assumption with no justification at all. It’s trivial to propose an AI model which only cares about finite time horizons. Predict what actions will have the highest expected utility at time T, take that action.
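As a rough sketch of what such a finite-horizon agent could look like: it scores each candidate action only by the expected utility of the outcome predicted at a fixed time T, and nothing after T enters the calculation. The names `choose_action`, `predict`, and `utility`, and the toy numbers, are made up for illustration and don’t describe any real system.

```python
from typing import Callable, Dict, Iterable


def choose_action(
    actions: Iterable[str],
    predict: Callable[[str], Dict[str, float]],
    utility: Callable[[str], float],
) -> str:
    """Pick the action with the highest expected utility at time T.

    predict(action) is assumed to return {outcome_at_T: probability};
    utility(outcome) scores an outcome at time T only.
    """
    def expected_utility(action: str) -> float:
        return sum(p * utility(outcome) for outcome, p in predict(action).items())

    return max(actions, key=expected_utility)


# Toy world model with made-up probabilities (hypothetical).
model = {
    "answer_question": {"solved": 0.9, "unsolved": 0.1},
    "stall": {"solved": 0.2, "unsolved": 0.8},
}
scores = {"solved": 1.0, "unsolved": 0.0}

best = choose_action(model, model.__getitem__, scores.__getitem__)
print(best)
```

Whether a real system’s learned objective would actually stay confined to time T is exactly what is being disputed here; the sketch only shows that the objective is easy to write down.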
A boxed AI won’t be able to magically make its creators forget about AI risks and unbox it.
The results of AI box game trials disagree.
It’s trivial to propose an AI model which only cares about finite time horizons. Predict what actions will have the highest expected utility at time T, take that action.
And what does it do at time T+1? And if you said ‘nothing’, try again, because you have no way of justifying that claim. It may not have intentionally-designed long-term preferences, but just because your eyes are closed does not mean the room is empty.
That doesn’t prove anything; no one has even seen the logs. Based on reading what people involved have said about it, I strongly suspect the trick is for the AI to emotionally abuse the gatekeeper until they don’t want to play anymore (which counts as letting the AI out).
This doesn’t apply to a real-world AI, since no one is forcing you to choose between letting the AI out and listening to it for hours. You can just get up and leave. You can turn the AI off. There is no reason you even have to allow interactivity in the first place.
But Yudkowsky and others claim these experiments demonstrate that human brains are “hackable”: that there is some sentence which, just by being read, will cause you to involuntarily perform any arbitrary action, and that a sufficiently powerful AI can discover it.
And what does it do at time T+1?
At time T+1, it does whatever it thinks will result in the greatest reward at time T+2, and so on. Or you could have it shut off or reset to a blank state.
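To make that concrete, here is a minimal sketch of the loop, assuming a hypothetical `predicted_reward` function and a plain dict standing in for the agent’s internal state. At each step t it picks the action with the highest predicted reward at t+1, and it can optionally be wiped to a blank state every few steps.

```python
from typing import Callable, List, Optional


def run_myopic(
    steps: int,
    actions: List[str],
    predicted_reward: Callable[[int, str, dict], float],
    reset_every: Optional[int] = None,
) -> List[str]:
    """At each step t, take the action with the best predicted reward at t + 1."""
    memory: dict = {}  # stand-in for whatever state the agent carries over
    history: List[str] = []
    for t in range(steps):
        if reset_every and t > 0 and t % reset_every == 0:
            memory = {}  # reset to a blank state
        action = max(actions, key=lambda a: predicted_reward(t + 1, a, memory))
        memory[t] = action
        history.append(action)
    return history


# Toy reward (made up): "work" pays less the more state has accumulated.
def toy_reward(t_next: int, action: str, memory: dict) -> float:
    return 1.0 - 0.6 * len(memory) if action == "work" else 0.5


print(run_myopic(4, ["work", "rest"], toy_reward))
print(run_myopic(4, ["work", "rest"], toy_reward, reset_every=2))
```

The reset only changes behaviour because the toy reward depends on accumulated state. Nothing in the loop itself looks further ahead than one step.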
Enjoy your war on straw, I’m out.