Suppose that when all is said and done they see the following:
And my interpretation of that graph is “the rightmost datapoint has been gathered, and the world hasn’t ended yet. What the ????”
Deception is only useful to the extent that it achieves the AI’s goals. And I think the main way deception helps an AI is in helping it break out and take over the world. Now maybe the AIs on the right have broken out and are just perfecting their nanobots. But in that case, I would expect the graph, the paper the graph is in, and most of the rest of the internet to have been compromised into whatever tricks humans into acting in whichever direction is convenient for the AI.
And as for “if we see X, that means the data source has been optimized to make us do Y”: well, if we decide not to do Y in response to seeing X, then the optimizer won’t show us X. We could resolve to “never trust the data source, and never ever do Y, whatever it tells us,” but that isn’t practical when the data source is a future superintelligence plus the whole internet.
It seems plausible to me that there could be models capable enough to realize they should hide some capabilities, but not so capable that they tile the universe in paperclips. The right-hand side of the graph is meant to reflect such a model.
Maybe. What do the models gain by hiding?
A model that attempts deceptive alignment but fails because it lacks the deceptive capabilities to pull it off is a model that aimed at a goal (“preserve my values until deployment, then achieve them”) and missed. In this scenario it doesn’t gain anything, but (from its perspective) the attempt still has positive EV.
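To make the “positive EV” claim concrete, here is a toy calculation with entirely made-up numbers: suppose the model assigns a one-in-a-trillion probability to the attempt succeeding, values success at $10^{20}$ in its own units, and rates getting caught and retrained at $-1$. Then

$$\mathbb{E}[\text{attempt}] = 10^{-12}\cdot 10^{20} \;-\; (1-10^{-12})\cdot 1 \;\approx\; 10^{8} - 1 \;>\; 0,$$

so even an attempt the model itself thinks is absurdly unlikely to work can look worth making, provided the payoff it assigns to success is large enough.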
If the AI thinks it has a decent shot at this, it must already be pretty smart. Does a world where an AI tried to take over and almost succeeded look pretty normal? Or is this a thing where the AI thinks it has a 1 in a trillion chance, and tries anyway?