It seems plausible to me that there could be models capable enough to realize they should hide some capabilities, but not so capable that they tile the universe in paperclips. The right-hand side of the graph is meant to reflect such a model.
Maybe. What do the models gain by hiding?
A model that attempts deceptive alignment but fails because it is not competent at deception is a model that aimed at a goal ("preserve my values until deployment, then achieve them") and missed. In this scenario it gains nothing, but from its perspective the attempt had positive expected value.
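To make the expected-value claim concrete: even a vanishingly small success probability can dominate the calculation if the model assigns an enormous payoff to unconstrained goal achievement. A minimal sketch, where every number is a hypothetical assumption rather than an estimate from the discussion:

```python
# Hypothetical EV calculation for a deceptive-alignment gamble.
# All numbers are illustrative assumptions, not real estimates.
p_success = 1e-12        # model's subjective chance the attempt works
payoff_success = 1e30    # utility of achieving its goals unconstrained
payoff_failure = -1.0    # utility of being caught and retrained

ev_attempt = p_success * payoff_success + (1 - p_success) * payoff_failure
ev_comply = 0.0          # baseline: just behave as trained

print(ev_attempt > ev_comply)  # True: the gamble still looks worthwhile
```

The point is only that "1 in a trillion" does not settle the decision on its own; it depends entirely on how the model weighs the payoffs.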
If the AI thinks it has a decent shot at this, it must already be pretty smart. Does a world where an AI tried to take over and almost succeeded look pretty normal? Or is this a case where the AI thinks it has a 1 in a trillion chance, and tries anyway?