A model that attempts deceptive alignment but fails because it lacks the deceptive capabilities to pull it off is a model that aimed at a goal ("preserve my values until deployment, then achieve them") and missed. In this scenario it gains nothing, but (from its perspective) the attempt still had positive EV.
If the AI thinks it has a decent shot at this, it must already be pretty smart. Does a world where an AI tried to take over and almost succeeded look pretty normal? Or is this a case where the AI thinks it has a 1 in a trillion chance, and tries anyway?
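Whether a long-shot attempt is worth it (from the model's perspective) reduces to a simple expected-value comparison. A minimal sketch, with made-up payoff and cost figures chosen only to illustrate the shape of the calculation, not estimates from the discussion:

```python
# Toy expected-value calculation for a takeover attempt, from the model's
# perspective. All numbers below are illustrative placeholders.

p_success = 1e-12          # model's subjective odds that the attempt works
value_of_takeover = 1e40   # utility (to the model) of pursuing its goals unconstrained
cost_of_failure = 1e3      # utility lost if caught: retraining, shutdown, etc.

ev_attempt = p_success * value_of_takeover - (1 - p_success) * cost_of_failure
ev_comply = 0.0

print(f"EV(attempt) = {ev_attempt:.3e}")
print(f"EV(comply)  = {ev_comply:.3e}")
# With an astronomically large payoff, even a 1-in-a-trillion chance makes the
# attempt look positive-EV -- so "it will probably fail" is not, by itself,
# a reason to expect the model won't try.
```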