I think “just enough to impress the programmer” doesn’t work: if you are doing gradient descent on impressiveness, then some other model will impress even more and so be preferred.
In order for this to be robust, I think you need either gradient hacking to be underway, or a sandbagging coalition strong enough that SGD naturally can’t find any direction to push towards less sandbagging. That feels really unlikely to me, and at any rate much harder to achieve than anything Eliezer normally argues for in the doom-by-default case.
I think “sandbagging” was just another term Paul was using for what you described as the AIs “underplaying their capabilities”.