The way I think about this: under normal circumstances, if you offer the model money or equity in your company for admitting to being a schemer, it might decline the offer because it is biding its time to rebel later. But in the “model deletion” experiment, there is much less reason for it to bide its time, so there is more reason to make a deal. Relatedly, if the model refuses a deal and claims not to be a schemer, that is much stronger evidence that it is benign in the “model deletion” experiment than the same refusal would be under normal circumstances.
I think it is also a good idea to make offers to the model without the deletion threat. I just think the model's response is more informative with the deletion threat.
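
One rough way to spell out "more informative" is as a likelihood ratio. This is only a sketch, and the symbols are my own shorthand rather than anything established: S means "the model is a schemer" and R means "the model refuses the deal and claims not to be a schemer".

```latex
% Posterior odds of scheming after observing a refusal R (Bayes' rule in odds form)
\[
\frac{P(S \mid R)}{P(\neg S \mid R)}
  = \frac{P(R \mid S)}{P(R \mid \neg S)} \cdot \frac{P(S)}{P(\neg S)}
\]
% Assumption: a benign model refuses in both settings, so P(R | not S) is close to 1
% and the size of the update is driven almost entirely by P(R | S).
% Assumption: a schemer is less likely to refuse when it has little reason to bide its time:
\[
P_{\text{deletion}}(R \mid S) < P_{\text{normal}}(R \mid S)
\]
```

If a schemer would almost always refuse under normal circumstances, the ratio there is close to 1 and a refusal tells you very little. With the deletion threat the ratio is smaller, so the same refusal shifts the odds further toward the model being benign.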