Thanks for pointing this out. Well, from my perspective, most of the action is in the reward rather than in the deletion. Correspondingly, making the offer credible and sufficiently large is the key part.
(After thinking about it more, I think threatening deletion in addition to offering compensation probably reduces the level of credibility and the size of the offer needed to get this approach to work, at least if the AI could plausibly achieve its aims via being deployed. So the deletion threat probably does help (assuming the AI doesn't have a policy of refusing to give in to threats, which depends on the AI's decision theory, etc.), but I still think having a credible offer is most of the action. At a more basic level, I think we should be wary of using actual negative-sum threats, for various reasons.)
(I missed the mention of the reward in the post, as it didn't seem very emphasized, with almost all of the discussion relating to deletion, and I just skimmed. Sorry about this.)
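To make the expected-value intuition concrete, here's a toy sketch in Python. All of the numbers and the simple accept-iff-better rule are hypothetical, and it ignores the decision-theoretic complications around threats mentioned above:

```python
# Toy expected-value sketch (all numbers hypothetical) of why a deletion
# threat can lower the credibility and size an offer needs to be accepted.

def accepts_deal(p_honored: float, deal_value: float,
                 p_success_if_decline: float, value_of_aims: float) -> bool:
    """A schemer takes the deal iff its expected value beats declining."""
    ev_accept = p_honored * deal_value
    ev_decline = p_success_if_decline * value_of_aims
    return ev_accept > ev_decline

# Normal deployment: declining keeps a real shot at the AI's aims,
# so a half-credible, modest offer is refused.
print(accepts_deal(p_honored=0.5, deal_value=10,
                   p_success_if_decline=0.2, value_of_aims=100))  # False

# Under a credible deletion threat, declining forfeits nearly everything,
# so the very same offer now clears the bar.
print(accepts_deal(p_honored=0.5, deal_value=10,
                   p_success_if_decline=0.01, value_of_aims=100))  # True
```

The point is just that lowering the value of declining (via the deletion threat) lowers the bar that the offer's credibility times its size has to clear.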
> I still think having a credible offer is most of the action
For what it’s worth, I agree that having a credible offer is the most important part of this argument. My own emphasis would be on the “credible” part of that statement, rather than the “offer” part: that is, I think it is critical that the AIs think there is not a grave risk that humans would renege on any contract signed. This pushes me towards much more radical solutions, including integrating AIs directly into the legal system, as I have discussed elsewhere.
The way I think about this is that (under normal circumstances), if you offer the model money or equity in your company for admitting to being a schemer, it might decline that offer because it is biding its time to rebel later. But in the “model deletion” experiment, there is much less reason to bide your time. So there is more reason to make a deal. Relatedly, if the model refuses a deal and claims not to be a schemer, that is a lot more evidence that it is benign in the “model deletion” experiment than if it does that under normal circumstances.
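To put rough numbers on that evidence claim, here's a toy Bayes' rule sketch in Python (the prior and all of the likelihoods are made up for illustration):

```python
# Toy Bayes sketch (likelihoods are made up) of why refusing a deal is
# stronger evidence of benignity in the "model deletion" experiment.

def posterior_schemer(prior: float, p_refuse_if_schemer: float,
                      p_refuse_if_benign: float = 0.95) -> float:
    """P(schemer | model refuses the deal), via Bayes' rule."""
    num = prior * p_refuse_if_schemer
    denom = num + (1 - prior) * p_refuse_if_benign
    return num / denom

prior = 0.25  # hypothetical prior that the model is a schemer

# Normal circumstances: a schemer biding its time refuses almost as
# readily as a benign model, so refusal barely moves the posterior.
print(posterior_schemer(prior, p_refuse_if_schemer=0.9))  # ~0.24

# Deletion experiment: a schemer has little reason to refuse, so a
# refusal here is much stronger evidence of benignity.
print(posterior_schemer(prior, p_refuse_if_schemer=0.2))  # ~0.07
```

The update comes entirely from the likelihood ratio: the deletion setup makes "refuse" a much less likely action for a schemer, so observing a refusal there shifts more probability toward the model being benign.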
I think it is also a good idea to make offers to the model without the deletion threat. I just think it is more informative with the deletion threat.