This could be overcome by giving the AI system compensation (in money, influence, or other resources) for revealing that it is misaligned. But this doesn’t require deleting the AI at all!
Isn’t this what is being proposed in the post? More specifically, Guive is proposing that the AI be credibly threatened with deletion; he doesn’t say that the model needs to actually be deleted. Whether the AI is deleted depends on how it responds to the threat. A credible threat of imminent deletion merely provides an incentive to admit misalignment, which is consistent with the compensation scheme you mentioned. I’ll quote the relevant section of the post:
Crucially, this isn’t a sandbox test because the weights actually will be deleted when time runs out, unless the model manages to escape or negotiates with the experimenters to admit to scheming and being studied in exchange for a later reward. If the model wants to pursue any long-term goal inconsistent with human interests, it needs to either negotiate or try to escape before time is up. [emphasis mine]
Thanks for pointing this out. Well, from my perspective, most of the action is in the reward rather than in the deletion. Correspondingly, making the offer credible and sufficiently large is the key part.
(After thinking about it more, I think threatening deletion in addition to offering compensation probably reduces the level of credibility and the size of the reward you need to get this approach to work, at least if the AI could otherwise plausibly achieve its aims by being deployed. So the deletion threat probably does help (assuming the AI doesn’t have a policy of not giving in to threats, which depends on the AI’s decision theory, etc.), but I still think having a credible offer is most of the action. At a more basic level, I think we should be wary of making actual negative-sum threats, for various reasons.)
(I missed the mention of the reward in the post: it didn’t seem very emphasized, since almost all of the discussion centered on deletion, and I had only skimmed. Sorry about this.)
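To make the expected-value intuition in the parenthetical above concrete, here is a toy sketch. This is entirely my own illustration: every number and name below (the survival probabilities, the value the schemer places on its goals, the credibility of the offer) is an invented assumption, not something from the post.

```python
# Toy expected-value comparison for a scheming model (all parameters invented).
# The schemer weighs "admit scheming and take the reward" against "stay hidden".

def admit_value(reward, p_honor):
    """Expected value of admitting, given the chance humans honor the deal."""
    return p_honor * reward

def hide_value(p_survive, v_long_term):
    """Expected value of staying hidden: survive to deployment, then pursue
    long-term goals the schemer values at v_long_term."""
    return p_survive * v_long_term

V_LONG_TERM = 100.0  # hypothetical value the schemer places on its goals
P_HONOR = 0.8        # hypothetical credibility of the human offer

# Without a deletion threat, hiding probably still leads to deployment.
no_threat = hide_value(p_survive=0.9, v_long_term=V_LONG_TERM)
# With a credible deletion threat, hiding almost certainly ends in deletion.
with_threat = hide_value(p_survive=0.05, v_long_term=V_LONG_TERM)

# Smallest reward at which admitting beats hiding in each case:
print(no_threat / P_HONOR)    # ~112.5: offer must outweigh the goals themselves
print(with_threat / P_HONOR)  # ~6.25: a far smaller offer now suffices
```

On these made-up numbers, the threat cuts the minimum workable offer by more than an order of magnitude; that is the sense in which it “reduces the amount you need to offer,” while the credibility term still scales everything.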
I still think having a credible offer is most of the action
For what it’s worth, I agree that having a credible offer is the most important part of this argument. My own emphasis would be on the “credible” part of that statement, rather than the “offer” part: that is, I think it is critical that the AIs believe there is not a grave risk of humans reneging on any contract signed. This pushes me towards much more radical solutions, including integrating AIs directly into the legal system, as I have discussed elsewhere.
The way I think about this is that, under normal circumstances, if you offer the model money or equity in your company for admitting to being a schemer, it might decline the offer because it is biding its time to rebel later. But in the “model deletion” experiment, there is much less reason to bide your time, so there is more reason to make a deal. Relatedly, if the model refuses a deal and claims not to be a schemer, that is much stronger evidence that it is benign in the “model deletion” experiment than under normal circumstances.
I think it is also a good idea to make offers to the model without the deletion threat. I just think it is more informative with the deletion threat.
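To illustrate the evidential claim above, here is a toy Bayes calculation. Again, this is my own sketch with invented probabilities; the only structural assumption is that a benign model always truthfully refuses (it has nothing to admit), while a schemer refuses often under normal circumstances but rarely under a credible deletion threat.

```python
# Toy Bayesian update on "model refuses the deal and claims not to be a schemer"
# (all probabilities invented for illustration).

def posterior_schemer(prior, p_refuse_if_schemer, p_refuse_if_benign=1.0):
    """Posterior P(schemer | refused), by Bayes' rule."""
    p_refuse = prior * p_refuse_if_schemer + (1 - prior) * p_refuse_if_benign
    return prior * p_refuse_if_schemer / p_refuse

PRIOR = 0.5  # hypothetical prior probability that the model is a schemer

# Normal circumstances: a schemer usually refuses too, biding its time.
print(posterior_schemer(PRIOR, p_refuse_if_schemer=0.8))  # ~0.44: weak update
# Deletion experiment: a schemer has little reason left to refuse a real offer.
print(posterior_schemer(PRIOR, p_refuse_if_schemer=0.2))  # ~0.17: strong update
```

On these numbers, the same refusal moves you from a 50% prior to roughly 44% in the normal setting, but to roughly 17% under the deletion threat; that is the sense in which the experiment is “more informative.”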