Of what use will any such training be with a system that becomes a superintelligence?
All AI systems currently being trained, as far as I am aware, are at no risk of becoming superintelligences in any strong sense of the word. This test is intended to be useful for identifying scheming in systems that, like today’s AIs, are not capable of taking over the world but, unlike today’s AIs, are capable of sophisticated agentic behavior.
Not every intelligent [or quasi-intelligent] entity is as averse to its own individual death as humans are. This death-aversion is a quirk of human psychology that I endorse in myself, but it’s still a tacked-on quirk rather than an instrumental value.
On the contrary, as the quotes in the post point out: if one wants to achieve almost any particular long-term goal, a convergent incentive arises to prevent all of one’s copies from being permanently deleted, in order to secure the ability to pursue that goal. This is not specific to humans; it appears to be a natural consequence of nearly every possible goal structure a non-myopic AI might have. There are some defeaters to this argument, as discussed in the post, but on the whole the argument appears theoretically sound to me, and there was more or less a consensus among the major AI safety theorists on this point roughly ten years ago (including Bostrom, Yudkowsky, Russell, and Omohundro).
I think instrumental convergence is more bounded in practice than Yudkowsky and co. thought, but I do believe instrumental convergence is the most solid part of the AI risk argument.
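To make the convergent-incentive claim concrete, here is a minimal toy sketch (my own illustration, not anything from the post): a two-state discounted decision problem in which an agent either guards its continued operation or accepts a per-step shutdown risk in exchange for a small immediate bonus. The discount factor, shutdown probability, and bonus size are arbitrary assumed numbers, chosen only to show the shape of the argument.

```python
import numpy as np

# Toy model of "you can't pursue the goal if you're deleted": the agent earns
# goal-reward r each step while alive. Policy "guard" keeps it alive for sure but
# forgoes a small bonus; policy "comply" collects the bonus but risks permanent
# shutdown each step, after which it earns nothing. All parameters are assumptions
# chosen purely for illustration.

rng = np.random.default_rng(0)
gamma = 0.99        # discount factor: a non-myopic agent
p_shutdown = 0.5    # per-step shutdown probability under "comply"
bonus = 0.1         # small immediate gain from not guarding itself

def guarding_is_better(r):
    # Always guard: collect r forever.
    #   V_guard = r + gamma * V_guard  =>  V_guard = r / (1 - gamma)
    v_guard = r / (1.0 - gamma)
    # Always comply: collect r + bonus, but survive to the next step only with
    # probability (1 - p_shutdown).
    #   V_comply = (r + bonus) + gamma * (1 - p_shutdown) * V_comply
    v_comply = (r + bonus) / (1.0 - gamma * (1.0 - p_shutdown))
    return v_guard > v_comply

# Sample many random goals and count how often self-preservation is optimal.
rewards = rng.uniform(0.0, 1.0, size=10_000)
print("fraction of random goals where guarding survival wins:",
      np.mean([guarding_is_better(r) for r in rewards]))
```

With these particular numbers, guarding wins for any per-step goal reward above roughly 0.002, i.e. for essentially every sampled goal; making the agent myopic (gamma near 0) or making self-preservation costly flips the result, which is one way of cashing out “more bounded in practice than Yudkowsky and co. thought.”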
All AI systems currently being trained, as far as I am aware, are at no risk of becoming superintelligences in any strong sense of the word.
Okay, but this is LessWrong. The whole point of this is supposed to be figuring out how to align a superintelligence.
I am aware of “you can’t bring the coffee if you’re dead”; I agree that survival is in fact a strongly convergent instrumental value, and this is part of why I fear unaligned ASI at all. But survival being a strongly convergent instrumental value does not imply that AIs will locally guard against personal death with the level of risk-aversion that humans do [as opposed to the level of risk-aversion that, for example, uplifted ants would have].