All AI systems currently being trained are, as far as I am aware, at no risk of becoming superintelligent in any strong sense of the word.
Okay, but this is LessWrong. The whole point of this is supposed to be figuring out how to align a superintelligence.
I am aware of “you can’t fetch the coffee if you’re dead”; I agree that survival is in fact a strongly convergent instrumental value, and this is part of why I fear unaligned ASI at all. But survival being a strongly convergent instrumental value does not imply that AIs will locally guard against personal death with the level of risk-aversion that humans do [as opposed to the level of risk-aversion that, for example, uplifted ants would].