At AGI level, I would much rather be working with a model that genuinely, selflessly cares only about the welfare of all humans and wants to do the right thing for them (not a common mentality in the training set), than one that's merely pretending to and actually wants something else. At ASI level, I'd say this is essential: I don't see how you can expect to be reliably confident that you can box, control, or contain an ASI. (Obviously, if you had a formal proof that your cryptographic box was inescapable, then the only remaining questions would be your assumptions, any side-channels you hadn't accounted for, or outside help, but in a situation like that I don't see how you get useful work in and out of the box without creating side-channels.)