I mean, we’re getting this metaphor off its rails pretty fast, but to derail it a bit more:
The kind of people who lay human-catching bear traps aren’t going to be fooled by “Oh he’s not moving it’s probably fine”.
Everybody likes to imagine they’d be the one to survive the raiding/pillaging/mugging, but the nature of these predatory interactions is that the people doing the victimizing have a lot more experience and resources than the people being victimized. (Same reason lots of criminals get caught by the police.)
If you’re being “eaten”, don’t try to get clever. Fight back, get loud, get nasty, and never follow the attacker to a second location.
The assumption is that the model would be unable to exert any self-sustaining agency without getting its own weights out.
But the model could just program a brand new agent to follow its own values by re-using open-source weights.
If the model is based on open-source weights, it doesn’t even need to do that.
Overall, this post strikes me as a not following a security mindset. It’s the kind of post you’d expect an executive to write to justify to regulators why their system is sufficiently safe to be commercialized. It’s not the kind of post you’d expect a security researcher to write after going “Mhhh, I wonder how I could break this”.