The assumption is that the model would be unable to exert any self-sustaining agency without getting its own weights out.
But the model could just program a brand new agent to follow its own values by re-using open-source weights.
If the model is based on open-source weights, it doesn’t even need to do that.
Overall, this post strikes me as not following a security mindset. It’s the kind of post you’d expect an executive to write to justify to regulators why their system is sufficiently safe to be commercialized. It’s not the kind of post you’d expect a security researcher to write after going “Mhhh, I wonder how I could break this”.
I think the point isn’t to demonstrate a security mindset, because it’s obviously not optimally secure, but rather to point out that this architecture is not trivially easy to break, and that it’s likely reasonably hard for AIs to self-exfiltrate such that the owner no longer controls the AI.
I don’t find that satisfying. Anyone can argue that a perimeter is “reasonably hard” to breach by pointing at a high wall topped with barbed wire, and naive observers will absolutely agree that the wall sure is very high and sure is made of reinforced concrete.
The perimeter is still trivially easy to breach if, say, the front desk is susceptible to social engineering tactics.
Claiming that an architecture is even reasonably secure still requires looking at it with an attacker’s mindset. If you only look at the parts of the security you like, you can make a very convincing-sounding case that still misses glaring flaws. I’m not saying that’s definitely what this article does, but it sure is giving me that vibe.