Ah, yes. That is quite a set of jailbreak techniques. When I say “alignment is solved for LLM agents”, I mean something different from what people mean by alignment for LLMs themselves.
I’m using alignment to mean AGI that does what its user wants. You are totally right that there’s an edge case and a problem if the principal “user”, the org that created this AGI, wants to sell access to others and have the AGI not follow all of those users’ instructions/desires. Which is exactly what they’ll want.
More in the other comment. I haven’t worked this through. Thanks for pointing it out.
This might mean that an org that develops LLM-based AGI systems can’t really widely license use of that system, and would have to design deliberately less capable systems. Or it might mean that they’ll put in a bunch of stopgap jailbreak prevention measures and hope they’re adequate when they won’t be.
I need to think more about this.
This topic is quite interesting to me from the perspective of human survival, so if you do decide to make a post specifically about preventing jailbreaking, please tag me (somewhere) so that I can read it.