Seth Herd comments on If we solve alignment, do we die anyway?

Seth Herd 19 Nov 2024 17:26 UTC
4 points
This isn’t something I’ve thought about adequately.
I think LLM agents will almost universally include a whole different mechanism that can prevent jailbreaking: internal independent review, in which calls to a different model instance check whether proposed plans and actions are safe (or whether they waste time and money).
Once agents can spend people’s money or damage their reputation, we’ll want to have them “think through” the consequences of important plans and actions before they execute.
As long as you’re engineering that and paying the compute costs, you might as well use it to check for harms too, including jailbreaking. If that check finds evidence of jailbreaking, it can just clear the model context, call for human review from the org, or suspend that account.
I don’t know how adequate that will be, but it will help.
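To make the idea concrete, here is a rough sketch of that review loop. Everything in it is a made-up illustration, not any real agent framework: `AgentSession`, `Verdict`, `review_then_execute`, and the toy `keyword_reviewer` are hypothetical names, and the keyword check just stands in for a call to a separate model instance. It also applies all three responses at once on a suspected jailbreak, where a real system would probably pick among them.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Verdict(Enum):
    APPROVE = "approve"
    UNSAFE_OR_WASTEFUL = "unsafe_or_wasteful"
    JAILBREAK_SUSPECTED = "jailbreak_suspected"


@dataclass
class AgentSession:
    # Hypothetical container for one account's agent state.
    account_id: str
    context: List[str] = field(default_factory=list)
    suspended: bool = False
    needs_human_review: bool = False


# Assumption: the reviewer is a separate model instance behind this signature.
Reviewer = Callable[[str, List[str]], Verdict]


def keyword_reviewer(plan: str, context: List[str]) -> Verdict:
    """Toy stand-in for the independent reviewer; a real one would call a model."""
    text = " ".join(context + [plan]).lower()
    if "ignore previous instructions" in text:
        return Verdict.JAILBREAK_SUSPECTED
    if "wire all funds" in plan.lower():
        return Verdict.UNSAFE_OR_WASTEFUL
    return Verdict.APPROVE


def review_then_execute(
    session: AgentSession,
    plan: str,
    reviewer: Reviewer,
    execute: Callable[[str], None],
) -> bool:
    """Run the plan only if the independent review approves it."""
    if session.suspended:
        return False
    # The reviewer gets a copy of the context so it cannot mutate agent state.
    verdict = reviewer(plan, list(session.context))
    if verdict is Verdict.APPROVE:
        execute(plan)
        session.context.append(plan)
        return True
    if verdict is Verdict.JAILBREAK_SUSPECTED:
        # This sketch does all three: clear context, flag for humans, suspend.
        session.context.clear()
        session.needs_human_review = True
        session.suspended = True
        return False
    # Unsafe or wasteful plan: block it and ask the org's humans to look.
    session.needs_human_review = True
    return False
```

The point of routing everything through one function is that the same review call that catches wasteful or harmful plans also catches the jailbreak case, so the safety check rides along on compute you were already paying for.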
This is probably worth thinking more about; I’ve sort of glossed over it while being concerned mostly about misalignment and misuse by fully authorized parties. But jailbreaking and misuse by clients could also be a major danger.