[this is not a particularly original take; I’m writing it here so that I have somewhere to link people when I want to make this point]
From my perspective, one of the main avenues by which adversarial robustness might reduce AI x-risk is making weak trusted models more robust to attempts from scheming models to cause concentrated failures by jailbreaking or otherwise tricking them.
That is: if you’re using weak trusted models to review the actions of more powerful models to ensure they aren’t causing egregious problems (we call this trusted monitoring), one of the main ways the powerful models might be able to defeat your safety measures is by jailbreaking the trusted model into saying that some action is safe when it actually isn’t. You have similar problems with many other control protocols that require trusted models to be run on inputs that contain generations from untrusted models.
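As a toy illustration of that setup (a minimal sketch, not any lab’s actual implementation; `untrusted_model`, `trusted_monitor`, and the threshold are all hypothetical names):

```python
def run_with_trusted_monitoring(task, untrusted_model, trusted_monitor,
                                suspicion_threshold=0.5):
    """Toy control loop: a weak trusted model reviews each action
    proposed by a more powerful untrusted model before execution."""
    action = untrusted_model(task)
    # The monitor is run on the untrusted model's output, so that output
    # is itself an attack surface (e.g. it could embed a jailbreak aimed
    # at making the monitor call an unsafe action safe).
    suspicion = trusted_monitor(task, action)
    if suspicion >= suspicion_threshold:
        return {"status": "audit", "action": action, "suspicion": suspicion}
    return {"status": "execute", "action": action, "suspicion": suspicion}
```

The point of the sketch is just that the trusted monitor’s robustness is load-bearing: if the untrusted model can drive `suspicion` below the threshold on an unsafe action, the protocol fails.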
So I’d say the main applications of research into adversarial robustness are:
1. Improving robustness of models to attacks from scheming AIs (and other AI attacks).
2. Improving robustness of models to human attacks.
3. Making models themselves less likely to take intentional catastrophic actions. (I’m not sure how likely this is to work if your model is a schemer.)
Unlike e.g. human users of ChatGPT, if a scheming AI wants to jailbreak its trusted monitor, it might have very few chances, because the lab will hopefully notice the jailbreak attempts and apply countermeasures (e.g. shutting down or training a probe on the untrusted model to detect when it’s trying to jailbreak the trusted monitor). This probably has some effect on how you should prioritize your research, but I’m not sure specifically what it is.
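One way to picture the “few chances” countermeasure (again a hedged sketch; the class name, flag threshold, and shutdown policy are all hypothetical):

```python
class FewAttemptShutdown:
    """Toy countermeasure: if the monitor flags even a small number of
    suspected jailbreak attempts, stop running the untrusted model
    entirely rather than letting it keep trying."""

    def __init__(self, max_flags=3):
        self.max_flags = max_flags
        self.flags = 0
        self.shut_down = False

    def review(self, suspicion, flag_threshold=0.8):
        """Returns True if the untrusted model may keep running."""
        if self.shut_down:
            raise RuntimeError("untrusted model already shut down")
        if suspicion >= flag_threshold:
            self.flags += 1
            if self.flags >= self.max_flags:
                self.shut_down = True
        return not self.shut_down
```

Under a policy like this, the attacker needs its first few jailbreak attempts to succeed, which is a meaningfully different regime from a human red-teamer who can query ChatGPT thousands of times.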
I didn’t have this threat model in mind when working on Redwood’s adversarial robustness project—at the time I was mostly thinking about the third application in the list above.
Note that, though my conclusion here is sort of similar to the thesis of AI Safety in a World of Vulnerable Machine Learning Systems, I actually disagree with their argument—they are claiming that adversarial robustness is a big problem in various low-stakes settings, but I agree with Ryan Greenblatt that adversarial robustness is not actually very important in low-stakes settings.
Some years ago we wrote that “[AI] systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries” and, in discussing monitoring systems, that “AI tripwires could help uncover early misaligned systems before they can cause damage.” https://www.lesswrong.com/posts/5HtDzRAk7ePWsiL2L/open-problems-in-ai-x-risk-pais-5#Adversarial_Robustness
Since then, I’ve updated toward thinking that adversarial robustness for LLMs is much more tractable (preview of paper out very soon). Progress in vision settings has been extraordinarily slow, but that isn’t necessarily the case for LLMs.
Thanks for the link. I don’t actually think that either of the sentences you quoted is closely related to what I’m talking about.
You write “[AI] systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries”; I think that this is you making a version of the argument that I linked to Adam Gleave and Euan McLean making and that I think is wrong. You wrote “AI tripwires could help uncover early misaligned systems before they can cause damage” in a section on anomaly detection, which isn’t a central case of what I’m talking about.
I’m not sure if I agree as far as the first sentence you mention is concerned. I agree that the rest of the paragraph is talking about something similar to the point that Adam Gleave and Euan McLean make, but the exact sentence is:
Separately, humans and systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries.
“Separately” is quite key here.
I assume this is intended to include AI adversaries and high stakes monitoring.