Thanks for the detailed thoughts! I like the overall plan, especially using KYC, human monitoring, and a separate model for bio. I’d like to point out that this plan still uses automated monitoring systems, and to the extent these systems aren’t adversarially robust, the plan will be more expensive and/or less effective.
The only plans that wouldn’t benefit from adversarial robustness are those without automated monitoring. For example, humans could be hired to manually monitor each and every query for attempted misuse. Let’s consider the viability of fully manual monitoring from a financial perspective, and then we can think about how much additional benefit would be provided by automated robust monitoring systems.
First, let’s price out manual monitoring for ChatGPT Plus. The subscription costs $20/month. Suppose the average person makes 10 queries per day, or 300 queries per month, and that it takes 15 seconds for a human monitor to read a query and flag it as misuse. Wages of $5/hour would mean comprehensive human monitoring costs $6.25 per user per month, and wages of $10/hour and $15/hour would translate to monthly per user costs of $12.50 and $18.75 respectively. The cost of full manual monitoring on ChatGPT Plus would therefore amount to much or most of its overall revenue.
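To make the arithmetic easy to check, here's a minimal sketch of that cost model in Python (the query volume, review time, and wages are just the assumptions stated above, not measured data):

```python
# Back-of-the-envelope cost of fully manual monitoring for a ChatGPT Plus user,
# using the assumptions stated above (10 queries/day, 15s of review per query).
queries_per_month = 10 * 30                     # 300 queries per user per month
review_hours = queries_per_month * 15 / 3600    # 1.25 hours of reviewer time
subscription_price = 20.0                       # $/month

for wage in (5, 10, 15):                        # reviewer wage in $/hour
    monthly_cost = review_hours * wage
    print(f"${wage}/hr: ${monthly_cost:.2f} per user per month "
          f"({monthly_cost / subscription_price:.0%} of the subscription)")
# $5/hr -> $6.25, $10/hr -> $12.50, $15/hr -> $18.75
```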
Second, we can anchor on Google Search. This assumes that Google is a reasonable reference point for the eventual volume, expenses, and revenues of an AI provider, which might not be a safe assumption in several ways. Nevertheless, Google handles roughly 3 trillion searches per year (estimates vary by source). If monitoring a single search costs $0.01 (e.g. someone earning $5/hour who monitors 500 searches per hour), then it would cost ~$30B to monitor every search. Would $30B in monitoring costs be financially acceptable? Google Search had revenues of $162B last year, representing 57% of Google’s total revenue. They don’t report expenses for Search specifically, but their overall expenses were $207B. If we assume Search also accounts for 57% of expenses, then Search would have $118B in annual expenses against $162B in revenue. Manual monitoring would cost $30B, and would therefore eliminate roughly two-thirds of Google’s Search profits.
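The Google Search anchor works out as follows; this is a sketch under the same rough assumptions, not authoritative financials:

```python
# Back-of-the-envelope version of the Google Search anchor.
searches_per_year = 3e12        # ~3 trillion searches/year (estimates vary)
cost_per_search = 0.01          # $5/hour reviewer handling 500 searches/hour
monitoring_cost = searches_per_year * cost_per_search        # $30B

search_revenue = 162e9          # reported Search revenue
search_expenses = 207e9 * 0.57  # total expenses scaled by Search's revenue share, ~$118B
search_profit = search_revenue - search_expenses              # ~$44B

print(f"monitoring: ${monitoring_cost / 1e9:.0f}B, "
      f"estimated Search profit: ${search_profit / 1e9:.0f}B, "
      f"profit consumed: {monitoring_cost / search_profit:.0%}")
# -> monitoring: $30B, estimated Search profit: $44B, profit consumed: ~68%
```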
So these costs would not be prohibitive, but they’d be a large share of overall revenues and profits. A safety-minded company like Anthropic might pay for manual monitoring, but other companies and their investors might be strongly opposed to paying such a high price. They could argue that, just as gun manufacturers are not held liable for murders, AI providers should not have to spend billions to prevent deliberate misuse.
Fortunately, we can reduce the cost of monitoring in many ways. Randomly sampling a small fraction of queries would reduce costs, but also reduce the likelihood of identifying misuse. Flagging keywords like “virus” would catch unsophisticated misuse, but could be evaded (e.g. discussions in a variant of pig latin).
Ideally, you’d be able to use AI systems to identify suspicious queries for human monitoring, but those systems would only be effective to the extent that they’re adversarially robust. If 99% of queries can be reliably discarded as safe, then manual monitoring costs would fall by 99%. But adversarial attacks often succeed in 50% or 100% of attempts against various detection systems. Deploying these unreliable systems would not decrease the costs of manual monitoring much without a corresponding drop in performance.
Overall, I appreciate your point that there are many layers of defense we can use to detect and prevent misuse. Fully manual monitoring might be possible, but it would have a huge financial cost. Many companies would be reluctant or unable to pay that price. Robust automated monitoring systems could reduce the cost of monitoring by 90% or 99%, but this would likely require improvements upon today’s state of the art.
I’d like to point out that this plan still uses automated monitoring systems, and to the extent these systems aren’t adversarially robust, the plan will be more expensive and/or less effective.
I agree that additional adversarial robustness (for both the policy and the monitor) helps considerably, but I think it likely won’t be necessary to have better adversarial robustness than the current state of the art (e.g. the robustness of Claude 2). This is due to the use of monitoring ensembles and preventing iteration against the API. My belief that this is likely to work could be falsified by demonstrating cases where we can’t figure out sufficient countermeasures against a red team with realistic affordances.
I made some “Edit Time Additions” (ETA) to my original comment above for clarity.
Ideally, you’d be able to use AI systems to identify suspicious queries for human monitoring, but those systems would only be effective to the extent that they’re adversarially robust. If 99% of queries can be reliably discarded as safe, then manual monitoring costs would fall by 99%.
As discussed in my comment, I’m imagining being able to have a human review perhaps 0.01% to 1% of queries depending on the use case and other factors. So, we’d review the most suspicious-seeming (according to our overall combined monitoring score) 0.01% to 1% of queries. Equivalently (assuming actual misuse is rare), we’d be targeting a False Positive Rate (FPR) of 0.01% to 1% (on the ROC curve).
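To make this operational, here's a minimal sketch of turning a review budget into a threshold on the combined monitoring score (the scores and traffic volume here are made up purely for illustration):

```python
import numpy as np

def review_threshold(monitor_scores: np.ndarray, review_fraction: float) -> float:
    """Score cutoff such that roughly `review_fraction` of queries (the most
    suspicious-seeming ones) get routed to human review. If actual misuse is
    rare, `review_fraction` is approximately the FPR."""
    return float(np.quantile(monitor_scores, 1.0 - review_fraction))

# Hypothetical combined monitoring scores for a day of traffic.
rng = np.random.default_rng(0)
scores = rng.random(1_000_000)

threshold = review_threshold(scores, review_fraction=0.001)  # review the top 0.1%
print(f"flagged {np.mean(scores >= threshold):.3%} of queries for human review")
```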
But adversarial attacks often succeed in 50% or 100% of attempts against various detection systems. Deploying these unreliable systems would not decrease the costs of manual monitoring much without a corresponding drop in performance.
I think the claim:
Deploying these unreliable systems would not decrease the costs of manual monitoring much without a corresponding drop in performance.
seems wrong or at least seems to be answering the wrong question.
Let’s suppose that we can afford to review the most suspicious-seeming 0.1% of queries for a bio-specific API endpoint. Equivalently, we can afford 0.1% FPR for our automated monitoring ensemble. Now, suppose (for simplicity) that catastrophic bio misuse requires 100 queries to the model and that we’re happy as long as we catch the user and report them to the FBI before they finish all 100 queries. Then, if our monitoring ensemble at 0.1% FPR catches a given bio misuse query with just 5% probability (for a competent red team), bio misuse is 99.4% likely to be caught prior to succeeding. So, we don’t actually need very high reliability on the ensemble. (Though we might target more conservative bars than 99.4% in practice.)
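Spelling out that calculation under the same simplifying assumptions (100 required queries, each independently flagged with 5% probability):

```python
# Chance of catching a misuse attempt before it completes, assuming each of the
# `queries_needed` queries is independently flagged with probability `p_catch`.
def p_caught_before_success(p_catch: float, queries_needed: int) -> float:
    return 1.0 - (1.0 - p_catch) ** queries_needed

print(p_caught_before_success(p_catch=0.05, queries_needed=100))  # ~0.994
```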
A reasonable objection here is that maybe we’ll have a large number of people (e.g., >200) trying to create bioweapons using the API and thus one will get lucky. We can avoid this failure mode by having a policy like “if there are more than 3 open FBI investigations into attempted bioweapon construction based on our models, we need to either shut down the API endpoint or spend additional money on human review to increase reliability to a point where the probability of failure is 5x lower”. Then, if we do actually see considerable serious misuse, we can respond dynamically to handle this. At this point, we should also be able to look at case studies to see how effective our safeguards were in practice.
Separately for:
But adversarial attacks often succeed in 50% or 100% of attempts against various detection systems.
I expect that these numbers weren’t against monitoring ensembles in the sense I described earlier, and that the red team had additional affordances beyond just understanding the high-level description of the monitoring setup? E.g., the red team was able to iterate?
Separately for: “But adversarial attacks often succeed in 50% or 100% of attempts against various detection systems.”
I expect that these numbers weren’t against monitoring ensembles in the sense I described earlier, and that the red team had additional affordances beyond just understanding the high-level description of the monitoring setup? E.g., the red team was able to iterate?
This is correct about the paper I cited, but others have achieved similar attack success rates against models like Claude which use an ensemble of defenses. AFAIK Claude does not ban users who attempt misuse, so that element of your plan has never been tested and would likely help a lot.
To summarize this comment, you’ve proposed that baseline monitoring systems could reduce risk to an acceptable level. Specifically, the monitoring system would need to correctly identify at least 5% of dangerous queries as dangerous (a 5% true positive rate) and avoid incorrectly flagging more than 1 in 1,000 safe queries as dangerous (a 0.1% FPR).
I think this level of reliability is possible today (e.g. Claude 2 would likely meet it), but it’s possible that future developments would make defense more difficult. For example, new attack methods have shown LLMs to be less robust to misuse than previously understood. (This is one downside of adversarial robustness research that will become more important as the stakes of adversarial attacks rise.) Perhaps a bigger challenge is the growth of multimodal systems. Defending vision language models is much more difficult than defending pure LLMs. As multimodality becomes standard, we might see adversarial attacks that regularly achieve >95% success rates in bypassing monitoring systems. I’m not particularly confident about how difficult monitoring will be, but it would be beneficial to have monitoring systems which would work even if defense gets much harder in the future.
Overall, these hypotheticals only offer so much information when none of these defenses has ever been publicly built or tested. I think we agree that simple monitoring strategies might be fairly effective and cheap in identifying misuse, and that progress on adversarial robustness would significantly reduce costs by improving the effectiveness of automated monitoring systems.
[Mostly unimportant/nitpicking specific claim]
Perhaps a bigger challenge is the growth of multimodal systems.
For cases like bio, I don’t think multimodal is that much of a threat, because it should be possible to identify text outputs from the model as problematic without needing the context from the input in most cases. So, we can just have part of our monitoring ensemble ignore image inputs (for instance). Other image defenses for the monitor, such as various compression strategies, are plausible. (I think compression has historically been defeated given white-box access to the compression algorithm, but we don’t necessarily need to assume that. I’m also separately pretty optimistic about various randomization strategies for compression-like things, which I don’t think have been tested in the literature.)
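For concreteness, here is one way such a randomized, compression-style defense could look (re-encoding the image at a randomly chosen JPEG quality before the monitor sees it); this is a purely illustrative sketch, not a tested defense:

```python
import io
import random

from PIL import Image

def randomized_jpeg_preprocess(image: Image.Image) -> Image.Image:
    """Re-encode an input image at a randomly chosen JPEG quality before the
    monitor sees it, so an attacker can't optimize a perturbation against one
    fixed compression level."""
    quality = random.randint(30, 70)   # chosen fresh per query
    buffer = io.BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return Image.open(buffer)
```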
In general, there is a tradeoff related to removing information from the monitor: the more info you remove, the harder it is for the monitor to do a good job, but removing info also makes adversarial attacks harder.
Aside: Removing information for the monitor in the collusion case
We discuss this sort of monitoring tradeoff in a different setting in our recent paper on AI control, in the context of trying to prevent collusion (which is different from adversarial attacks, but related). I think the similarity is somewhat interesting here.
In section 3.4.2 we say:
The core tension that the blue team faces here is: if you let the monitor have too much information, then it will be able to collude with the policy. But if you hide too much information, it won’t be capable of correctly determining whether proposed code is backdoored or not. For example, the protocol could decide to not show U the problem statement at all, asking it to determine whether code was backdoored just based on reading the code. But this would cause U to have much worse performance even if it was honestly trying to spot backdoors, because it’s hard to tell whether code is wrong without knowing what problem it’s supposed to solve.