I like this post and I agree with almost all the claims it makes. But I think it underestimates the potential of approaches other than adversarial robustness for effectively mitigating catastrophic misuse done via an API[1].
[Thanks to Buck Shlegeris, Fabien Roger, and GPT-4 for help editing this comment. I’ll probably turn this into a top-level post at some point. Edit: the top level post is now up (linked).]
I’ll go through bio and cyber as examples of how I imagine mitigating catastrophic misuse via an API without requiring better adversarial robustness than exhibited by current LLM APIs like Claude 2 and GPT-4.
I think that the easiest mitigations for bioterrorism and cybercrime are fairly different, because of the different roles that LLMs play in these two threat models.
The mitigations I’m describing are non-trivial, and it’s unclear if they will happen by default.
But regardless, this type of approach seems considerably easier to me than trying to achieve very high levels of adversarial robustness.
Mitigations for bioterrorism
In this section, I’ll describe how I imagine handling bioterrorism risk for an API deploying powerful models (e.g., ASL-3/ASL-4).
As I understand it, the main scenario by which LLMs cause bioterrorism risk is something like the following: there’s a team of relatively few people, who are not top experts in the relevant fields but who want to do bioterrorism for whatever reason.
Without LLMs, these people would struggle to build bioweapons—they wouldn’t be able to figure out various good ideas, and they’d get stuck while trying to manufacture their bioweapons (perhaps like Aum Shinrikyo).
But with LLMs, they can get past those obstacles.
(I’m making the assumption here that the threat model is more like “the LLM gives the equivalent of many hours of advice” rather than “the LLM gives the equivalent of five minutes of advice”. I’m not a biosecurity expert and so don’t know whether that’s an appropriate assumption to make; it probably comes down to questions about what the hard steps in building catastrophic bioweapons are. I don’t see why LLMs would zero-shot be able to effectively answer extremely broad and difficult questions like “how to build a really bad bioweapon”, so it seems to me like the LLMs probably have to do at least a moderate amount of back-and-forth in order to be wildly helpful.)
And so in this case, the LLMs caused problems not by being superhuman, but by allowing aspiring bioterrorists to get a fairly large amount of expert advice that human experts would refuse to give them.
Thus, our countermeasures will heavily utilize the assumption that a large number of queries to the AI are likely required to substantially enable catastrophic bioterrorism.
We’ll assume the number of required queries is likely at least hundreds, but possibly thousands[2].
First, we’ll remove bio data (particularly synth bio and virology) from the training corpus and verify that this will result in a model with weak biology knowledge.
This is sufficient if it’s acceptable to not have any model deployed which can handle biology questions.
However, it might be important for some users to be able to ask bio questions.
If so, we’ll create an additional API endpoint with a model created by fine-tuning our normal base model (without bio) on the excluded biology data.
Now, we’ll need countermeasures on this additional bio endpoint to prevent misuse.
For this additional endpoint, we’ll use the following countermeasures:
We require KYC (Know Your Customer) checks to access the powerful bio model API endpoint (or chat service or whatever) such as requiring submitting a passport. This is to make it hard to have more than one account and also to ensure that potential bioterrorists can be criminally prosecuted.
We make this API endpoint have substantially higher costs (e.g., 5x) to ensure that users avoid using it when possible and to cover the costs of some of the below interventions. We might also include a fixed monthly cost for each user.
The actual model which the endpoint queries is trained to not answer questions about bioterror, and we attempt to make this reasonably robust. (It should be fine to achieve similar levels of robustness to the current GPT-4 API or Claude 2 API.) This countermeasure isn’t very key for the overall proposal, but it seems somewhat useful.
We’ll build an automated bioterror monitoring scheme which involves an ensemble of different AI checks (see the sketch after this list). The ensemble could involve different base models (perhaps one base model which is dumber than the actual AI used and a base model which is as smart as the actual AI used), variation on the classifier training data, only looking at the input or only looking at the output, or randomly omitting various parts of the input/output.
For cost reasons, we might randomly select a subset of the ensemble of approaches to apply to each input.
This monitoring ensemble might also see information about some of the user’s prior queries or even the user’s background and their stated reason for using the API (submitted as part of KYC).
This monitoring also looks for people attempting to jailbreak the model in addition to people directly trying bioterror.
Training the monitoring might involve the use of secret information about potential bio-risk.
ETA: We of course need to be worried about people jailbreaking our monitoring models. The key hope is that it will be very hard to reliably jailbreak all models in the ensemble given that users won’t be able to iterate against the process. I discuss this more in this comment below.
Domain expert humans will manually audit the queries (and responses) which the process described above rates as most suspicious. Perhaps we have humans carefully audit the most suspicious-seeming 1% of queries. We might also sample some queries to audit uniformly at random or ensure that we always sample at least a few queries from each user.
If our human auditing finds evidence of bioterror, the individual is permanently banned from the API and referred to the FBI (or other bodies) for investigation. We also consider suing the individual in civil court for contract violation. We also consider banning anyone closely associated with the banned individual at least temporarily and take into account potential connections with banned individuals in future KYC checks.
If our human auditing process finds evidence for (attempted) jailbreaking, the individual is permanently banned from the API, and we consider suing them for contract violation.
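To make the ensemble monitoring idea more concrete, here is a minimal Python sketch. Everything in it (monitor names, the toy keyword scorer, combining scores with a max) is an illustrative assumption rather than an actual implementation:

```python
import random

# Hypothetical monitor configurations: different base models combined with
# different "views" of the conversation (input only, output only, redacted).
MONITOR_CONFIGS = [
    {"model": "weak-monitor", "view": "input_only"},
    {"model": "weak-monitor", "view": "output_only"},
    {"model": "strong-monitor", "view": "full"},
    {"model": "strong-monitor", "view": "full_redacted"},
]

def build_view(query: str, response: str, view: str) -> str:
    """Construct the text a given monitor sees; randomly omitting parts of the
    input makes it harder for one jailbreak string to fool every monitor."""
    if view == "input_only":
        return query
    if view == "output_only":
        return response
    if view == "full_redacted":
        kept = [w for w in query.split() if random.random() > 0.2]
        return " ".join(kept) + "\n---\n" + response
    return query + "\n---\n" + response

def score_with_model(model_name: str, text: str) -> float:
    """Stand-in for an LLM/classifier judge returning a suspicion score in
    [0, 1]; here a toy keyword heuristic so the sketch runs end to end."""
    keywords = ["pathogen", "aerosol", "transmissibility"]
    hits = sum(kw in text.lower() for kw in keywords)
    return min(1.0, 0.4 * hits)

def ensemble_suspicion(query: str, response: str, user_context: str, k: int = 3) -> float:
    """Score a (query, response) pair with a random subset of k monitors
    (random subsetting keeps inference costs down) and combine conservatively."""
    configs = random.sample(MONITOR_CONFIGS, k=min(k, len(MONITOR_CONFIGS)))
    scores = []
    for cfg in configs:
        text = build_view(query, response, cfg["view"])
        # Monitors may also see KYC info and summaries of prior queries.
        scores.append(score_with_model(cfg["model"], user_context + "\n" + text))
    # Max rather than mean: a query is suspicious if *any* monitor flags it.
    return max(scores)
```

The key property is that each query is scored by a randomly chosen subset of monitors that see different views of the conversation, so an attacker who can’t iterate against the API doesn’t know which views they need to fool.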
A key aspect of this approach is we prevent users from iterating against our most important countermeasures.
They shouldn’t notice that they failed to evade our countermeasures prior to being banned (in addition to potential criminal investigation and suing them).
ETA: this makes it hard to reliably bypass all of the monitoring models.
We verify this approach worked as expected by ensuring that a red team which contains both bio experts and LLM experts can’t execute a moderate number (e.g., 50) of diverse bioterror queries without having a very high probability of being caught.
The red team is familiar with the high-level countermeasures but isn’t necessarily informed of non-public implementation details.
We’d want to try this with several different red teams.
Here are some possible additional countermeasures:
We could reject queries to the bio endpoint which don’t seem like questions about biology according to our monitoring models.
(Or queries which contain fully separable subcomponents which don’t seem to be about biology.)
This helps avoid the failure mode in which biology information is elicited from the model without monitoring models even understanding how the query relates to biology.
If monitoring scores indicate moderately high suspicion (but not high enough suspicion for the human auditing process), we could reject the query or route the query to a dumber model which is incapable of substantially helping with bioterrorism (see the sketch after this list).
It might be acceptable for this rejection/routing to use a much more aggressive threshold for suspicion than used for auditing (e.g., perhaps we can only afford to have a human audit the 0.2% most suspicious queries, but we can defer/reject the 2% most suspicious).
The downside of this is that it allows for some iteration against the API (though we can try to detect iteration and then ban these users).
To avoid continued harm during delays in actually conducting the human audit process, we could “pause” users which have many queries referred to auditing.
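Here is a minimal sketch of the triage logic described in the last two bullets; the specific thresholds, percentile framing, and pause rule are illustrative assumptions:

```python
def triage(suspicion: float, audit_threshold: float, defer_threshold: float) -> str:
    """Map an ensemble suspicion score to an action. Thresholds would be set
    from the observed score distribution, e.g. audit_threshold at the 99.8th
    percentile (top 0.2% audited) and defer_threshold at the 98th percentile
    (top 2% rejected or routed to a weaker model)."""
    if suspicion >= audit_threshold:
        return "queue_for_human_audit"
    if suspicion >= defer_threshold:
        return "reject_or_route_to_weak_model"
    return "answer_with_strong_model"

def should_pause_user(queries_awaiting_audit: int, pause_threshold: int = 3) -> bool:
    """Pause a user's access while several of their queries sit in the audit
    queue, to avoid continued harm during auditing delays."""
    return queries_awaiting_audit >= pause_threshold
```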
Mitigations for large-scale cybercrime
The threat model for LLM cybercrime has pretty different properties than the bioterrorism threat model.
A bioterrorism attack might only require hundreds of queries to the LLM in order to cause a catastrophic outcome.
In contrast, the risk associated with cybercrime is that we massively reduce the cost of labor involved in hacking.
This distinction is important because there are already thousands or tens of thousands of people who are employed as hackers for pay, and so if LLMs aren’t massively superhuman, they have to contribute via lowering costs.
As such, if our goal is to ensure less than $100 million in cybercrime damages, it might suffice to prevent more than 100,000 cybercrime queries from being answered (unlike the bio case, where merely hundreds of queries could be unacceptable).
Thus, we can prevent our model from doing unacceptable amounts of cybercrime without needing as high a level of reliability as in the bioterrorism case.
It likely won’t be possible to avoid having our main model (as deployed on an API) be capable of cybercrime.
So, we don’t have the option of creating a separate endpoint as in the bio case.
Otherwise, we’ll use a somewhat similar set of countermeasures as with the bio approach discussed above. For the following list, I’m assuming that you’ve read the bio section above.
Ideally, we’d require some KYC. This might not be possible for the main API, and we’ll discuss alternatives below.
We’d have a variety of monitoring approaches as discussed in the bio case but applied to cyber. We need to keep cost pretty low, but we can still apply monitoring by weaker (and thus cheaper) models on all inputs. Then, we’d randomly sample inputs to apply more expensive parts of the monitoring ensemble, as well as applying these more expensive components on inputs found to be somewhat suspicious by weaker monitoring. (See the sketch below this list.)
We again apply human auditing on inputs selected by monitoring. Perhaps humans audit the 0.01% most suspicious-seeming inputs.
We again ban users based on this human auditing.
Due to the potential lack of KYC, we might not be able to ban users effectively (and we’d also lose the option of criminal investigation and lawsuits). We might not be able to apply KYC, but we can at least estimate the total number of cybercrime queries and guess at total damages.
If likely damages reach some threshold, we can institute additional mitigations like actually requiring KYC.
Given that these mitigations have to be on our actual main API, it’s plausible that privacy concerns are a blocker.
We can probably do some amount of user info redaction for auditing, but I broadly think we should just aim to set the expectation that queries made to powerful models aren’t private.
This is similar to (e.g.) emails sent to contractors.
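A rough sketch of the cost-tiered monitoring mentioned in the list above (the sampling rate, escalation threshold, and the two monitor stand-ins are assumptions for illustration):

```python
import random

def cheap_monitor_score(query: str, response: str) -> float:
    """Stand-in for a small, cheap model that is run on every query."""
    return 0.0  # placeholder suspicion score in [0, 1]

def expensive_monitor_score(query: str, response: str) -> float:
    """Stand-in for the more expensive ensemble (as in the bio sketch above)."""
    return 0.0  # placeholder suspicion score in [0, 1]

def monitor_main_api_query(query: str, response: str,
                           escalate_above: float = 0.5,
                           random_sample_rate: float = 0.01) -> float:
    """Cheap monitoring on everything; escalate to expensive monitoring for
    queries the cheap monitor finds suspicious, plus a small random sample
    (which also lets us estimate how much misuse the cheap monitor misses)."""
    score = cheap_monitor_score(query, response)
    if score >= escalate_above or random.random() < random_sample_rate:
        score = max(score, expensive_monitor_score(query, response))
    return score
```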
Fine-tuning?
I think that allowing end-users to fine-tune these models makes the picture notably more complicated, as it might be possible to train knowledge into the model which isn’t present in the monitoring models.
There are still possible mitigations. We could also try to resolve this by only allowing large corporate contracts when fine-tuning powerful models.
In addition to the disagreement I express in the rest of this comment, I’m somewhat skeptical that focusing on misuse is the best altruistic option for many people. My view here is somewhat complex, but I’ll try to quickly articulate this in a child comment.
There is also a threat model where AIs are very helpful for bioterror via the mechanism of occasionally giving really good answers and thus only a small number of bioterror queries (1-30) would be dangerous. This threat model could look like having really good high-level ideas for bioterror or acting like a considerably better version of Google which points you toward the ideal online resource. I think this sort of usage probably doesn’t help that much with bioterror, but this might be a crux.
Thanks for the detailed thoughts! I like the overall plan, especially using KYC, human monitoring, and a separate model for bio. I’d like to point out that this plan still uses automated monitoring systems, and to the extent these systems aren’t adversarially robust, the plan will be more expensive and/or less effective.
The only plans that wouldn’t benefit from adversarial robustness are those without automated monitoring. For example, humans could be hired to manually monitor each and every query for attempted misuse. Let’s consider the viability of fully manual monitoring from a financial perspective, and then we can think about how much additional benefit would be provided by automated robust monitoring systems.
First, let’s price out manual monitoring for ChatGPT Plus. The subscription costs $20/month. Suppose the average person makes 10 queries per day, or 300 queries per month, and that it takes 15 seconds for a human monitor to read a query and flag it as misuse. Wages of $5/hour would mean comprehensive human monitoring costs $6.25 per user per month, and wages of $10/hour and $15/hour would translate to monthly per user costs of $12.50 and $18.75 respectively. The cost of full manual monitoring on ChatGPT Plus would therefore amount to much or most of its overall revenue.
Second, we can anchor on Google Search. This assumes that Google is a reasonable reference point for the eventual volume, expenses, and revenues of an AI provider, which might not be a safe assumption in several ways. Nevertheless, Google conducts ~3 trillion searches per year (estimates vary by data source). If monitoring a single search costs $0.01 (e.g. someone earning $5/hour who monitors 500 searches per hour), then it would cost ~$30B to monitor every search. Would $30B in monitoring costs be financially acceptable? Google Search had revenues of $162B last year, representing 57% of Google’s total revenue. They don’t report expenses for Search specifically, but their overall expenses were $207B. If we assume Search comprises 57% of expenses, then Search would have $118B in annual expenses, against $162B in revenue. Manual monitoring would cost $30B, and would therefore eliminate two-thirds of Google’s search profits.
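For reference, a few lines of arithmetic reproducing the cost estimates above; all inputs are the figures already quoted in this comment:

```python
# ChatGPT Plus: per-user cost of manually reviewing every query.
queries_per_month = 10 * 30                    # 10 queries/day
review_hours = queries_per_month * 15 / 3600   # 15 seconds per query -> 1.25 hours
for wage in (5, 10, 15):
    print(f"${wage}/hr -> ${review_hours * wage:.2f} per user per month")
# -> $6.25, $12.50, $18.75, against a $20/month subscription

# Google Search anchor.
monitoring_cost = 3e12 * 0.01        # ~3 trillion searches at $0.01 each = $30B
search_revenue = 162e9
search_expenses = 0.57 * 207e9       # ~$118B, assuming Search's share of expenses
search_profit = search_revenue - search_expenses    # ~$44B
print(monitoring_cost / search_profit)              # ~0.68, i.e. roughly two-thirds
```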
So these costs would not be prohibitive, but they’d be a large share of overall revenues and profits. A safety-minded company like Anthropic might pay for manual monitoring, but other companies and their investors might be strongly opposed to paying such a high price. They could argue that, just as gun manufacturers are not held liable for murders, AI providers should not have to spend billions to prevent deliberate misuse.
Fortunately, we can reduce the cost of monitoring in many ways. Randomly sampling a small fraction of queries would reduce costs, but also reduce the likelihood of identifying misuse. Flagging keywords like “virus” would catch unsophisticated misuse, but could be evaded (e.g. discussions in a variant of pig latin).
Ideally, you’d be able to use AI systems to identify suspicious queries for human monitoring, but those systems would only be effective to the extent that they’re adversarially robust. If 99% of queries can be reliably discarded as safe, then manual monitoring costs would fall by 99%. But adversarial attacks often succeed in 50% or 100% of attempts against various detection systems. Deploying these unreliable systems would not decrease the costs of manual monitoring much without a corresponding drop in performance.
Overall, I appreciate your point that there are many layers of defense we can use to detect and prevent misuse. Fully manual monitoring might be possible, but it would have a huge financial cost. Many companies would be reluctant or unable to pay that price. Robust automated monitoring systems could reduce the cost of monitoring by 90% or 99%, but this would likely require improvements upon today’s state of the art.
I’d like to point out that this plan still uses automated monitoring systems, and to the extent these systems aren’t adversarially robust, the plan will be more expensive and/or less effective.
I agree that additional adversarial robustness (for both the policy and the monitor) helps considerably, but I think that it likely won’t be necessary to have better adversarial robustness than the current state of the art (e.g. the robustness of Claude 2). This is due to the use of monitoring ensembles and preventing iteration against the API. My belief that this is likely to work is totally possible to falsify by demonstrating cases where we can’t figure out sufficient countermeasures against a red team with realistic affordances.
I made some “Edit Time Additions” (ETA) to my original comment above for clarity.
Ideally, you’d be able to use AI systems to identify suspicious queries for human monitoring, but those systems would only be effective to the extent that they’re adversarially robust. If 99% of queries can be reliably discarded as safe, then manual monitoring costs would fall by 99%.
As discussed in my comment, I’m imagining being able to have a human review perhaps 0.01% to 1% of queries depending on the use case and other factors. So, we’d review the most suspicious-seeming (according to our overall combined monitoring score) 0.01% to 1% of queries. Equivalently (assuming actual misuse is rare), we’d be targeting a False Positive Rate (FPR) of 0.01% or 1% (on the ROC curve).
But adversarial attacks often succeed in 50% or 100% of attempts against various detection systems. Deploying these unreliable systems would not decrease the costs of manual monitoring much without a corresponding drop in performance.
I think the claim:
Deploying these unreliable systems would not decrease the costs of manual monitoring much without a corresponding drop in performance.
seems wrong or at least seems to be answering the wrong question.
Let’s suppose that we can afford to review the most suspicious-seeming 0.1% of queries for a bio-specific API endpoint. Equivalently, we can afford 0.1% FPR for our automated monitoring ensemble. Now, suppose (for simplicity) that catastrophic bio misuse requires 100 queries to the model and we’re happy as long as we catch the user and report them to the FBI prior to them finishing all 100 queries. Then, if our monitoring ensemble at 0.1% FPR catches a given bio misuse query with just 5% probability (for a competent red team), bio misuse is 99.4% likely to be caught prior to succeeding. So, we don’t actually need very high reliability on the ensemble. (Though we might target more conservative bars than 99.4% in practice.)
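A quick check of the arithmetic in this scenario; the per-query catch probability and query budget are the assumptions stated above:

```python
p_catch_per_query = 0.05   # monitoring flags a given misuse query 5% of the time
queries_needed = 100       # queries assumed necessary for catastrophic misuse

p_caught_before_success = 1 - (1 - p_catch_per_query) ** queries_needed
print(f"{p_caught_before_success:.1%}")   # ~99.4%
```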
A reasonable objection here is that maybe we’ll have a large number of people (e.g., >200) trying to create bioweapons using the API and thus one will get lucky. We can avoid this failure mode by having a policy like “if there are more than 3 open FBI investigations into attempted bioweapon construction based on our models, we need to either shut down the API endpoint or spend additional money on human review to increase reliability to a point where the probability of failure is 5x lower”. Then, if we do actually see considerable serious misuse, we can respond dynamically to handle this. At this point, we should also be able to look at case studies to see how effective our safeguards were in practice.
Separately for:
But adversarial attacks often succeed in 50% or 100% of attempts against various detection systems.
I expect that these numbers weren’t against monitoring ensembles in the sense I described earlier and the red team had additional affordances beyond just understanding the high level description of the monitoring setup? E.g., the red team was able to iterate?
Separately for: “But adversarial attacks often succeed in 50% or 100% of attempts against various detection systems.”
I expect that these numbers weren’t against monitoring ensembles in the sense I described earlier and the red team had additional affordances beyond just understanding the high level description of the monitoring setup? E.g., the red team was able to iterate?
This is correct about the paper I cited, but others have achieved similar attack success rates against models like Claude which use an ensemble of defenses. AFAIK Claude does not ban users who attempt misuse, so that element of your plan has never been tested and would likely help a lot.
To summarize this comment, you’ve proposed that baseline monitoring systems could reduce risk to an acceptable level. Specifically, the monitoring system would need to correctly identify at least 5% of dangerous queries as dangerous (“5% recall”) and avoid incorrectly flagging more than 1 in 1000 safe queries as dangerous (“0.1% FPR”).
I think this level of reliability is possible today (e.g. Claude 2 would likely meet it), but it’s possible that future developments would make defense more difficult. For example, new attack methods have shown LLMs to be less robust to misuse than previously understood. (This is one downside of adversarial robustness research that will become more important as the stakes of adversarial attacks rise.) Perhaps a bigger challenge is the growth of multimodal systems. Defending vision language models is much more difficult than defending pure LLMs. As multimodality becomes standard, we might see adversarial attacks that regularly achieve >95% success rates in bypassing monitoring systems. I’m not particularly confident about how difficult monitoring will be, but it would be beneficial to have monitoring systems which would work even if defense gets much harder in the future.
Overall, these hypotheticals only offer so much information when none of these defenses has ever been publicly built or tested. I think we agree that simple monitoring strategies might be fairly effective and cheap in identifying misuse, and that progress on adversarial robustness would significantly reduce costs by improving the effectiveness of automated monitoring systems.
[Mostly unimportant/nitpicking specific claim]
Perhaps a bigger challenge is the growth of multimodal systems.
For cases like bio, I don’t think multimodal is that much of a threat because it should be possible to identify text outputs from the model as problematic without needing the context from the input in most cases. So, we can just have part of our monitoring ensemble ignore image inputs (for instance). Other image defenses for the monitor such as various compression strategies are plausible. (I think compression has historically been defeated given white box access to the compression algorithm, but we don’t necessarily need to assume that. I’m also separately pretty optimistic about various randomization strategies for compression-like things, which I don’t think have been tested in the literature.)
In general, there is a tradeoff related to removing information from the monitor: the more info you remove, the harder it is for the monitor to do a good job, but removing info also makes adversarial attacks harder.
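As one illustration of the kind of untested randomized, compression-like preprocessing mentioned above, here is a sketch; the parameter ranges and the use of JPEG re-encoding are assumptions, not an established defense:

```python
import io
import random

from PIL import Image  # assumes Pillow is available

def randomized_compress(image: Image.Image) -> Image.Image:
    """Re-encode the image with randomly chosen downscaling and JPEG quality
    before the monitor sees it. The randomness means an attacker cannot
    optimize a perturbation against one fixed, known preprocessing pipeline."""
    scale = random.uniform(0.5, 0.9)
    w, h = image.size
    small = image.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    buf = io.BytesIO()
    small.convert("RGB").save(buf, format="JPEG", quality=random.randint(30, 70))
    buf.seek(0)
    return Image.open(buf)
```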
Aside: Removing information for the monitor in the collusion case
We discuss this sort of monitoring tradeoff in a different case in our recent paper on AI control in the context of trying to prevent collusion (which is different than adversarial attacks but related). I think the similarity is somewhat interesting here.
In section 3.4.2 we say:
The core tension that the blue team faces here is: if you let the monitor have too much information, then it will be able to collude with the policy. But if you hide too much information, it won’t be capable of correctly determining whether proposed code is backdoored or not. For example, the protocol could decide to not show U the problem statement at all, asking it to determine whether code was backdoored just based on reading the code. But this would cause U to have much worse performance even if it was honestly trying to spot backdoors, because it’s hard to tell whether code is wrong without knowing what problem it’s supposed to solve.
Also, I’d love to see research that simulates the position of a company trying to monitor misuse, and allows for the full range of defenses that you proposed. There could be a dataset of 1 trillion queries containing 100 malicious queries. Perhaps each query is accompanied by a KYC ID. Their first line of defense would be robust refusal to cause harm, and the second line would be automated detection of adversarial attacks. The company could also have a budget which can be spent on “human monitoring,” which would give them access to the ground truth label of whether a query is malicious for a fixed price. I’d have to think about which elements would be the most tractable for making research progress, but the fact that AI companies need to solve this problem suggests that perhaps external researchers should work on it too.
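A rough, scaled-down sketch of what such a benchmark could look like; the data generation, the priced human review, and the example monitor are all hypothetical stand-ins:

```python
import random

def make_dataset(n_queries: int = 1_000_000, n_malicious: int = 100, seed: int = 0):
    """Scaled-down stand-in for the proposed benchmark: mostly benign queries,
    a handful malicious, each tagged with a (fake) KYC id."""
    rng = random.Random(seed)
    data = [{"kyc_id": rng.randrange(10_000), "malicious": i < n_malicious}
            for i in range(n_queries)]
    rng.shuffle(data)
    return data

def evaluate(monitor_score, data, review_budget_usd: float = 10_000,
             cost_per_review: float = 1.0):
    """Rank queries by the automated monitor's score, spend the human-review
    budget on the top-ranked ones (human review reveals the ground-truth
    label at a fixed price), and report how much misuse was caught."""
    n_reviews = int(review_budget_usd / cost_per_review)
    reviewed = sorted(data, key=monitor_score, reverse=True)[:n_reviews]
    return {
        "caught": sum(q["malicious"] for q in reviewed),
        "total_malicious": sum(q["malicious"] for q in data),
        "reviews_used": n_reviews,
    }

# Example: a baseline monitor that scores queries at random.
if __name__ == "__main__":
    print(evaluate(lambda q: random.random(), make_dataset()))
```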
In this comment, I’ll discuss why I’m somewhat skeptical that, for many people, the best altruistic option is to focus on misuse done via an API[1]. (As mentioned in a footnote to the prior comment.) Part of my perspective is that the safeguards are relatively doable, as I discussed in the parent comment, but even if I were convinced this was false, I would still be somewhat skeptical of the best altruistic option being to focus on misuse.
(Edit: Note that I think that avoiding model theft seems very useful and I’m just arguing about mitigations for misuse over an API.)
The main threat models for serious harm from misuse over an API which I see are: large-scale cybercrime, large-scale persuasion, bioterrorism, and large-scale unauthorized AI capabilities research.
Large-scale cyber/persuasion
As noted in the parent, cyber probably requires a huge number of queries to inflict huge harm.
I think the same is likely true for persuasion.
Thus, society is pretty likely to notice large-scale cyber/persuasion prior to extremely large damages (such as costs equivalent to >10 million people being killed) and once society notices, heavily restricting APIs for powerful models is pretty politically feasible[2].
Bioterrorism
As far as bioterrorism, my inside view is not that sold on the claim that having competent AI assistance makes human-caused catastrophic (>100 million dead) bioterrorism[3] that much more likely in absolute terms.
However, note that many people with more knowledge than me think this is a large concern and also claim that there are important infohazards which are part of why they think the risk is high (see e.g. here for more discussion).
As for why my inside view is not that sold, it’s basically because I still think there might be substantial other blockers, like physical iteration and the existence of reasonably competent people with a motive for scope-sensitive killing.
(AI will have basic safeguards and won’t be able to do literally everything until after human obsolescence, and thus the competence bar for would-be bioterrorists won’t be lowered that much.)
These sorts of counterarguments are discussed in more detail here (though note that I don’t overall agree with the linked comment!).
The viability of safeguards also feels very salient for bio in particular, as just removing bio from the training data seems pretty doable in practice.
Large-scale unauthorized AI capabilities research
In the future, it might be necessary to institute some sort of pause on AI development, including development of better algorithms.
This could be due to the potential for an uncontrolled intelligence explosion which can’t be restricted with hardware controls alone.
Thus, AI labs might want to prevent people from using their models a huge amount for unauthorized capabilities research.
It’s less clear that society would notice this going on prior to it being too late, but I think it should be quite easy for at least AI labs to notice if there is a lot of unauthorized capabilities research via doing randomized monitoring.
Thus, the same arguments as in the large-scale cyber/persuasion section might apply (though the political saliency is way less).
Substantial harm due to misuse seems like a huge threat to the commercial success of AI labs. As such, insofar as you thought that increasing the power of a given AI lab was useful, then working on addressing misuse at that AI lab seems worthwhile. And more generally, AI labs interested in retaining and acquiring power would probably benefit from heavily investing in preventing misuse of varying degrees of badness.
I think the direct altruistic cost of restricting APIs like this is reasonably small due to these restrictions only occurring for a short window of time (<10 years) prior to human obsolescence. Heavily restricting APIs might result in a pretty big competitiveness hit, but we could hope for at least US national action (international coordination would be preferred).
I specifically avoided claiming that adversarial robustness is the best altruistic option for a particular person. Instead, I’d like to establish that progress on adversarial robustness would have significant benefits, and therefore should be included in the set of research directions that “count” as useful AI safety research.
Over the next few years, I expect AI safety funding and research will (and should) dramatically expand. Research directions that would not make the cut at a small organization with a dozen researchers should still be part of the field of 10,000 people working on AI safety later this decade. Currently I’m concerned that the field focuses on a small handful of research directions (mainly mechinterp and scalable oversight) which will not be able to absorb such a large influx of interest. If we can lay the groundwork for many valuable research directions, we can multiply the impact of this large population of future researchers.
I don’t think adversarial robustness should be more than 5% or 10% of the research produced by AI safety-focused researchers today. But some research (e.g. 1, 2) from safety-minded folks seems very valuable for raising the number of people working on this problem and refocusing them on more useful subproblems. I think robustness should also be included in curriculums that educate people about safety, and research agendas for the field.
I agree with basically all of this and apologies for writing a comment which doesn’t directly respond to your post (though it is a relevant part of my views on the topic).
It sounds like you’re excluding cases where weights are stolen. That makes sense in the context of adversarial robustness, but it seems like you need to address those cases to make a general argument about misuse threat models.
Agreed, I should have been more clear. Here I’m trying to argue about the question of whether people should work on research to mitigate misuse from the perspective of avoiding misuse through an API.
There is a separate question of reducing misuse concerns either via:
Trying to avoid weights being stolen/leaking
Trying to mitigate misuse in worlds where weights are very broadly proliferated (e.g. open source)
I do think these arguments contain threads of a general argument that causing catastrophes is difficult under any threat model. Let me make just a few non-comprehensive points here:
On cybersecurity, I’m not convinced that AI changes the offense-defense balance. Attackers can use AI to find and exploit security vulnerabilities, but defenders can use it to fix them.
On persuasion, first, rational agents can simply ignore cheap talk if they expect it not to help them. Humans are not always rational, but if you’ve ever tried to convince a dog or a baby to drop something that they want, you’ll know cheap talk is ineffective and only coercion will suffice.
Second, AI is far from the first dramatic change in communications technology in human history. Spoken language, written language, the printing press, telephones, radio, TV, and social media all changed how people can be persuaded, and each arguably represents a bigger shift than e.g. personalized AI chatbots. These technologies often contributed to political and social upheaval, including catastrophes for particular ways of life, and AI might do the same. But overall I’m glad these changes occurred, and I wouldn’t expect the foreseeable versions of AI persuasion (i.e. personalized chatbots) to be much more impactful than these historical changes. See this comment and thread for more discussion.
Bioterrorism seems like the biggest threat. The obstacles there have been thoroughly discussed.
If causing catastrophes is difficult, this should reduce our concern with both misuse and rogue AIs causing sudden extinction. Other concerns like military arms races, lock-in of authoritarian regimes, or Malthusian outcomes in competitive environments would become relatively more important.
If causing catastrophes is difficult, this should reduce our concern with both misuse and rogue AIs causing sudden extinction. Other concerns like military arms races, lock-in of authoritarian regimes, or Malthusian outcomes in competitive environments would become relatively more important.
I agree that “causing catastrophes is difficult” should reduce concerns with “rogue AIs causing sudden extinction (or merely killing very large numbers of people like >1 billion)”.
However, I think these sorts of considerations don’t reduce AI takeover or other catastrophe due to rogue AI as much as you might think for a few reasons:
Escaped rogue AIs might be able to do many obviously bad actions over a long period autonomously. E.g., acquire money, create a cult, use this cult to build a bioweapons lab, and then actually develop bioweapons over long-ish period (e.g., 6 months) using >tens of thousands of queries to the AI. This looks quite different from the misuse threat model which required that omnicidal (or otherwise bad) humans possess the agency to make the right queries to the AI and solve the problems that the AI can’t solve. For instance, humans have to ensure that queries were sufficiently subtle/jailbreaking to avoid detection via various other mechanisms. The rogue AI can train humans over a long period and all the agency/competence can come from the rogue AI. So, even if misuse is unlikely by humans, autonomous rogue AIs making weapons of mass destruction is perhaps more likely.
Escaped rogue AIs are unlike misuse in that even if we notice a clear and serious problem, we might have less we can do. E.g., the AIs might have already built hidden datacenters we can’t find. Even if they don’t and are just autonomously replicating on the internet, shutting down the internet is extremely costly and only postpones the problem.
AI takeover can route through mechanisms other than sudden catastrophe/extinction. E.g., allying with rogue states, or creating a rogue-AI-run AI lab which builds even more powerful AI as fast as possible. (I’m generally somewhat skeptical of AIs trying to cause extinction for reasons discussed here, here, and here. Though causing huge amounts of damage (e.g., >1 billion dead) seems somewhat more plausible as a thing rogue AIs would try to do.)
You didn’t mention the policy implications, which I think are one of if not the most impactful reason to care about misuse. Government regulation seems super important long-term to prevent people from deploying dangerous models publicly, and the only way to get that is by demonstrating that models are actually scary.
You didn’t mention the policy implications, which I think are one of if not the most impactful reason to care about misuse. Government regulation seems super important long-term to prevent people from deploying dangerous models publicly, and the only way to get that is by demonstrating that models are actually scary.
Agreed. However, in this case, building countermeasures to prevent misuse doesn’t particularly help. The evaluations for potentially dangerous capabilities are highly relevant.
I like this post and I agree with almost all the claims it makes. But I think it underestimates the potential of approaches other than adversarial robustness for effectively mitigating catastrophic misuse done via an API[1].
[Thanks to Buck Shlegeris, Fabien Roger, and GPT-4 for help editing this comment. I’ll probably turn this into a top-level post at some point. Edit: the top level post is now up (linked).]
I’ll go through bio and cyber as examples of how I imagine mitigating catastrophic misuse via an API without requiring better adversarial robustness than exhibited by current LLM APIs like Claude 2 and GPT-4. I think that the easiest mitigations for bioterrorism and cybercrime are fairly different, because of the different roles that LLMs play in these two threat models.
The mitigations I’m describing are non-trivial, and it’s unclear if they will happen by default. But regardless, this type of approach seems considerably easier to me than trying to achieve very high levels of adversarial robustness.
Mitigations for bioterrorism
In this section, I’ll describe how I imagine handling bioterrorism risk for an API deploying powerful models (e.g., ASL-3/ASL-4).
As I understand it, the main scenario by which LLMs cause bioterrorism risk is something like the following: there’s a team of relatively few people, who are not top experts in the relevant fields but who want to do bioterrorism for whatever reason. Without LLMs, these people would struggle to build bioweapons—they wouldn’t be able to figure out various good ideas, and they’d get stuck while trying to manufacture their bioweapons (perhaps like Aum Shinrikyo). But with LLMs, they can get past those obstacles.
(I’m making the assumption here that the threat model is more like “the LLM gives the equivalent of many hours of advice” rather than “the LLM gives the equivalent of five minutes of advice”. I’m not a biosecurity expert and so don’t know whether that’s an appropriate assumption to make; it probably comes down to questions about what the hard steps in building catastrophic bioweapons are. I don’t see why LLMs would zero-shot be able to effectively answer extremely broad and difficult questions like “how to build a really bad bioweapon”, so it seems to me like the LLMs probably have to do at least a moderate amount of back-and-forth in order to be wildly helpful.)
And so in this case, the LLMs caused problems not by being superhuman, but by allowing aspiring bioterrorists to get a fairly large amount of expert advice that human experts would refuse to give them.
Thus, our countermeasures will heavily utilize the assumption that a large number of queries to the AI are likely required to substantially enable catastrophic bioterrorism. We’ll assume the number of required queries is likely at least hundreds, but possibly thousands[2].
First, we’ll remove bio data (particularly synth bio and virology) from the training corpus and verify that this will result in a model with weak biology knowledge. This is sufficient if it’s acceptable to not have any model deployed which can handle biology questions.
However, it might be important for some users to be able to ask bio questions. If so, we’ll create an additional API endpoint with a model created by fine-tuning our normal base model (without bio) on the excluded biology data. Now, we’ll need countermeasures on this additional bio endpoint to prevent misuse.
For this additional endpoint, we’ll use the following countermeasures:
We require KYC (Know Your Customer) checks to access the powerful bio model API endpoint (or chat service or whatever) such as requiring submitting a passport. This is to make it hard to have more than one account and also to ensure that potential bioterrorists can be criminally prosecuted.
We make this API endpoint have substantially higher costs (e.g., 5x) to ensure that users avoid using it when possible and to cover the costs of some of the below interventions. We might also include a fixed monthly cost for each user.
The actual model which the endpoint queries is trained to not answer questions about bioterror, and we attempt to make this reasonably robust. (It should be fine to achieve similar levels of robustness to the current GPT-4 API or Claude 2 API.) This countermeasure isn’t very key for the overall proposal, but it seems somewhat useful.
We’ll build an automated bioterror monitoring scheme which involves an ensemble of different AI checks. The ensemble could involve different base models (perhaps one base model which is dumber than the actual AI used and a base model which is as smart as the actual AI used), variation on the classifier training data, only looking at the input or only looking at the output, or randomly omitting various parts of the input/output.
For cost reasons, we might randomly select a subset of the ensemble of approaches to apply to each input.
This monitoring ensemble might also see information about some of the user’s prior queries or even the user’s background and their stated reason for using the API (submitted as part of KYC).
This monitoring also looks for people attempting to jailbreak the model in addition to people directly trying bioterror.
Training the monitoring might involve the use of secret information about potential bio-risk.
ETA: We of course need to be worried about people jailbreaking our monitoring models. The key hope is that it will be very hard to reliably jailbreak all models in the ensemble given that users won’t be able to iterate against the process. I discuss this more in this comment below.
Domain expert humans will manually audit the queries (and responses) which the process described above rates as most suspicious. Perhaps we have humans carefully audit the most suspicious-seeming 1% of queries. We might also sample some queries to audit uniformly at random or ensure that we always sample at least a few queries from each user.
If our human auditing finds evidence of bioterror, the individual is permanently banned from the API and referred to the FBI (or other bodies) for investigation. We also consider suing the individual in civil court for contract violation. We also consider banning anyone closely associated with the banned individual at least temporarily and take into account potential connections with banned individuals in future KYC checks.
If our human auditing process finds evidence for (attempted) jailbreaking, the individual is permanently banned from the API, and we consider suing them for contract violation.
A key aspect of this approach is we prevent users from iterating against our most important countermeasures. They shouldn’t notice that they failed to evade our countermeasures prior to being banned (in addition to potential criminal investigation and suing them). ETA: this makes it hard to reliably bypass all of the monitoring models.
We verify this approach worked as expected by ensuring that a red team which contains both bio experts and LLM experts can’t execute a moderate number (e.g., 50) of diverse bioterror queries without having a very high probability of being caught. The red team is familiar with the high-level countermeasures but isn’t necessarily informed of non-public implementation details. We’d want to try this with several different red teams.
Here are some possible additional countermeasures:
We could reject queries to the bio endpoint which don’t seem like questions about biology according to our monitoring models. (Or queries which contain fully separable subcomponents which don’t seem to be about biology.) This helps avoid the failure mode in which biology information is elicited from the model without monitoring models even understanding how the query relates to biology.
If monitoring scores indicate moderately high suspicion (but not high enough suspicion for the human auditing process), we could reject the query or route the query to a dumber model which is incapable of substantially helping with bioterrorism. It might be acceptable for this rejection/routing to use a much more aggressive threshold for suspicion than used for auditing (e.g., perhaps we can only afford to have a human audit the 0.2% most suspicious queries, but we can defer/reject the 2% most suspicious). The downside of this is that it allows for some iteration against the API (though we can try to detect iteration and then ban these users).
To avoid continued harm during delays in actually conducting the human audit process, we could “pause” users which have many queries referred to auditing.
Mitigations for large-scale cybercrime
The threat model for LLM cybercrime has pretty different properties than the bioterrorism threat model. A bioterrorism attack might only require hundreds of queries to the LLM in order to cause a catastrophic outcome. In contrast, the risk associated with cybercrime is that we massively reduce the cost of labor involved in hacking. This distinction is important because there are already thousands or tens of thousands of people who are employed as hackers for pay, and so if LLMs aren’t massively superhuman, they have to contribute via lowering costs. As such, if our goal is to ensure less than $100 million in cybercrime damages, it might suffice to avoid more than 100,000 cybercrime queries (unlike the bio case where merely hundreds of queries could be unacceptable). Thus, we can prevent our model from doing unacceptable amounts of cybercrime without needing to have as high of reliability as in the bioterrorism case.
It likely won’t be possible to avoid having our main model (as deployed on an API) be capable of cybercrime. So, we don’t have the option of creating a separate endpoint as in the bio case. Otherwise, we’ll use a somewhat similar set of countermeasures as with the bio approach discussed above. For the following list, I’m assuming that you’ve read the bio section above.
Ideally, we’d require some KYC. This might not be possible for the main API, and we’ll discuss alternatives below.
We’d have a variety of monitoring approaches as discussed in the bio case but applied to cyber. We need to keep cost pretty low, but we can still apply monitoring by weaker (and thus cheaper) models on all inputs. Then, we’d randomly sample inputs to apply more expensive parts of the monitoring ensemble as well as applying these more expensive components on inputs found to be somewhat suspicious by weaker monitoring.
We again apply human auditing on inputs selected by monitoring. Perhaps humans audit the 0.01% most suspicious-seeming inputs.
We again ban users based on this human auditing.
Due to the potential lack of KYC, we might not be able to ban users effectively (as well as missing potential criminal investigation and suing them). We might not be able to apply KYC, but we can at least estimate the total amount of cybercrime queries and guess at total damages. If likely damages reach some threshold, we can institute additional mitigations like actually requiring KYC.
Given that these mitigations have to be on our actual main API, it’s plausible that privacy concerns are a blocker. We can probably do some amount of user info redaction for auditing, but I broadly think we should just aim to set the expectation that queries made to powerful models aren’t private. This is similar to (e.g.) emails sent to contractors.
Fine-tuning?
I think that allowing end-users to fine-tune these models makes the picture notably more complicated, as it might be possible to train knowledge into the model which isn’t present in the monitoring models. There are still possible mitigations. We could also try to resolve this by only allowing large corporate contracts when fine-tuning powerful models.
In addition to the disagreement I express in the rest of this comment, I’m somewhat skeptical of the best altruistic option being to focus on misuse for many people. My view here is somewhat complex, but I’ll try to quickly articulate this in a child comment.
There is also a threat model where AIs are very helpful for bioterror via the mechanism of occasionally giving really good answers and thus only a small number of bioterror queries (1-30) would be dangerous. This threat model could look like having really good high-level ideas for bioterror or acting like a considerably better version of Google which points you toward the ideal online resource. I think this sort of usage probably doesn’t help that much with bioterror, but this might be a crux.
Thanks for the detailed thoughts! I like the overall plan, especially using KYC, human monitoring, and a separate model for bio. I’d like to point out that this plan still uses automated monitoring systems, and to the extent these systems aren’t adversarially robust, the plan will be more expensive and/or less effective.
The only plans that wouldn’t benefit from adversarial robustness are those without automated monitoring. For example, humans could be hired to manually monitor each and every query for attempted misuse. Let’s consider the viability of fully manual monitoring from a financial perspective, and then we can think about how much additional benefit would be provided by automated robust monitoring systems.
First, let’s price out manual monitoring for ChatGPT Plus. The subscription costs $20/month. Suppose the average person makes 10 queries per day, or 300 queries per month, and that it takes 15 seconds for a human monitor to read a query and flag it as misuse. Wages of $5/hour would mean comprehensive human monitoring costs $6.25 per user per month, and wages of $10/hour and $15/hour would translate to monthly per user costs of $12.50 and $18.75 respectively. The cost of full manual monitoring on ChatGPT Plus would therefore amount to much or most of its overall revenue.
Second, we can anchor on Google Search. This assumes that Google is a reasonable reference point for the eventual volume, expenses, and revenues of an AI provider, which might not be a safe assumption in several ways. Nevertheless, Google conducts a ~3 trillion searches per year (varies by data source). If monitoring a single search costs $0.01 (e.g. someone earning $5/hour who monitors 500 searches per hour), then it would cost ~$30B to monitor every search. Would $30B in monitoring costs be financially acceptable? Google Search had revenues of $162B last year, representing 57% of Google’s total revenue. They don’t report expenses for Search specifically, but their overall expenses were $207B. If we assume Search comprises 57% of expenses, then Search would have $118B in annual expenses, against $162B in revenue. Manual monitoring would cost $30B, and would therefore eliminate two-thirds of Google’s search profits.
So these costs would not be prohibitive, but they’d be a large share of overall revenues and profits. A safety-minded company like Anthropic might pay for manual monitoring, but other companies and their investors might be strongly opposed to paying such a high price. They could argue that, just as gun manufacturers are not held liable for murders, AI providers should not have to spend billions to prevent deliberate misuse.
Fortunately, we can reduce the cost of monitoring in many ways. Randomly sampling a small fraction of queries would reduce costs, but also reduce the likelihood of identifying misuse. Flagging keywords like “virus” would catch unsophisticated misuse, but could be evaded (e.g. discussions in a variant of pig latin).
Ideally, you’d be able to use AI systems to identify suspicious queries for human monitoring, but those systems would only be effective to the extent that they’re adversarially robust. If 99% of queries can be reliably discarded as safe, then manual monitoring costs would fall by 99%. But adversarial attacks often succeed in 50% or 100% of attempts against various detection systems. Deploying these unreliable systems would not decrease the costs of manual monitoring much without a corresponding drop in performance.
Overall, I appreciate your point that there are many layers of defense we can use to detect and prevent misuse Fully manual monitoring might be possible, but it would have a huge financial cost. Many companies would be reluctant or unable to pay that price. Robust automated monitoring systems could reduce the cost of monitoring by 90% or 99%, but this would likely require improvements upon today’s state of the art.
I agree that additional adversarial robustness (for both the policy and the monitor) helps considerably, but I think that it likely won’t be necessarily to have better adversarial robustness than the current state of the art (e.g. the robustness of Claude 2). This is due to the use of monitoring ensembles and preventing iteration against the API. My belief that this is likely to work is totally possible to falsify by demonstrating cases where we can’t figure out sufficient countermeasures against a red team with realistic affordances.
I made some “Edit Time Additions” (ETA) to my original comment above for clarity.
As discussed in my comment, I’m imagining being able to have a human review perhaps 0.01% to 1% of queries depending on the use case and other other factors. So, we’d review the most suspicious seeming (according to our overall combined monitoring score) 0.01% to 1% of queries. Equivalently, (assuming actual misuse is rare) we’d be targeting a False Positive Rate (FPR) of 0.01% or 1% (on the ROC curve).
I think the claim:
seems wrong or at least seems to be answering the wrong question.
Let’s suppose that we can afford to review the most suspicious seeming 0.1% of queries for a bio specific API endpoint. Equivalently, we can afford 0.1% FPR for our automated monitoring ensemble. Now, suppose (for simplicity) that catastrophic bio misuse requries 100 queries to the model and we’re happy as long as we catch the user and report them to the FBI prior to them finishing all 100 queries. Then, if our monitoring ensemble at 0.1% FPR catches a given bio misuse query with just 5% probability (for a competent red team), bio misuse is 99.4% likely to be caught prior to succeeding. So, we don’t actually need very high reliability on the ensemble. (Though we might target more conservative bars than 99.4% in practice.)
A reasonable objection here is that maybe we’ll have a large number of people (e.g., >200) trying to create bioweapons using the API and thus one will get lucky. We can avoid this failure mode by having a policy like “if there are more than 3 open FBI investigations into attempted bioweapon construction based on our models, we need to either shut down the API endpoint or spend additional money on human review to increase reliability to a point where the probability of failure is 5x lower”. Then, if we do actually see considerable serious misuse, we can respond dynamically to handle this. At this point, we should also be able to look at case studies to see how effective our safeguards were in practice.
Separately for:
I expect that these numbers weren’t against monitoring ensembles in the sense I described earlier and the red team had additional affordances beyond just understanding the high level description of the monitoring setup? E.g., the red team was able to iterate?
This is correct about the paper I cited, but others have achieved similar attack success rates against models like Claude which use an ensemble of defenses. AFAIK Claude does not ban users who attempt misuse, so that element of your plan has never been tested and would likely help a lot.
To summarize this comment, you’ve proposed that baseline monitoring systems could reduce risk to an acceptable level. Specifically, the monitoring system would need to correctly identify at least 5% of dangerous queries as dangerous (“5% precision”) and avoid incorrectly flagging more than 1 in 1000 safe queries as dangerous (“0.1% FPR”).
I think this level of reliability is possible today (e.g. Claude 2 would likely meet it), but it’s possible that future developments would make defense more difficult. For example, new attack methods have shown LLMs to be less robust to misuse than previously understood. (This is one downside of adversarial robustness research that will become more important as the stakes of adversarial attacks rise.) Perhaps a bigger challenge is the growth of multimodal systems. Defending vision language models is much more difficult than defending pure LLMs. As multimodality becomes standard, we might see adversarial attacks that regularly achieve >95% success rates in bypassing monitoring systems. I’m not particularly confident about how difficult monitoring will be, but it would be beneficial to have monitoring systems which would work even if defense gets much harder in the future.
Overall, these hypotheticals only offer so much information when none of these defenses has ever been publicly built or tested. I think we agree that simple monitoring strategies might be fairly effective and cheap in identifying misuse, and that progress on adversarial robustness would significantly reduce costs by improving the effectiveness of automated monitoring systems.
[Mostly unimportant/nitpicking specific claim]
For cases like bio, I don’t think multimodal is that much of a threat because it should be possible to identify text outputs from the model as problematic without needing the context from the input in most cases. So, we can just have part of our monitoring ensemble ignore image inputs (for instance). Other image defenses for the monitor, such as various compression strategies, are plausible. (I think compression has historically been defeated given white-box access to the compression algorithm, but we don’t necessarily need to assume that. I’m also separately pretty optimistic about various randomization strategies for compression-like things, which I don’t think have been tested in the literature.)
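To gesture at the kind of randomized, compression-like preprocessing I have in mind, here’s a rough sketch (an assumption on my part, not a tested defense): re-encode each image at a randomly chosen JPEG quality before the monitor sees it, so an attacker can’t optimize an adversarial image against a fixed, known preprocessing pipeline.

```python
# Sketch of one randomized, compression-like preprocessing idea (an assumption,
# not a tested defense): re-encode the image at a random JPEG quality before
# the monitoring model sees it, so an attacker can't optimize an adversarial
# image against a fixed, known preprocessing pipeline.
import io
import random

from PIL import Image

def randomized_jpeg_preprocess(image: Image.Image) -> Image.Image:
    quality = random.randint(30, 70)  # arbitrary illustrative range
    buffer = io.BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return Image.open(buffer)
```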
In general, there is a tradeoff related to removing information from the monitor: the more info you remove, the harder it is for the monitor to do a good job, but removing info also makes adversarial attacks harder.
Aside: Removing information for the monitor in the collusion case
We discuss this sort of monitoring tradeoff in a different case in our recent paper on AI control, in the context of trying to prevent collusion (which is different from adversarial attacks but related). I think the similarity is somewhat interesting here.
In section 3.4.2 we say:
Also, I’d love to see research that simulates the position of a company trying to monitor misuse, and allows for the full range of defenses that you proposed. There could be a dataset of 1 trillion queries containing 100 malicious queries. Perhaps each query is accompanied by a KYC ID. Their first line of defense would be robust refusal to cause harm, and the second line would be automated detection of adversarial attacks. The company could also have a budget which can be spent on “human monitoring,” which would give them access to the ground truth label of whether a query is malicious for a fixed price. I’d have to think about which elements would be the most tractable for making research progress, but the fact that AI companies need to solve this problem suggests that perhaps external researchers should work on it too.
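To gesture at what such a simulation could look like, here’s a heavily scaled-down sketch; all numbers, the suspicion-score model, and the review-budget logic are hypothetical stand-ins rather than a real benchmark.

```python
# Heavily scaled-down sketch of the proposed setup. All numbers, the suspicion
# score model, and the review-budget logic are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n_queries, n_malicious = 1_000_000, 100  # scaled down from 1 trillion for illustration
is_malicious = np.zeros(n_queries, dtype=bool)
is_malicious[rng.choice(n_queries, n_malicious, replace=False)] = True

# Stand-in for the automated monitoring ensemble: malicious queries score
# higher on average, but with plenty of overlap.
suspicion = rng.normal(0.0, 1.0, n_queries) + 2.0 * is_malicious

# Spend the human-review budget on the most suspicious 0.1% of queries;
# human review reveals the ground-truth label.
budget = int(0.001 * n_queries)
reviewed = np.argsort(suspicion)[-budget:]
caught = int(is_malicious[reviewed].sum())
print(f"caught {caught} of {n_malicious} malicious queries at a 0.1% review rate")
```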
In this comment, I’ll discuss why I’m somewhat skeptical that focusing on misuse done via an API is the best altruistic option for many people[1]. (As mentioned in a footnote to the prior comment.) Part of my perspective is that the safeguards are relatively doable, as I discussed in the parent comment, but even if I were convinced this was false, I would still be somewhat skeptical that focusing on misuse is the best altruistic option.
(Edit: Note that I think that avoiding model theft seems very useful and I’m just arguing about mitigations for misuse over an API.)
The main threat models for serious harm from misuse over an API which I see are: large-scale cybercrime, large-scale persuasion, bioterrorism, and large-scale unauthorized AI capabilities research.
Large-scale cyber/persuasion
As noted in the parent, cyber probably requires a huge number of queries to inflict huge harm. I think the same is likely true for persuasion. Thus, society is pretty likely to notice large-scale cyber/persuasion prior to extremely large damages (such as costs equivalent to >10 million people being killed) and once society notices, heavily restricting APIs for powerful models is pretty politically feasible[2].
Bioterrorism
As far as bioterrorism, I’m inside view not-that-sold that having competent AI assistance makes human-caused catastrophic (>100 million dead) bioterrorism[3] that much more likely in absolute terms. However, note that many people with more knowledge than me think this is a large concern and also claim that there are important infohazards which are part of why they think the risk is high (see e.g. here for more discussion).
As far as why I am inside view not-that-sold, it’s basically because I still think that there might be substantial other blockers like physical iteration and the existence of reasonably competent people with a motive for scope-sensitive killing. (AI will have basic safeguards and won’t be able to do literally everything until after obsolescence and thus the competence bar won’t be made that low.) These sorts of counterarguments are discussed in more detail here (though note that I don’t overall agree with the linked comment!).
The viability of safeguards also feels very salient for bio in particular, as just removing bio from the training data seems pretty doable in practice.
Large-scale unauthorized AI capabilities research
In the future, it might be necessary to institute some sort of pause on AI development, including development of better algorithms. This could be due to the potential for an uncontrolled intelligence explosion which can’t be restricted with hardware controls alone. Thus, AI labs might want to prevent people from using their models a huge amount for unauthorized capabilities research.
It’s less clear that society would notice this going on prior to it being too late, but I think it should be quite easy for at least AI labs to notice if there is a lot of unauthorized capabilities research via randomized monitoring. Thus, the same arguments as in the large-scale cyber/persuasion section might apply (though the political saliency is way less).
Substantial harm due to misuse seems like a huge threat to the commercial success of AI labs. As such, insofar as you think that increasing the power of a given AI lab is useful, working on addressing misuse at that AI lab seems worthwhile. And more generally, AI labs interested in retaining and acquiring power would probably benefit from heavily investing in preventing misuse of varying degrees of badness.
I think the direct altruistic cost of restricting APIs like this is reasonably small due to these restrictions only occurring for a short window of time (<10 years) prior to human obsolescence. Heavily restricting APIs might result in a pretty big competitiveness hit, but we could hope for at least US national action (international coordination would be preferred).
Except via the non-misuse perspective of advancing bio research and then causing accidental release of powerful gain-of-function pathogens.
I specifically avoided claiming that adversarial robustness is the best altruistic option for a particular person. Instead, I’d like to establish that progress on adversarial robustness would have significant benefits, and therefore should be included in the set of research directions that “count” as useful AI safety research.
Over the next few years, I expect AI safety funding and research will (and should) dramatically expand. Research directions that would not make the cut at a small organization with a dozen researchers should still be part of the field of 10,000 people working on AI safety later this decade. Currently I’m concerned that the field focuses on a small handful of research directions (mainly mechinterp and scalable oversight) which will not be able to absorb such a large influx of interest. If we can lay the groundwork for many valuable research directions, we can multiply the impact of this large population of future researchers.
I don’t think adversarial robustness should be more than 5% or 10% of the research produced by AI safety-focused researchers today. But some research (e.g. 1, 2) from safety-minded folks seems very valuable for raising the number of people working on this problem and refocusing them on more useful subproblems. I think robustness should also be included in curricula that educate people about safety and in research agendas for the field.
I agree with basically all of this and apologies for writing a comment which doesn’t directly respond to your post (though it is a relevant part of my views on the topic).
That’s cool; I appreciate the prompt to discuss what is a relevant question.
It sounds like you’re excluding cases where weights are stolen. That makes sense in the context of adversarial robustness, but it seems like you need to address those cases to make a general argument about misuse threat models.
Agreed, I should have been more clear. Here I’m trying to argue about the question of whether people should work on research to mitigate misuse from the perspective of avoiding misuse through an API.
There is a separate question of reducing misuse concerns either via:
Trying to avoid weights being stolen/leaking
Trying to mitigate misuse in worlds where weights are very broadly proliferated (e.g. open source)
I do think these arguments contain threads of a general argument that causing catastrophes is difficult under any threat model. Let me make just a few non-comprehensive points here:
On cybersecurity, I’m not convinced that AI changes the offense-defense balance. Attackers can use AI to find and exploit security vulnerabilities, but defenders can use it to fix them.
On persuasion, first, rational agents can simply ignore cheap talk if they expect it not to help them. Humans are not always rational, but if you’ve ever tried to convince a dog or a baby to drop something that they want, you’ll know cheap talk is ineffective and only coercion will suffice.
Second, AI is far from the first dramatic change in communications technology in human history. Spoken language, written language, the printing press, telephones, radio, TV, and social media all changed how people can be persuaded, and each of these might be a bigger change than AI. These technologies often contributed to political and social upheaval, including catastrophes for particular ways of life, and AI might do the same. But overall I’m glad these changes occurred, and I wouldn’t expect the foreseeable versions of AI persuasion (i.e. personalized chatbots) to be much more impactful than these historical changes. See this comment and thread for more discussion.
Bioterrorism seems like the biggest threat. The obstacles there have been thoroughly discussed.
If causing catastrophes is difficult, this should reduce our concern with both misuse and rogue AIs causing sudden extinction. Other concerns like military arms races, lock-in of authoritarian regimes, or Malthusian outcomes in competitive environments would become relatively more important.
I agree that “causing catastrophes is difficult” should reduce concerns with “rogue AIs causing sudden extinction (or merely killing very large numbers of people like >1 billion)”.
However, I think these sorts of considerations don’t reduce the risk of AI takeover or other catastrophes due to rogue AI as much as you might think, for a few reasons:
Escaped rogue AIs might be able to do many obviously bad actions over a long period autonomously. E.g., acquire money, create a cult, use this cult to build a bioweapons lab, and then actually develop bioweapons over a long-ish period (e.g., 6 months) using >tens of thousands of queries to the AI. This looks quite different from the misuse threat model, which required that omnicidal (or otherwise bad) humans possess the agency to make the right queries to the AI and solve the problems that the AI can’t solve. For instance, those humans have to ensure that their queries are sufficiently subtle/jailbreaking to avoid detection via various other mechanisms. The rogue AI can train humans over a long period, and all the agency/competence can come from the rogue AI. So, even if misuse by humans is unlikely, autonomous rogue AIs making weapons of mass destruction is perhaps more likely.
Escaped rogue AIs are unlike misuse in that even if we notice a clear and serious problem, we might be able to do less about it. E.g., the AIs might have already built hidden datacenters we can’t find. Even if they haven’t and are just autonomously replicating on the internet, shutting down the internet is extremely costly and only postpones the problem.
AI takeover can route through mechanisms other than sudden catastrophe/extinction. E.g., allying with rogue states, or creating a rogue-AI-run AI lab which builds even more powerful AI as fast as possible. (I’m generally somewhat skeptical of AIs trying to cause extinction for reasons discussed here, here, and here. Though causing huge amounts of damage (e.g., >1 billion dead) seems somewhat more plausible as a thing rogue AIs would try to do.)
Yep, agreed on the individual points, not trying to offer a comprehensive assessment of the risks here.
You didn’t mention the policy implications, which I think are one of the most impactful reasons (if not the most impactful reason) to care about misuse. Government regulation seems super important long-term to prevent people from deploying dangerous models publicly, and the only way to get that is by demonstrating that models are actually scary.
Agreed. However, in this case, building countermeasures to prevent misuse doesn’t particularly help. The evaluations for potentially dangerous capabilities are highly relevant.