In this comment, I’ll discuss why I’m somewhat skeptical that focusing on misuse done via an API is the best altruistic option for many people[1]. (As mentioned in a footnote to the prior comment.) Part of my perspective is that the safeguards are relatively doable, as I discussed in the parent comment, but even if I were convinced this was false, I would still be somewhat skeptical that focusing on misuse is the best altruistic option.
(Edit: Note that I think avoiding model theft seems very useful; I’m just arguing about mitigations for misuse over an API.)
The main threat models for serious harm from misuse over an API which I see are: large-scale cybercrime, large-scale persuasion, bioterrorism, and large-scale unauthorized AI capabilities research.
Large-scale cyber/persuasion
As noted in the parent, cyber probably requires a huge number of queries to inflict huge harm. I think the same is likely true for persuasion. Thus, society is pretty likely to notice large-scale cyber/persuasion before damages become extremely large (such as costs equivalent to >10 million people being killed), and once society notices, heavily restricting APIs for powerful models is pretty politically feasible[2].
Bioterrorism
As far as bioterrorism, my inside view is that I’m not that sold on competent AI assistance making human-caused catastrophic (>100 million dead) bioterrorism[3] that much more likely in absolute terms. However, note that many people with more knowledge than me think this is a large concern, and they also claim that there are important infohazards which are part of why they think the risk is high (see e.g. here for more discussion).
As for why my inside view isn’t that sold, it’s basically because I still think there might be substantial other blockers, like physical iteration and the existence of reasonably competent people with a motive for scope-sensitive killing. (AI will have basic safeguards and won’t be able to do literally everything until after human obsolescence, so the competence bar won’t be lowered that much.) These sorts of counterarguments are discussed in more detail here (though note that I don’t overall agree with the linked comment!).
The viability of safeguards also feels very salient for bio in particular, as just removing bio from the training data seems pretty doable in practice.
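To make "removing bio from the training data" a bit more concrete, here is a minimal sketch of what data-side filtering could look like. The keyword list, the is_bio_related helper, and the corpus format are all hypothetical illustrations (none of this describes any lab’s actual pipeline); in practice one would presumably use trained classifiers rather than keyword matching, but the basic shape (filter documents before training) is the same.

```python
# Hypothetical sketch: drop documents that look bio-related from a training corpus.
# A real pipeline would likely combine keyword prefiltering with a trained classifier.

BIO_KEYWORDS = {  # illustrative only, not a real blocklist
    "virology", "pathogen", "gain-of-function", "aerosolization",
    "reverse genetics", "culturing protocol",
}

def is_bio_related(document: str) -> bool:
    """Flag a document if it contains any of the illustrative bio keywords."""
    text = document.lower()
    return any(keyword in text for keyword in BIO_KEYWORDS)

def filter_corpus(documents):
    """Yield only documents that pass the (illustrative) bio filter."""
    for doc in documents:
        if not is_bio_related(doc):
            yield doc

# Usage sketch:
# clean_docs = list(filter_corpus(raw_docs))
```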
Large-scale unauthorized AI capabilities research
In the future, it might be necessary to institute some sort of pause on AI development, including development of better algorithms. This could be due to the potential for an uncontrolled intelligence explosion which can’t be restricted with hardware controls alone. Thus, AI labs might want to prevent people from using their models a huge amount for unauthorized capabilities research.
It’s less clear that society would notice this going on before it’s too late, but I think it should be quite easy for at least AI labs to notice a lot of unauthorized capabilities research by doing randomized monitoring of queries. Thus, the same arguments as in the large-scale cyber/persuasion section might apply (though the political salience is much lower).
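As a rough illustration of what randomized monitoring could look like, here is a minimal sketch. The sampling rate, the looks_like_capabilities_research classifier, and the review queue are hypothetical placeholders, not a description of any lab’s actual system; the point is just that sampling a small fraction of traffic is cheap and catches large-scale patterns.

```python
import random

# Hypothetical sketch of randomized monitoring: sample a small fraction of API
# queries and send the flagged ones for human (or trusted-model) review.

SAMPLE_RATE = 0.01  # review ~1% of traffic; illustrative number only

def looks_like_capabilities_research(query: str) -> bool:
    """Placeholder classifier; in practice this would be a trained model."""
    suspicious_phrases = ("training run", "scaling law", "architecture ablation")
    return any(phrase in query.lower() for phrase in suspicious_phrases)

def monitor(query: str, review_queue: list) -> None:
    """With small probability, inspect the query and enqueue it if it looks
    like unauthorized capabilities research."""
    if random.random() < SAMPLE_RATE and looks_like_capabilities_research(query):
        review_queue.append(query)

# Usage sketch:
# queue = []
# for q in incoming_queries:
#     monitor(q, queue)
```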
Substantial harm due to misuse seems like a huge threat to the commercial success of AI labs. As such, insofar as you think that increasing the power of a given AI lab is useful, working on addressing misuse at that lab seems worthwhile. More generally, AI labs interested in retaining and acquiring power would probably benefit from heavily investing in preventing misuse of varying degrees of badness.
I think the direct altruistic cost of restricting APIs like this is reasonably small, because these restrictions would only be in place for a short window of time (<10 years) prior to human obsolescence. Heavily restricting APIs might result in a pretty big competitiveness hit, but we could hope for at least US national action (international coordination would be preferred).
Except via the non-misuse perspective of advancing bio research and then causing accidental release of powerful gain-of-function pathogens.
I specifically avoided claiming that adversarial robustness is the best altruistic option for a particular person. Instead, I’d like to establish that progress on adversarial robustness would have significant benefits, and therefore should be included in the set of research directions that “count” as useful AI safety research.
Over the next few years, I expect AI safety funding and research will (and should) dramatically expand. Research directions that would not make the cut at a small organization with a dozen researchers should still be part of the field of 10,000 people working on AI safety later this decade. Currently I’m concerned that the field focuses on a small handful of research directions (mainly mechinterp and scalable oversight) which will not be able to absorb such a large influx of interest. If we can lay the groundwork for many valuable research directions, we can multiply the impact of this large population of future researchers.
I don’t think adversarial robustness should be more than 5% or 10% of the research produced by AI safety-focused researchers today. But some research (e.g. 1, 2) from safety-minded folks seems very valuable for raising the number of people working on this problem and refocusing them on more useful subproblems. I think robustness should also be included in curricula that educate people about safety and in research agendas for the field.
I agree with basically all of this, and I apologize for writing a comment that doesn’t directly respond to your post (though it is a relevant part of my views on the topic).
That’s cool, appreciate the prompt to discuss what is a relevant question.
It sounds like you’re excluding cases where weights are stolen. That makes sense in the context of adversarial robustness, but it seems like you need to address those cases to make a general argument about misuse threat models.
Agreed, I should have been clearer. Here I’m trying to address the question of whether people should work on research to mitigate misuse, specifically from the perspective of avoiding misuse through an API.
There is a separate question of reducing misuse concerns either via:
Trying to avoid weights being stolen/leaking
Trying to mitigate misuse in worlds where weights are very broadly proliferated (e.g. open source)
I do think these arguments contain threads of a general argument that causing catastrophes is difficult under any threat model. Let me make just a few non-comprehensive points here:
On cybersecurity, I’m not convinced that AI changes the offense-defense balance. Attackers can use AI to find and exploit security vulnerabilities, but defenders can use it to fix them.
On persuasion, first, rational agents can simply ignore cheap talk if they expect it not to help them. Humans are not always rational, but if you’ve ever tried to convince a dog or a baby to drop something that they want, you’ll know cheap talk is ineffective and only coercion will suffice.
Second, AI is far from the first dramatic change in communications technology in human history. Spoken language, written language, the printing press, telephones, radio, TV, and social media all changed how people can be persuaded, and many of these might be bigger changes than AI. These technologies often contributed to political and social upheaval, including catastrophes for particular ways of life, and AI might do the same. But overall I’m glad these changes occurred, and I wouldn’t expect the foreseeable versions of AI persuasion (i.e. personalized chatbots) to be much more impactful than these historical changes. See this comment and thread for more discussion.
Bioterrorism seems like the biggest threat. The obstacles there have been thoroughly discussed.
If causing catastrophes is difficult, this should reduce our concern with both misuse and rogue AIs causing sudden extinction. Other concerns like military arms races, lock-in of authoritarian regimes, or Malthusian outcomes in competitive environments would become relatively more important.
I agree that “causing catastrophes is difficult” should reduce concerns with “rogue AIs causing sudden extinction (or merely killing very large numbers of people like >1 billion)”.
However, I think these sorts of considerations don’t reduce the risk of AI takeover or other catastrophes due to rogue AI as much as you might think, for a few reasons:
Escaped rogue AIs might be able to do many obviously bad actions autonomously over a long period. E.g., acquire money, create a cult, use this cult to build a bioweapons lab, and then actually develop bioweapons over a long-ish period (e.g., 6 months) using >tens of thousands of queries to the AI. This looks quite different from the misuse threat model, which requires that omnicidal (or otherwise bad) humans possess the agency to make the right queries to the AI and solve the problems that the AI can’t solve. For instance, humans have to ensure that queries are sufficiently subtle/jailbroken to avoid detection via various other mechanisms. A rogue AI can train humans over a long period, and all the agency/competence can come from the rogue AI. So, even if misuse by humans is unlikely, autonomous rogue AIs making weapons of mass destruction is perhaps more likely.
Escaped rogue AIs are unlike misuse in that even if we notice a clear and serious problem, there might be less we can do about it. E.g., the AIs might have already built hidden datacenters we can’t find. Even if they haven’t and are just autonomously replicating on the internet, shutting down the internet is extremely costly and only postpones the problem.
AI takeover can route through mechanisms other than sudden catastrophe/extinction. E.g., allying with rogue states, or creating a rogue-AI-run AI lab which builds even more powerful AI as fast as possible. (I’m generally somewhat skeptical of AIs trying to cause extinction for reasons discussed here, here, and here. Though causing huge amounts of damage (e.g., >1 billion dead) seems somewhat more plausible as something rogue AIs would try to do.)
Yep, agreed on the individual points, not trying to offer a comprehensive assessment of the risks here.
You didn’t mention the policy implications, which I think are one of the most impactful reasons (if not the most impactful reason) to care about misuse. Government regulation seems super important long-term to prevent people from deploying dangerous models publicly, and the only way to get that is by demonstrating that models are actually scary.
Agreed. However, in this case, building countermeasures to prevent misuse doesn’t particularly help; what’s highly relevant is evaluations for potentially dangerous capabilities.