AI safety people often emphasize making safety cases as the core organizational approach to ensuring safety. I think this might cause people to anchor on relatively bad analogies to other fields.
Safety cases are widely used in fields that do safety engineering, e.g. airplanes and nuclear reactors. See e.g. “Arguing Safety” for my favorite introduction to them. The core idea of a safety case is to have a structured argument that clearly and explicitly spells out how all of your empirical measurements allow you to make a sequence of conclusions that establish that the risk posed by your system is acceptably low. Safety cases are somewhat controversial among safety engineering pundits.
But the AI context has a very different structure from those fields, because all of the risks that companies are interested in mitigating with safety cases are fundamentally adversarial (with the adversary being AIs and/or humans). There’s some discussion of adapting the safety-case-like methodology to the adversarial case (e.g. Alexander et al, “Security assurance cases: motivation and the state of the art”), but this seems to be quite experimental and it is not generally recommended. So I think it’s very unclear whether a safety-case-like structure should actually be an inspiration for us.
More generally, I think we should avoid anchoring on safety engineering as the central field to draw inspiration from. Safety engineering mostly involves cases where the difficulty arises from the fact that you’ve built extremely complicated systems and need to manage the complexity; here our problems arise from adversarial dynamics on top of fairly simple systems built out of organic, hard-to-understand parts. We should expect these to be fairly dissimilar.
(I think information security is also a pretty bad analogy—it’s adversarial, but like safety engineering it’s mostly about managing complexity, which is not at all our problem.)
I agree we don’t currently know how to prevent AI systems from becoming adversarial, and that until we do it seems hard to make strong safety cases for them. But I think this inability is a skill issue, not an inherent property of the domain, and traditionally the core aim of alignment research was to gain this skill.
Plausibly we don’t have enough time to figure out how to gain as much confidence that transformative AI systems are safe as we typically have about e.g. single airplanes, but in my view that’s horrifying, and I think it’s useful to notice how different this situation is from the sort humanity is typically willing to accept.
Yeah, I agree the situation is horrifying and not consistent with, e.g., how risk-aversely we treat airplanes.
Right, but then from my perspective it seems like the core problem is that the situations are currently disanalogous, and so it feels reasonable and important to draw the analogy.
Part of Buck’s point is that airplanes are disanalogous for another reason beyond being fundamentally adversarial: airplanes consist of many complex parts we can understand, while AI systems are ~~simple but built out of simple parts we can understand~~ simple systems built out of a small number of (complex) black-box components.
Separately, it’s worth noting that users being adversarial will be part of the situation. (Though maybe this sort of misuse poses relatively minimal risk.)
What’s the sense in which you think they’re more simple? Airplanes strike me as having a much simpler fail surface.
I messed up the wording for that part of the sentence. Does it make more sense now?
I’m still confused what sort of simplicity you’re imagining? From my perspective, the type of complexity which determines the size of the fail surface for alignment mostly stems from things like e.g. “degree of goal stability,” “relative detectability of ill intent,” and other such things that seem far more complicated than airplane parts.
I think the system built out of AI components will likely be pretty simple—as in, the scaffolding and bureaucracy surrounding the AI will be simple.
The AI components themselves will likely be black-box.
Maybe I’m just confused what you mean by those words, but where is the disanalogy with safety engineering coming from? That normally safety engineering focuses on mitigating risks with complex causes, whereas AI risk is caused by some sort of scaffolding/bureaucracy which is simpler?
Buck’s claim is that safety engineering is mostly focused on problems where there are a huge number of parts that we can understand and test (e.g. airplanes) and the main question is about ensuring failure rates are sufficiently low under realistic operating conditions.
In the case of AIs, we might have systems that look more like a single (likely black-box) AI with access to various tools and the safety failures we’re worried about are concentrated in the AI system itself.
This seems much more analogous to insider threat reduction work than the places where safety engineering is typically applied.
I agree we might end up in a world like that, where it proves impossible to make a decent safety case. I just think of the ~whole goal of alignment research as figuring out how to avoid that world, i.e. of figuring out how to mitigate/estimate the risk as much/precisely as needed to make TAI worth building.
Currently, AI risk estimates are mostly just verbal statements like “I don’t know man, probably some double digit chance of extinction.” This is exceedingly unlike the sort of predictably tolerable risk humanity normally expects from its engineering projects, and which e.g. allows for decent safety cases. So I think it’s quite important to notice how far we currently are from being able to make them, since that suggests the scope and nature of the problem.
I don’t think that thing I said is consistent with “impossible to make a safety case good enough to make TAI worth building”? I think you can probably make a safety case which gets to around 1-5% risk while having AIs that are basically black boxes and which are very powerful, but not arbitrarily powerful. (Such a safety case might require decently expensive measures.) (1-5% risk can be consistent with this being worth it—e.g. if there is an impending hot war. That said, it is still a horrifying level of risk that demands vastly more investment.)
See e.g. control for what one part of this safety case could look like. I think that control can go quite far.
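One control-style measure sometimes discussed is “trusted monitoring”: a weaker trusted model scores the untrusted model’s outputs, and the most suspicious fraction gets routed to human audit. A minimal sketch of that routing logic is below; the model and scorer are stand-in functions, and the audit fraction is a hypothetical number chosen for illustration, not a recommendation from the source.

```python
# Toy sketch of trusted monitoring: score each untrusted output with a
# trusted monitor, audit the most suspicious fraction, ship the rest.
# All components here are hypothetical stand-ins.
AUDIT_FRACTION = 0.02  # illustrative audit budget


def route(tasks, untrusted_model, monitor_score):
    """Return (audited, shipped) lists of (task, output) pairs."""
    outputs = [(task, untrusted_model(task)) for task in tasks]
    # Most suspicious first, according to the trusted monitor.
    ranked = sorted(outputs, key=lambda pair: monitor_score(pair[1]), reverse=True)
    n_audit = max(1, int(AUDIT_FRACTION * len(ranked)))
    return ranked[:n_audit], ranked[n_audit:]


# Dummy usage: identity "model", suspicion score equal to the output value.
audited, shipped = route(list(range(100)), lambda t: t, lambda o: o)
print(len(audited), len(shipped))  # → 2 98
```

The point of the sketch is only that the safety-relevant logic (the routing and audit budget) is simple, while everything interesting is hidden inside the black-box model calls.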
Other parts could look like:
A huge array of model organisms experiments which convince us that P(serious misalignment) is low (or low given XYZ adjustments to training).
Coup probes or other simple runtime detection/monitoring techniques.
ELK/honesty techniques which seem to work in a wide variety of cases.
Ensuring AIs closely imitate humans and generalize similarly to humans in the cases we checked.
Ensuring that AIs are very unlikely to be schemers via making sure most of their reasoning happens in CoT and their forward passes are quite weak. (I often think about this from the perspective of control, but it can also be considered separately.)
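To make the runtime-detection bullet above concrete: a “coup probe” is often described as something like a linear classifier trained on a model’s internal activations. The sketch below trains such a probe on synthetic data; the “activations” are made-up Gaussian clusters, so it illustrates only the shape of the technique, not real detectability.

```python
# Toy "coup probe": a logistic-regression classifier over (synthetic)
# model activations. Real activations would come from a model's residual
# stream; here they are illustrative Gaussian clusters.
import math
import random

random.seed(0)
DIM = 8  # hypothetical activation dimensionality


def sample(label, n):
    # Benign (label 0) vs. "coup-intent" (label 1) activations are drawn
    # from separated clusters -- a purely illustrative assumption.
    mu = 1.0 if label == 1 else -1.0
    return [([random.gauss(mu, 1.0) for _ in range(DIM)], label) for _ in range(n)]


train = sample(0, 200) + sample(1, 200)
test = sample(0, 100) + sample(1, 100)
random.shuffle(train)

# Train the linear probe with plain stochastic gradient descent.
w, b, lr = [0.0] * DIM, 0.0, 0.1
for _ in range(50):
    for x, y in train:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        z = max(-30.0, min(30.0, z))  # clamp for numerical safety
        p = 1.0 / (1.0 + math.exp(-z))
        g = p - y  # gradient of logistic loss w.r.t. z
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g


def flag(x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0.0


accuracy = sum(flag(x) == bool(y) for x, y in test) / len(test)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

On cleanly separated synthetic clusters the probe is near-perfect; the open empirical question is whether real “ill intent” is anywhere near this linearly detectable.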
It’s possible you don’t think any of this stuff gets off the ground. Fair enough if so.
I think that my OP was in hindsight taking for granted that we have to analyze AIs as adversarial. I agree that you could theoretically have safety cases where you never need to reason about AIs as adversarial; I shouldn’t have ignored that possibility, thanks for pointing it out.
IMO it’s unlikely that we’re ever going to have a safety case that’s as reliable as the nuclear physics calculations that showed that the Trinity Test was unlikely to ignite the atmosphere (where my impression is that the risk was mostly dominated by risk of getting the calculations wrong). If we have something that is less reliable, then will we ever be in a position where only considering the safety case gives a low enough probability of disaster for launching an AI system beyond the frontier where disastrous capabilities are demonstrated?
Thus, in practice, decisions will probably not be made on a safety case alone, but also based on some positive case for the benefits of deployment (e.g. estimated reduced x-risk, advancing the “good guys” in the race, CEO has positive vibes that enough risk mitigation has been done, etc.). It’s not clear what role governments should have in assessing this, maybe we can only get assessment of the safety case, but it’s useful to note that safety cases won’t be the only thing that informs these decisions.
This situation is pretty disturbing, and I wish we had a better way, but it still seems useful to push the positive benefit case more towards “careful argument about reduced x-risk” and away from “CEO vibes about whether enough mitigation has been done”.
Relevant paper discussing the risk of risk assessments themselves being wrong due to theory/model/calculation error: “Probing the Improbable: Methodological Challenges for Risks with Low Probabilities and High Stakes”.
Based on the current vibes, I think this suggests that methodological errors alone will lead to a significant chance of significant error for any safety case in AI.
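The core argument of that paper can be shown with a two-line calculation: a safety case that argues for a vanishingly small risk is floored by the chance that the argument itself is flawed. All the numbers below are invented for illustration, not estimates from the source.

```python
# Toy version of the "Probing the Improbable" point: the total risk is
# bounded below by P(argument flawed) * P(disaster | argument flawed),
# no matter how small a risk the safety case claims. Numbers are made up.
p_case_says = 1e-9       # risk level the safety case argues for
p_case_flawed = 0.10     # chance the model/theory/calculation is wrong
p_risk_if_flawed = 0.05  # risk conditional on the case being wrong

p_total = (1 - p_case_flawed) * p_case_says + p_case_flawed * p_risk_if_flawed
print(f"{p_total:.4f}")  # → 0.0050, dominated entirely by the flawed-case term
```

Even though the safety case claims one-in-a-billion, the credible risk is ~0.5%: six orders of magnitude higher, and set almost entirely by the methodological-error term.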
I agree that “CEO vibes about whether enough mitigation has been done” seems pretty unacceptable.
I agree that in practice, labs probably will go forward with deployments that have >5% probability of disaster; I think it’s pretty plausible that the lab will be under extreme external pressure (e.g. some other country is building a robot army that’s a month away from being ready to invade) that causes me to agree with the lab’s choice to do such an objectively risky deployment.
Would be nice if it was based on “actual robot army was actually being built and you have multiple confirmatory sources and you’ve tried diplomacy and sabotage and they’ve both failed” instead of “my napkin math says they could totally build a robot army bro trust me bro” or “they totally have WMDs bro” or “we gotta blow up some Japanese civilians so that we don’t have to kill more Japanese civilians when we invade Japan bro” or “dude I’m seeing some missiles on our radar, gotta launch ours now bro”.
Earlier, you said that maybe physical security work was the closest analogy. Do you still think this is true?
Here’s how I understand your argument:
Some people are advocating for safety cases– the idea that companies should be required to show that risks drop below acceptable levels.
This approach is used in safety engineering fields.
But AI is different from the safety engineering fields. For example, in AI we have adversarial risks.
Therefore we shouldn’t support safety cases.
I think this misunderstands the case for safety cases, or at least only argues against one particular justification for safety cases.
Here’s how I think about safety cases (or really any approach in which a company needs to present evidence that their practices keep risks below acceptable levels):
AI systems pose major risks. A lot of risks stem from race dynamics and competitive pressures.
If companies were required to demonstrate that they kept risks below acceptable levels, this would incentivize a lot more safety research and curb some of the dangerous properties of race dynamics.
Other fields also have similar setups, and we should try to learn from them when relevant. Of course, AI development will also have some unique properties so we’ll have to adapt the methods accordingly.
I’d be curious to hear more about why you think safety cases fail to work when risks are adversarial (at first glance, it doesn’t seem like it should be too difficult to adapt the high-level safety case approach).
I’m also curious if you have any alternatives that you prefer. I currently endorse the claim “safety cases are better than status quo” but I’m open to the idea that maybe “Alternative approach X is better than both safety cases and status quo.”
Yeah, in your linked paper you write “In high-stakes industries, risk management practices often require affirmative evidence that risks are kept below acceptable thresholds.” This is right. But my understanding is that this is not true of industries that deal with adversarial high-stakes situations. So I don’t think you should claim that your proposal is backed up by standard practice. See here for a review of the possibility of using safety cases in adversarial situations.
Another concern about safety cases: they feel very vulnerable to regulatory capture. You can imagine a spectrum from “legislate that AI labs should do X, Y, Z, as enforced by regulator R” to “legislate that AI labs need to provide a safety case, which we define as anything that satisfies regulator R”. In the former case, you lose flexibility, but the remit of R is very clear. In the latter case, R has a huge amount of freedom to interpret what counts as an adequate safety case. This can backfire badly if R is not very concerned about the most important threat models; and even if they are, the flexibility makes it easier for others to apply political pressure to them (e.g. “find a reason to approve this safety case, it’s in the national interest”).
@Richard_Ngo do you have any alternative approaches in mind that are less susceptible to regulatory capture? At first glance, I think this broad argument can be applied to any situation where the government regulates anything. (There’s always some risk that R focuses on the wrong things or R experiences corporate/governmental pressure to push things through).
I do agree that the broader or more flexible the regulatory regime is, the more susceptible it might be to regulatory capture. (But again, this feels like it doesn’t really have much to do with safety cases– this is just a question of whether we want flexible or fixed/inflexible regulations in general.)
On the spectrum I outlined, the “legislate that AI labs should do X, Y, Z, as enforced by regulator R” end is less susceptible to regulatory capture (at least after the initial bill is passed).
This is definitely a tradeoff space!
YES, there is a tradeoff here and yes regulatory capture is real, but there are also plenty of benign agencies that balance these things fairly well. Most people on these forums live in nations where regulators do a pretty darn good job on the well-understood problems and balance these concerns fairly well. (Inside Context Problems?)
You tend to see design of regulatory processes that requires stakeholder input; in particular, the modern American regulatory state’s reliance on the Administrative Procedure Act means that it’s very difficult for a regulator to regulate without getting feedback from a wide variety of external stakeholders, ensuring that they have some flexibility without being arbitrary.
I also think, contrary to conventional wisdom, that your concern is part of why many regulators end up in a “revolving-door” mechanism—you often want individuals moving back and forth between those two worlds to cross-populate assumptions and check for areas where regulation has gotten misaligned with end goals.
What about this alternate twist: safety cases are the right model, but it just happens to be extremely difficult to make an adequate safety case for competent agentic AGI (or anything close).
Introducing the safety case model for near-future AI releases could normalize that. It should be pretty easy to make a safety case for GPT-4 and Claude 3.5. When people want to deploy real AGI, they won’t be able to make the safety case without real advances in alignment. And that’s the point.
My guess is that it’s infeasible to ask them to delay until they have a real safety case, due to insufficient coordination (including perhaps international competition).
Isn’t that an argument against almost any regulation? The bar on “safety case” can be adjusted up or down, and for better or worse will be.
I think real safety cases are currently very easy: sure, a couple of eggs might be broken by making an information and idea search somewhat better than Google. Some jackass will use it for ill, and thus cause slightly more harm than they would’ve with Google. But the increase in risk of major harms is tiny compared to the massive benefits of giving everyone something like a competent assistant who’s near-expert in almost every domain.
Maybe we’re too hung up on downsides as a society to make this an acceptable safety case. Our societal and legal risk-aversion might make safety cases a nonstarter, even though they seem likely to be the correct model if we could collectively think even close to clearly.