I believe that by not touching the “decrease the race” or “don’t make the race worse” interventions, this playbook misses a big part of the picture of “how one single thing could help massively”. And this core consideration is also why I don’t think that the “Successful, careful AI lab” intervention is right.
Staying at the frontier of capabilities and deploying leads the frontrunner to feel the heat, which accelerates both capabilities and the chances of careless deployment, which in turn increases the chances of extinction pretty substantially.
Thanks for this comment—I get vibes along these lines from a lot of people but I don’t think I understand the position, so I’m enthused to hear more about it.
> I believe that by not touching the “decrease the race” or “don’t make the race worse” interventions, this playbook misses a big part of the picture of “how one single thing could help massively”.
“Standards and monitoring” is the main “decrease the race” path I see. It doesn’t seem feasible to me for the world to clamp down on AI development unconditionally, which is why I am more focused on the conditional (i.e., “unless it’s demonstrably safe”) version.
But is there another “decrease the race” or “don’t make the race worse” intervention that you think can make a big difference? Based on the fact that you’re talking about a single thing that can help massively, I don’t think you are referring to “just don’t make things worse”; what are you thinking of?
> Staying at the frontier of capabilities and deploying leads the frontrunner to feel the heat, which accelerates both capabilities and the chances of careless deployment, which in turn increases the chances of extinction pretty substantially.
I agree that this is an effect, directionally, but it seems small by default in a setting with lots of players (I imagine there will be, and is, a lot of “heat” to be felt regardless of any one player’s actions). And the potential benefits seem big. My rough impression is that you’re confident the costs outweigh the benefits for nearly any imaginable version of this; if that’s right, can you give some quantitative or other sense of how you get there?
Thanks for the clarifications.
1. I think we agree that “unless it’s provably safe” is the best version of trying to get a policy slowdown.
2. I believe there are many interventions that could help on the slowdown side, most of which are unfortunately not compatible with the successful, careful AI lab. The main struggle a successful, careful AI lab encounters is that it has to trade off tons of safety principles along the way, essentially because it needs to attract investors & talent, and attracting investors & talent is hard if you say too loudly that we should slow down as long as our thing is not provably safe.
So de facto a successful, careful AI lab will be a force against a slowdown & against a bunch of other relevant policies in the policy world. It will also feed the perceived race, which makes things harder for every actor.
Other interventions for slowdown are mostly in the realm of public advocacy.
Drawing mostly upon the animal welfare activism playbook, you could use public campaigns, whether corporate or policy advocacy campaigns, to de facto limit labs’ ability to race.
I guess, heuristically, I tend to treat arguments of the form “but others would have done this bad thing anyway” with some skepticism, because they tend to assume too much certainty about the counterfactual, in part due to many second-order effects (e.g. the existence of one marginal key player increases the chances that more players invest, shows that competition is possible, etc.) that tend to be hard to compute (but are sometimes observable ex post).
In this specific case, I don’t think it’s right that there are “lots of players” close to the frontier. If we take the case of OA and Anthropic, for example, there are about 0 players at their level of deployed capabilities. Maybe Google will deploy at some point, but they haven’t been serious players for the past 7 months. So if Anthropic hadn’t been around, OA could have chilled longer at the ChatGPT level, and then at GPT-4 without plugins + code interpreter, without facing any threat. And now they’ll need to do something very impressive to answer the 100k context window, etc.
The compound effects of this are pretty substantial, because each new differentiating feature accelerates the whole field and pressures teams to find something new, causing a significantly more powerful race to the bottom.
If I had to be (vaguely) quantitative for the past 9 months, I’d guess that the existence of Anthropic has caused (/will cause, if we count the 100k thing) 2 significant counterfactual features and 3-5 months of timeline shortening (which will probably compound into more due to self-improvement effects). I’d guess there are other effects (e.g. pressure on compute, scaling to drive costs down, etc.) that I’m not able to give even vague estimates for.
My guess of 3-5 months is mostly driven by ChatGPT & GPT-4, which were both likely released earlier than they would have been without Anthropic.
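To make the compounding intuition concrete, here’s a toy sketch (purely illustrative; the feedback fraction and number of rounds are made-up placeholders, not estimates anyone in this thread gave) of how a few months of product-driven speedup could compound into a larger timeline shift:

```python
# Toy illustration only: how an initial product-driven speedup might compound
# if each round of feedback (more investment, more talent, AI helping AI R&D)
# adds some fraction on top of the current speedup. All parameters below are
# made-up placeholders.

def compounded_speedup(initial_months: float, feedback: float, rounds: int) -> float:
    """Total timeline shortening after `rounds` of multiplicative feedback."""
    total = initial_months
    for _ in range(rounds):
        total += total * feedback
    return total

if __name__ == "__main__":
    for feedback in (0.0, 0.05, 0.15):
        print(f"feedback={feedback:.2f}: 4 months -> "
              f"{compounded_speedup(4.0, feedback, rounds=5):.1f} months")
```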
Thanks for the response!
Re: your other interventions—I meant for these to be part of the “Standards and monitoring” category of interventions (my discussion of that mentions advocacy and external pressure as important factors).
I think it’s far from obvious that an AI company needs to be a force against regulation, both conceptually (if it affects all players, it doesn’t necessarily hurt the company) and empirically.
Thanks for giving your take on the size of speedup effects. I disagree on a number of fronts. I don’t want to get into the details of most of them, but will comment that it seems like a big leap from “X product was released N months earlier than otherwise” to “Transformative AI will now arrive N months earlier than otherwise.” (I think this feels fairly clear when looking at other technological breakthroughs and how much they would’ve been affected by differently timed product releases.)
I see. I guess where we might disagree is that IMO a productive social movement could want to apply Henry Spira’s playbook (overall pretty adversarial), oriented mostly towards slowing things down until labs have a clue of what they’re doing on the alignment front. I would guess you wouldn’t agree with that, but I’m not sure.
I’m not saying that it would be a force against regulation in general, but that it would be a force against any regulation that substantially slows down labs’ current rate of capabilities progress. And the empirics don’t demonstrate the opposite, as far as I can tell.
Labs have been pushing for the rule that we should wait for evals to say “it’s dangerous” before we consider what to do, rather than doing what most other industries do, i.e. assume something is dangerous until proven safe.
Most mentions of a slowdown describe it as potentially necessary at some point in the distant future, while most people in those labs have <5-year timelines.
Finally, on your conceptual point: as some have argued, it’s in fact probably not possible to affect all players equally without a drastic regime of control (which is a true downside of slowing down now, but IMO still much less bad than slowing down once a leak or a jailbreak of an advanced system can cause a large-scale engineered pandemic), because smaller actors will use the time to try to catch up as close as possible to the frontier.
I agree, but if anything my sense is that, due to various compounding effects (AI accelerating AI, investment, increased compute demand, and more talent arriving earlier), an earlier product release of N months gives only a lower bound on the TAI timeline shortening (hence greater than N). Moreover, I think the ChatGPT product release is, ex post at least, not in the typical product-release reference class. It was clearly a massive game changer for OpenAI and the entire ecosystem.
(Apologies for slow reply!)
I think an adversarial social movement could have a positive impact. I have tended to think of the impact as mostly being about getting risks taken more seriously and thus creating more political will for “standards and monitoring,” but you’re right that there could also be benefits simply from buying time generically for other stuff.
I said it’s “far from obvious” empirically what’s going on. I agree that discussion of slowing down has focused on the future rather than now, but I don’t think it has been pointing to a specific time horizon (the vibe looks to me more like “slow down at a certain capabilities level”).
It’s true that no regulation will affect everyone precisely the same way. But there is plenty of precedent for major industry players supporting regulation that generally slows things down (even when the dynamic you’re describing applies).
I don’t agree that we are looking at a lower bound here, bearing in mind that (I think) we are just talking about when ChatGPT was released (not when the underlying technology was developed), and that (I think) we should be holding fixed the release timing of GPT-4. (What I’ve seen in the NYT seems to imply that they rushed out functionality they’d otherwise have bundled with GPT-4.)
If ChatGPT had been held for longer, then:
- Scaling and research would have continued in the meantime. And even with investment and talent flooding in, I expect that there’s very disproportionate impact from players who were already active before ChatGPT came out, who were easily well capitalized enough to go at ~max speed for the few months between ChatGPT and GPT-4.
- GPT-4 would have had the same absolute level of impressiveness, revenue potential, etc. (as would some other things that I think have been important factors in bringing in investment, such as Midjourney). You could have a model like “ChatGPT maxed out hype such that the bottleneck on money and talent rushing into the field became calendar time alone,” which would support your picture; but you could also have other models, e.g. where the level of investment is more like a function of the absolute level of visible AI capabilities, such that the timing of ChatGPT mattered little, holding fixed the timing of GPT-4. I’d guess the right model is somewhere in between those two (a toy sketch of the contrast follows after this list); in particular, I’d guess that it matters a lot how high revenue is from various sources, and revenue seems to behave somewhere in between these two things (there are calendar-time bottlenecks, but absolute capabilities matter a lot too; and parallel progress on image generation seems important here as well).
- Attention from policymakers would’ve been more delayed; the more hopeful you are about slowing things down via regulation, the more you should think of this as an offsetting factor, especially since regulation may follow a purer “calendar-time-bottlenecked response to hype” model than research and scaling progress do.
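To make that contrast concrete, here is a minimal sketch of the two stylized investment models above (my own framing for this comment; the function shapes and numbers are illustrative assumptions, not data):

```python
# Minimal sketch of the two stylized investment models contrasted above.
# Shapes and numbers are illustrative assumptions, not data.

def calendar_time_model(months_since_first_hype: float, ramp_months: float = 12.0) -> float:
    """Investment (fraction of max) bottlenecked only by elapsed calendar time."""
    return min(1.0, max(0.0, months_since_first_hype / ramp_months))

def capability_level_model(visible_capability: float, saturating_level: float = 1.0) -> float:
    """Investment (fraction of max) driven by the most impressive visible capability."""
    return min(1.0, max(0.0, visible_capability / saturating_level))

if __name__ == "__main__":
    # Under the first model, shipping ChatGPT 4 months earlier shifts the whole
    # investment ramp 4 months earlier; under the second, it matters little as
    # long as GPT-4 (the higher visible capability) ships on the same date.
    print(calendar_time_model(months_since_first_hype=6.0))   # mid-ramp
    print(capability_level_model(visible_capability=0.8))     # near saturation
```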
(I also am not sure I understand your point about why it could be more than 3 months of speedup. All the factors you name seem like they will nearly inevitably happen somewhere between here and TAI (e.g., there will be releases and demos that galvanize investment, talent, etc.), so it’s not clear how speeding a bunch of these things up by 3 months speeds the whole thing up by more than 3 months, assuming there will be enough time for these things to matter either way.)
But more important than any of these points is that circumstances have (unfortunately, IMO) changed. My take on the “successful, careful AI lab” intervention was quite a bit more negative in mid-2022 (when I worried about exactly the kind of acceleration effects you point to) than when I did my writing on this topic in 2023 (at which point ChatGPT had already been released and the marginal further speedup of this kind of thing seemed a lot lower). Since I wrote this post, it seems like the marginal downsides have continued to fall, although I do remain ambivalent.
This is only true if we assume that there are little to no differences in which company takes the lead in AI, or in which types of AI are preferable. I think this is wrong: there are fairly massive differences between OpenAI or Anthropic winning the race to AGI and DeepMind winning it.
So I guess you’re first conditioning on alignment being solved when we win the race. Why do you think OpenAI/Anthropic are very different from DeepMind?
Noting that I don’t think alignment being “solved” is a binary. As discussed in the post, I think there are a number of measures that could improve our odds of getting early human-level-ish AIs to be aligned “enough,” even assuming no positive surprises on alignment science. This would imply that if lab A is more attentive to alignment and more inclined to invest heavily in even basic measures for aligning its systems than lab B, it could matter which lab develops very capable AI systems first.
I don’t exactly condition on alignment being solved. I instead point to a very important difference between OpenAI/Anthropic’s AI and DeepMind’s AI: OpenAI/Anthropic’s AI has a lot less incentive to develop instrumental goals, due to having way fewer steps between input and output, and its setup incentivizes constrained goals, compared to DeepMind, which uses RL, which essentially requires instrumental goals/instrumental convergence to do anything.
This is an important observation by porby, which I’d lossily compress to “instrumental goals/instrumental convergence is at best a debatable assumption for LLMs and non-RL AI, and may not be there at all.”
And this matters, because the assumption of instrumental convergence/power-seeking underlies basically all of the pessimistic analyses of AI, and arguably a supermajority of why AI is fundamentally dangerous, because instrumental convergence/power-seeking is essentially why AI safety is so difficult to achieve. LLMs/non-RL AI probably bypass all of the AI safety concerns that aren’t related to misuse or ethics, and this has massive implications. So massive that I covered them in their own post here:
https://www.lesswrong.com/posts/8SpbjkJREzp2H4dBB/a-potentially-high-impact-differential-technological
One big implication is obvious: OpenAI and Anthropic are much safer companies to win the AI race, relative to DeepMind, because the instrumental convergence/power-seeking issue probably doesn’t exist for their systems.
It also makes the initial alignment problem drastically easier: it becomes a non-adversarial problem that doesn’t need a security mindset, which makes the LLM/non-RL “AI alignment researcher” plan work, as described here:
https://openai.com/blog/our-approach-to-alignment-research
And it thus makes the whole problem easier, since we don’t need to worry much about the alignment of the first AI alignment researcher, resulting in a stable foundation for their recursive/meta alignment plan.
The fact that instrumental convergence/power-seeking/instrumental goals are at best debatable and probably absent is probably the biggest reason why I claim that the different companies differ fundamentally in the probability of extinction, with p(DOOM) radically reduced conditional on OpenAI or Anthropic winning the race, since their AIs have a very desirable safety property: no default incentive to develop instrumental goals/instrumental convergence/power-seeking.
I agree that there is a difference between a strong AI that has goals and one that is not an agent. This is the point I made here: https://www.lesswrong.com/posts/wDL6wiqg3c6WFisHq/gpt-as-an-intelligence-forklift
But this has less to do with the particular lab (e.g. DeepMind trained Chinchilla) and more with the underlying technology. If the path to stronger models goes through scaling up LLMs, then it does seem that they will be 99.9% non-agentic (measured in FLOPs; see https://www.lesswrong.com/posts/f8joCrfQemEc3aCk8/the-local-unit-of-intelligence-is-flops).
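To gesture at where a number like that could come from, here is a back-of-envelope sketch. The pretraining settings are the published Chinchilla ones (70B parameters, ~1.4T tokens); the RL fine-tuning token budget is a purely hypothetical placeholder, since such figures are generally not published, and the 6*N*D FLOP rule is only a rough approximation:

```python
# Back-of-envelope sketch of the "share of FLOPs that are plain next-token
# prediction" framing. Pretraining settings follow the published Chinchilla
# configuration; the RL fine-tuning token budget below is a hypothetical
# placeholder, not a reported number.

PARAMS = 70e9             # Chinchilla parameter count
PRETRAIN_TOKENS = 1.4e12  # Chinchilla pretraining tokens
RL_TOKENS = 1e9           # assumed fine-tuning token budget (illustrative)

# Rough ~6 * N * D approximation for training FLOPs.
pretrain_flops = 6 * PARAMS * PRETRAIN_TOKENS
rl_flops = 6 * PARAMS * RL_TOKENS

non_agentic_share = pretrain_flops / (pretrain_flops + rl_flops)
print(f"Non-agentic (pretraining) share of FLOPs: {non_agentic_share:.4%}")
# With these placeholder numbers the share comes out around 99.9%; the exact
# figure depends entirely on the assumed fine-tuning budget.
```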
You’re right that it is the technology that makes the difference, but my point is that specific companies focus more on specific technology paths to safe AGI. And OpenAI/Anthropic’s approach tends not to have instrumental convergence/power-seeking, compared to DeepMind’s, given that DeepMind focuses on RL, which essentially requires instrumental convergence. To be clear, I actually don’t think OpenAI/Anthropic’s path can get to AGI, but their alignment plans probably do work. And given that instrumental convergence/power-seeking is basically the reason why AI is more dangerous than standard technology, that is a very big difference between the companies rushing to AGI.
Thanks for the posts on non-agentic AGI.
My other point is that the non-existence of instrumental convergence/power-seeking even at really high scales, if true, has very, very large implications for the dangerousness of AI; consequently, basically everything has to change with respect to AI safety, given that it’s a foundational assumption of why AI is dangerous at all.