I’m sympathetic to ‘a high fraction of “alignment/safety” work at AI companies is done due to commercial incentives and has negligible effect on AI takeover risk (or at least much smaller effects than work which isn’t influenced by commercial incentives)’.
I also think a decent number of ostensibly AI-x-risk-focused people end up being influenced by commercial incentives, sometimes knowingly and directly (my work will go into prod if it is useful and this will be good for XYZ reason; my work will get more support if it is useful; it is good if the AI company I work for is more successful/powerful, so I will do work which is commercially useful) and sometimes unknowingly or indirectly (a direction gets more organizational support because of its usefulness; people are misled into thinking something is more x-risk-helpful than it actually is).
(And a bunch of originally AI x-risk focused people end up working on things which they would agree aren’t directly useful for mitigating x-risk, but have some more complex story.)
I also think AI companies are generally a bad epistemic environment for x-risk-focused safety work, for various reasons.
However, I’m quite skeptical that (mechanistic) interpretability research in particular gets much more funding due to it directly being a good commercial bet (as in, it is worth it because it ends up being directly commercially useful). And my guess is that alignment/safety people at AI companies who are ostensibly focused on x-risk/AI takeover prevention are less than half funded via the direct commercial case. (I won’t justify this here, but I think this because of a combination of personal experience and thinking about what these teams tend to work on.)
(That said, I think putting some effort into (mech) interp (or something similar) might end up being a decent commercial bet via direct usage, though I’m skeptical.)
I think there are some adjacent reasons alignment/safety work might be funded/encouraged at AI companies beyond direct commercial usage:
Alignment/safety teams or work might be subsidized via being good internal/external PR for AI companies. As in, good PR to all of: the employees who care about safety, the AI-safety-adjacent community you recruit from, and the broader public. I think most of this effect is probably about appeasing internal/external people who might care about safety rather than about the broader public.
Having an internal safety team might help reduce/avoid regulatory costs for your company or the industry, both via helping with compliance and via reducing the regulatory burden itself.
I think the primary commercial incentive on mechanistic interpretability research is that it’s the alignment research that best provides the training and education needed to become a standard ML engineer who can then contribute to commercial objectives. I am quite confident that a non-trivial fraction of young alignment researchers are going into mech-interp because it also gets them a lot of career capital as a standard ML engineer.
Is your claim here that a major factor in why Anthropic and GDM do mech interp is to train employees who can later be commercially useful? I’m skeptical of this.
Maybe the claim is that many people go into mech interp so they can personally skill up and later might pivot into something else (including jobs which pay well)? This seems plausible/likely to me, though it is worth noting that this is a pretty different argument with very different implications from the one in the post.
Yep, I am saying that the supply of mech-interp alignment researchers is plentiful because the career capital is much more fungible with extremely well-paying ML jobs, and that Anthropic and GDM seem interested in sponsoring things like mech-interp MATS streams or other internships and junior positions because those fit neatly into their existing talent pipeline, they know how to evaluate that kind of work, and they think those hires are also more likely to convert into people working on capabilities work.
I’m pretty skeptical that Neel’s MATS stream is partially supported/subsidized by GDM’s desire to generally hire for capabilities. (And I certainly don’t think they directly fund this.) Same for other mech interp hiring at GDM: I doubt that anyone is thinking “these mech interp employees might convert into employees for capabilities”. That said, this sort of thinking might subsidize the overall alignment/safety team at GDM to some extent, but I think this would mostly be a mistake for the company.
Seems plausible that this is an explicit motivation for junior/internship hiring on the Anthropic interp team. (I don’t think the Anthropic interp team has a MATS stream.)
Neel seems to have a somewhat unique amount of freedom, so I have less of a strong take there, but I am confident that GDM would be substantially less excited about its employees taking time off to mentor a bunch of people if the work they were doing produced artifacts that were substantially less well-respected by the ML crowd, or did not look like it demonstrated the kind of skills that are indicative of good ML engineering capability.
(I think random (non-leadership) GDM employees generally have a lot of freedom, while employees of other companies have much less in-practice freedom (except maybe longer-tenured OpenAI employees, who I think have a lot of freedom).)
(My sense is this changed a lot after the DeepMind/Google Brain merger and ChatGPT, and the modern GDM seems to give people a lot less slack in the same way, though you are probably still directionally correct.)
(Huh, good to know this changed. I wasn’t aware of this.)
Why are you confident it’s not the other way around? People who decide to pursue alignment research may have prior interest or experience in ML engineering that drives them towards mech-interp.
Agree with this, and I wanted to add that, based on my experience and understanding, I am also not completely sure mechanistic interpretability is a good “commercial bet” yet, where by “commercial bet” I mean something that materializes revenue or is directly revenue generating.
One revenue-generating path I can see for LLMs is using interpretability to identify the data that is most effective for particular benchmarks. But my current understanding (correct me if I am wrong) is that, for now, it is relatively costly to first research a reliable method and then run interpretability methods on large models; additionally, researchers generally already have good intuitions about which datasets would be useful for specific benchmarks. On the other hand, the methods would be much more useful for looking into nuanced and hard-to-tackle safety problems. In fact, there have been a lot of previous efforts to use interpretability for safety mitigations.