Thomas Kwa comments on Prometheus’s Shortform

Thomas Kwa 4 Jun 2024 14:14 UTC
4 points
−31
I suspect that working on capabilities (edit: preferably applications rather than building AGI) in some non-maximally-harmful position is actually the best choice for most junior x-risk concerned people who want to do something technical. Safety is just too crowded and still not very tractable.
- Thomas Kwa 4 Jun 2024 17:23 UTC
  5 points
  −4
  Since many people seem to disagree, I’m going to share some reasons why I believe this:
  - AGI that poses serious existential risks seems at least 6 years away, and safety work seems much more valuable at crunch time, such that I think more than half of most people’s impact will be more than 5 years away. So skilling up quickly should be the primary concern, until your timelines shorten.
  - Mentorship for safety is still limited. If you can get an industry safety job or get into MATS, this seems better than some random AI job, but most people can’t.
    I think there are also many policy roles that are not crowded and potentially time-critical, which people should also take if available.
  - Funding is also limited in the current environment. I think most people cannot get funding to work on alignment if they tried? This is fairly cruxy and I’m not sure of it, so someone should correct me if I’m wrong.
  - The relative impact of working on capabilities is smaller than working on alignment—there are still 10x as many people doing capabilities as alignment, so unless returns don’t diminish or you are doing something unusually harmful, you can work for 1 year on capabilities and 1 year on alignment and only reduce your impact 10%.
    However, it seems bad to work at places that have particularly bad safety cultures and are explicitly trying to create AGI, possibly including OpenAI.
    Safety could get even more crowded, which would make upskilling to work on safety net negative. This should be a significant concern, but I think most people can skill up faster than this.
  - Skills useful in capabilities are useful for alignment, and if you’re careful about what job you take there isn’t much more skill penalty in transferring them than, say, switching from vision model research to language model research.
  - Capabilities often has better feedback loops than alignment because you can see whether the thing works or not. Many prosaic alignment directions also have this property. Interpretability is getting there, but not quite. Other areas, especially in agent foundations, are significantly worse.
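The "only reduce your impact 10%" estimate in the list above can be made concrete with a toy calculation. This is a minimal sketch, not the commenter's own model: it assumes the marginal value of one person-year in a field scales as 1 / (number of people in that field), and that capabilities work counts purely as harm. The function names `marginal_value` and `net_impact_ratio` are hypothetical, introduced only for illustration.

```python
# Toy model (an assumption, not taken from the thread): the marginal value of
# one extra person-year in a field is inversely proportional to field size.

def marginal_value(field_size: float) -> float:
    """Value of one extra person-year under assumed 1/n diminishing returns."""
    return 1.0 / field_size


def net_impact_ratio(capabilities_to_alignment_ratio: float) -> float:
    """Net value of (1 year capabilities + 1 year alignment), expressed as a
    fraction of the value of the alignment year alone.

    Capabilities work is treated as pure harm of the same kind as the
    alignment benefit, which is itself a simplifying assumption.
    """
    alignment_size = 1.0  # normalize the alignment field to size 1
    capabilities_size = capabilities_to_alignment_ratio * alignment_size
    benefit = marginal_value(alignment_size)
    harm = marginal_value(capabilities_size)
    return (benefit - harm) / benefit


if __name__ == "__main__":
    # With 10x as many people in capabilities as in alignment:
    print(f"{net_impact_ratio(10.0):.2f}")
```

Under these assumptions the capabilities year cancels about a tenth of the alignment year. If returns do not diminish, or if the capabilities work is unusually harmful, the ratio no longer holds, which matches the caveats in the comment.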
  - mesaoptimizer 4 Jun 2024 22:24 UTC
    7 points
    0
    These are pretty sane takes (conditional on my model of Thomas Kwa of course), and I don’t understand why people have downvoted this comment. Here’s an attempt to unravel my thoughts and potential disagreements with your claims.
    
    AGI that poses serious existential risks seems at least 6 years away, and safety work seems much more valuable at crunch time, such that I think more than half of most people’s impact will be more than 5 years away.
    
    I think safety work gets less and less valuable at crunch time actually. I think you have this Paul Christiano-like model of getting a prototypical AGI and dissecting it and figuring out how it works—I think it is unlikely that any individual frontier lab would perceive itself to have the slack to do so. Any potential “dissection” tools will need to be developed beforehand, such as scalable interpretability tools (SAEs seem like rudimentary examples of this). The problem with “prosaic alignment” IMO is that a lot of this relies on a significant amount of schlep—a lot of empirical work, a lot of fucking around. That’s probably why, according to the MATS team, frontier labs have a high demand for “iterators”—their strategy involves having a lot of ideas about stuff that might work, and without a theoretical framework underlying their search path, a lot of things they do would look like trying things out.
    
    I expect that once you get AI-researcher-level systems, the die is cast. Whatever prosaic alignment and control measures you’ve figured out, you’ll now be using that in an attempt to play this breakneck game of getting useful work out of a potentially misaligned AI ecosystem, that would also be modifying itself to improve its capabilities (because that is the point of AI researchers). (Sure, it’s easier to test for capability improvements. That doesn’t mean you can’t transfer information embedded into these proposals such that modified models will be modified in ways the humans did not anticipate or would not want if they had a full understanding of what is going on.)
    
    Mentorship for safety is still limited. If you can get an industry safety job or get into MATS, this seems better than some random AI job, but most people can’t.
    
    Yeah—I think most “random AI jobs” are significantly worse for trying to do useful work in comparison to just doing things by yourself or with some other independent ML researchers. If you aren’t in a position to do this, however, it does make sense to optimize for a convenient low-cognitive-effort set of tasks that provides you the social, financial and/or structural support that will benefit you, and perhaps look into AI safety stuff as a hobby.
    
    I agree that mentorship is a fundamental bottleneck to building mature alignment researchers. This is unfortunate, but it is the reality we have.
    
    Funding is also limited in the current environment. I think most people cannot get funding to work on alignment if they tried? This is fairly cruxy and I’m not sure of it, so someone should correct me if I’m wrong.
    
    Yeah, post-FTX, I believe that funding is limited enough that you have to be consciously optimizing for getting funding (as an EA-affiliated organization, or as an independent alignment researcher). Particularly for new conceptual alignment researchers, I expect that funding is drastically limited since funding organizations seem to explicitly prioritize funding grantees who will work on OpenPhil-endorsed (or to a certain extent, existing but not necessarily OpenPhil-endorsed) agendas. This includes stuff like evals.
    
    The relative impact of working on capabilities is smaller than working on alignment: there are still 10x as many people doing capabilities as alignment, so unless returns don’t diminish or you are doing something unusually harmful, you can work for 1 year on capabilities and 1 year on alignment and only reduce your impact 10%.
    
    This is a very Paul Christiano-like argument—yeah sure the math makes sense, but I feel averse to agreeing with this because it seems like you may be abstracting away significant parts of reality and throwing away valuable information we already have.
    
    Anyway, yeah I agree with your sentiment. It seems fine to work on non-SOTA AI / ML / LLM stuff and I’d want people to do so such that they live a good life. I’d rather they didn’t throw themselves into the gauntlet of “AI safety” and get chewed up and spit out by an incompetent ecosystem.
    
    Safety could get even more crowded, which would make upskilling to work on safety net negative. This should be a significant concern, but I think most people can skill up faster than this.
    
    I still don’t understand what causal model would produce this prediction. Here’s mine: One big limiting factor on the number of safety researchers the current SOTA lab ecosystem can handle is the labs’ expectations for how many researchers they want or need. On one hand, more schlep during the pre-AI-researcher era means more hires. On the other hand, more hires require more research managers or managerial experience. Anecdotally, it seems like many AI capabilities and alignment organizations (both in the EA space and in the frontier lab space) have historically been bottlenecked on management capacity. Additionally, hiring has a cost (both the search process and the onboarding), and it is likely that as labs get closer to creating AI researchers, they’d believe that the opportunity cost of hiring continues to increase.
    
    Skills useful in capabilities are useful for alignment, and if you’re careful about what job you take there isn’t much more skill penalty in transferring them than, say, switching from vision model research to language model research.
    
    Nah, I found very little stuff from my vision model research work (during my undergrad) contributed to my skill and intuition related to language model research work (again during my undergrad, both around 2021-2022). I mean, specific skills of programming and using PyTorch and debugging model issues and data processing and containerization—sure, but the opportunity cost is ridiculous when you could be actually working with LLMs directly and reading papers relevant to the game you want to play. High quality cognitive work is extremely valuable and spending it on irrelevant things like the specifics of diffusion models (for example) seems quite wasteful unless you really think this stuff is relevant.
    
    Capabilities often has better feedback loops than alignment because you can see whether the thing works or not. Many prosaic alignment directions also have this property. Interpretability is getting there, but not quite. Other areas, especially in agent foundations, are significantly worse.
    
    Yeah this makes sense for extreme newcomers. If someone can get a capabilities job, however, I think they are doing themselves a disservice by playing the easier game of capabilities work. Yes, you have better feedback loops than alignment research / implementation work. That’s like saying “Search for your keys under the streetlight because that’s where you can see the ground most clearly.” I’d want these people to start building the epistemological skills to thrive even with a lower intensity of feedback loops such that they can do alignment research work effectively.
    
    And the best way to do that is to actually attempt to do alignment research, if you are in a position to do so.
    - ryan_greenblatt 5 Jun 2024 3:20 UTC
      4 points
      0
      
      I think safety work gets less and less valuable at crunch time actually. [...] Whatever prosaic alignment and control measures you’ve figured out, you’ll now be using that in an attempt to play this breakneck game of getting useful work out of a potentially misaligned AI ecosystem
      
      Sure, but you have to actually implement these alignment/control methods at some point? And likely these can’t be (fully) implemented far in advance. I usually use the term “crunch time” in a way which includes the period where you scramble to implement in anticipation of the powerful AI.
      
      One (oversimplified) model is that there are two trends:
      
      - Implementation and research on alignment/control methods becomes easier because of AIs (as test subjects).
      - AIs automate away work on alignment/control.
      
      Eventually, the second trend implies that safety work is less valuable, but probably safety work has already massively gone up in value by this point.
      
      (Also, note that the default way of automating safety work will involve large amounts of human labor for supervision, either due to issues with AIs or because of lack of trust in these AI systems, e.g. human labor is needed for a control scheme.)
  - Jozdien 4 Jun 2024 18:07 UTC
    3 points
    1
    My biggest reason for disagreeing (though there are others) is thinking that people often underestimate the effects that your immediate cultural environment has on your beliefs over time. I don’t think humans have the kind of robust epistemics necessary to fully combat changes in priors from prolonged exposure to something. For example, I know someone who was negative on ads, joined Google, and found they were becoming more positive on them without this coming from or leading to any object-level changes in their views; their attitude then shifted back after they left.
  - Chris_Leong 4 Jun 2024 23:05 UTC
    2 points
    0
    Why not work in a role related to application, though?
    - Thomas Kwa 5 Jun 2024 0:39 UTC
      2 points
      0
      Application seems better than other kinds of capabilities. I was thinking of capabilities as everything other than alignment work, so inclusive of application.
      - Chris_Leong 5 Jun 2024 1:26 UTC
        4 points
        7
        I would suggest that it’s important to make this distinction to avoid sending people off down the wrong path.