On the margin, we think more alignment researchers should work on “buying time” interventions instead of technical alignment research (or whatever else they were doing).
I’m quite a bit more pessimistic about having lots of people doing these approaches than you seem to be. In the abstract my concerns are somewhat similar to Habryka’s, but I think I can make them a lot more concrete given this post. The TL;DR is: (1) for half the things, I think they’re net negative if done poorly, and I think that’s probably the case on the current margin, and (2) for the other half of things, I think they’re great, and the way you accomplish them is by joining safety / governance teams at AI labs, which are already doing them and are in a much better position to do them than anyone else.
(When talking about industry labs here I’m thinking more about Anthropic and DeepMind—I know less about OpenAI, though I’d bet it applies to them too.)
Direct outreach to AGI researchers
Currently, I’d estimate there are ~50 people in the world who could make a case for working on AI alignment to me that I’d think wasn’t clearly flawed. (I actually ran this experiment with ~20 people recently, 1 person succeeded. EDIT: I looked back and explicitly counted—I ran it with at least 19 people, and 2 succeeded: one gave an argument for “AI risk is non-trivially likely”, another gave an argument for “this is a speculative worry but worth investigating” which I wasn’t previously counting but does meet my criterion above.) Those 50 people tend to be busy and in any case your post doesn’t seem to be directed at them. (Also, if we require people to write down an argument in advance that they defend, rather than changing it somewhat based on pushback from me, my estimate drops to, idk, 20 people.)
Now, even arguments that are clearly flawed to me could convince AGI researchers that AI risk is important. I tend to think that the sign of this effect is pretty unclear. On the one hand I don’t expect these researchers to do anything useful, partly because in my experience “person says AI safety is good” doesn’t translate into “person does things”, and partly because incorrect arguments lead to incorrect beliefs which lead to useless solutions. On the other hand maybe we’re just hoping for a general ethos of “AI risk is real” that causes political pressure to slow down AI.
But it really doesn’t seem great that my case for wide-scale outreach being good is “maybe if we create a mass delusion of incorrect beliefs that implies that AGI is risky, then we’ll slow down, and the extra years of time will help”. So overall my guess is that this is net negative.
(On my beliefs, which I acknowledge not everyone shares, expecting something better than “mass delusion of incorrect beliefs that implies that AGI is risky” if you do wide-scale outreach now is assuming your way out of reality.)
(Fwiw I do expect that there will be a major shift towards AI risk being taken more seriously, as AGI becomes more visceral to people, as outreach efforts continue, and as it becomes more of a culturally expected belief. I often view my job as trying to inject some good beliefs about AI risk among the oncoming deluge of beliefs about AI risk.)
Develop new resources that make AI x-risk arguments & problems more concrete
Seems good if done by one of the 20 people who can make a good argument without pushback from me. If you instead want this to be done on a wide scale I think you have basically the same considerations as above.
Seems probably net negative when done at a wide scale, as we’ll see demonstrations of “alignment failures” that aren’t actually related to the way I expect alignment failures to go, and then the most viral one (which won’t be the most accurate one) will be the one that dominates discourse.
Break and red team alignment proposals (especially those that will likely be used by major AI labs)
For the examples of work that you cite, my actual prediction is that they have had ~no effect on the broader ML community, but if they did have an effect, I’d predict that the dominant one is “wow these alignment folks have so much disagreement and say pretty random stuff, they’re not worth paying attention to”. So overall my take is that this is net-negative from the “buying time” perspective (though I think it is worth doing for other reasons).
Organize coordination events
I’m not seeing why any of the suggestions here are better than the existing strategy of “create alignment labs at industry orgs which do this sort of coordination”.
(But I do like the general goal! If you’re interested in doing this, consider trying to get hired at an industry alignment lab. It’s way easier to do this when you don’t have to navigate all of the confidentiality protocols because you’re a part of the company.)
I guess one benefit is that you can have some coordination between top alignment people who aren’t at industry labs? I’m much more keen on having those people just doing good alignment work, and coordinating with the industry alignment labs. This seems way more efficient.
Support safety and governance teams at major AI labs
Strongly in favor of the goal, but how do you do this other than by joining the teams?
Note that people should be aware of the risk that alignment-concerned people joining labs can lead to differential increases in capabilities, as reported here.
The linked article is about capabilities roles in labs, not safety / governance teams in labs. I’d guess that most people, including many of those 11 anonymous experts, would be pretty positive on having people join safety / governance teams in labs.
Develop and promote reasonable safety standards for AI labs
Sounds great! Seems like you should do it by joining the relevant teams at the AI labs, or at least having a lot of communication with them. (I think it’s way way harder to do outside of the labs because you are way less informed about what the constraints are and what standards would be feasible to coordinate on.)
You could do abstract research on safety standards with the hope that this turns into something useful a few years down the line. I’m somewhat pessimistic on this but much less confident in my pessimism here.
Currently, I’d estimate there are ~50 people in the world who could make a case for working on AI alignment to me that I’d think wasn’t clearly flawed. (I actually ran this experiment with ~20 people recently, 1 person succeeded.)
I wonder if this is because people haven’t optimised for being able to make the case. You don’t really need to be able to make a comprehensive case for AI risk to do productive research on AI risk. For example, I can chip away at the technical issues without fully understanding the governance issues, as long as I roughly understand something like “coordination is hard, and thus finding technical solutions seems good”.
Put differently: The fact that there are (in your estimation) few people who can make the case well doesn’t mean that it’s very hard to make the case well. E.g., for me personally, I think I could not make a case for AI risk right now that would convince you. But I think I could relatively easily learn to do so (in maybe one to three months???)
I agree you don’t need to have a comprehensive case for risk to do productive research on it, and overall I am glad that people do in fact work on relevant stuff without getting bogged down in ensuring they can justify every last detail.
I agree it’s possible that people could learn to make a good case. I don’t expect it, because I don’t expect most people to try to learn to make a case that would convince me. You in particular might do so, but I’ve heard of a lot of “outreach to ML researchers” proposals that did not seem likely to do this.
Not Rohin (who might disagree with me on what constitutes a “good” case) but I’ve also tried to do a similar experiment.
Besides the “why does RLHF not work” question, which is pretty tricky, another classic theme is people misciting the ML literature, or confidently citing papers that are outliers in the literature as if they were settled science. If you’re going to back up your claims with citations, it’s very important to get them right!
Why do you think that the number of people who could make a convincing case to you is so low?
Because I ran the experiment and very few people passed. (The extrapolation from that to an estimate for the world is guesswork.)
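As a purely illustrative aside (not something anyone in the thread computed): even if you charitably treat those 19 conversations as a random binomial sample of would-be arguers, the uncertainty on the underlying pass rate is already very wide, before you even get to the question of how unrepresentative the sample is of “the world”. A minimal sketch, assuming a uniform prior:

```python
# Toy sketch (illustrative only): treat "2 passes out of 19 conversations" as a
# binomial sample and look at how uncertain the underlying pass rate is.
from scipy.stats import beta

passes, trials = 2, 19
# Uniform Beta(1, 1) prior -> Beta(1 + passes, 1 + failures) posterior on the pass rate.
posterior = beta(1 + passes, 1 + (trials - passes))

lo, hi = posterior.ppf([0.025, 0.975])
print(f"posterior mean pass rate: {posterior.mean():.2f}")   # roughly 0.14
print(f"95% credible interval:    ({lo:.2f}, {hi:.2f})")     # roughly (0.03, 0.32)
```

So even under generous assumptions the data only pin the rate down to somewhere between a few percent and about a third, which is part of why the world-scale estimate is guesswork.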
Where do they normally mess up?
There are a lot of different arguments people give, which I dislike for different reasons, but one somewhat common theme was that their argument was not robust to “it seems like InstructGPT is basically doing what its users want when it is capable of it, why not expect scaled up InstructGPT to just continue doing what its users want?”
(And when I explicitly said something like that, they didn’t have a great response.)
Yeah… I suppose you could go through Evan Hubinger’s arguments in “How likely is deceptive alignment?”, but I imagine you’d probably have some further pushback which would be hard to answer.
I agree with you that one of the best ways to “buy time” is to join the alignment or governance teams at major AI labs (in part b/c of confidentiality agreements). I also agree that most things are easy to implement poorly by default. However, I think 1) comparative advantage is real; some people are a lot better at writing/teaching/exposition relative to research and vice versa, and 2) there are other ways to instantiate some of the proposals that aren’t literally just “Join OpenAI/Deepmind/Anthropic/etc”:
Direct outreach to AGI researchers
While I agree that most people are pretty bad at making the alignment case, I do think vibes matter! In particular, I think you’re underestimating the value of a ‘general ethos of “AI risk is real”’. (Though I still agree that the average direct outreach attempt will probably be slightly negative.)
Presumably, the way you’d do this is to work with one of the scaling labs?
Break and red team alignment proposals (especially those that will likely be used by major AI labs)
I think the reason most of these examples have failed is some combination of: 1) literally not addressing what people in labs are doing, 2) not being phrased in the right way (eg with ML/deep learning terminology), and 3) being published in venues that aren’t really visible to most people in labs?
I think 1) is the most concerning one—I’ve heard many people make informal arguments in favor of/against Jan’s RRM + Alignment research proposal, but I don’t think a serious critical analysis of that approach has been written up anywhere. Instead, a lot of effort was spent yelling at stuff like CHAI/CIRL. My guess is many people (~5-10 people I know) can write a good steelman/detailed explainer of Jan’s stuff and also critique it.
Organize coordination events
[...]
I guess one benefit is that you can have some coordination between top alignment people who aren’t at industry labs? I’m much more keen on having those people just doing good alignment work, and coordinating with the industry alignment labs. This seems way more efficient.
You can also coordinate top alignment people not at labs <> people at labs, etc. But I do agree that doing good alignment work is important!
However, I think 1) comparative advantage is real; some people are a lot better at writing/teaching/exposition relative to research and vice versa
Sure. Of the small number of people who can do any of these well, they should split them up based on comparative advantage. This seems orthogonal to my main claim (roughly: if you do this at a large scale then it starts becoming net negative due to lower quality).
I do think vibes matter! In particular, I think you’re underestimating the value of a ‘general ethos of “AI risk is real”’.
I very much agree that vibes matter! Do you have in mind some benefit other than the one I mentioned above:
But it really doesn’t seem great that my case for wide-scale outreach being good is “maybe if we create a mass delusion of incorrect beliefs that implies that AGI is risky, then we’ll slow down, and the extra years of time will help”.
(More broadly it increases willingness to pay an alignment tax, with “slowing down” as one example.)
Importantly, vibes are not uniformly beneficial. If the vibe is “AI systems aren’t robust and so we can’t deploy them in high-stakes situations” then maybe everyone coordinates not to let the AI control the nukes and ignores the people who are saying that we also need to worry about the generalist foundation models because it’s fine, those models aren’t deployed in high-stakes situations.
Presumably, the way you’d do this is to work with one of the scaling labs?
Sure, that could work. (Again my main claim is “you can’t usefully throw hundreds of people at this” and not “this can never be done well”.)
I think the reason most of these examples have failed is some combination of: 1) literally not addressing what people in labs are doing, 2) not being phrased in the right way (eg with ML/deep learning terminology), and 3) being published in venues that aren’t really visible to most people in labs?
I think 1) is the most concerning one—I’ve heard many people make informal arguments in favor of/against Jan’s RRM + Alignment research proposal, but I don’t think a serious critical analysis of that approach has been written up anywhere. Instead, a lot of effort was spent yelling at stuff like CHAI/CIRL. My guess is many people (~5-10 people I know) can write a good steelman/detailed explainer of Jan’s stuff and also critique it.
I’m confused. Are you trying to convince Jan or someone else? How does it buy time?
(I interpreted the OP as saying that you convince AGI researchers who are not (currently) working on safety. I think a good steelman + critique of RRM wouldn’t have much effect on that population, though I think it’s pretty plausible I’m wrong about that because the situation at OpenAI is different from DeepMind.)
You can also coordinate top alignment people not at labs <> people at labs, etc.
As a person at a lab I’m currently voting for less coordination of this sort, not more, but I agree that this is also a thing you can do. (As with everything else, my main claim is that this isn’t a scalable intervention.)
This seems orthogonal to my main claim (roughly: if you do this at a large scale then it starts becoming net negative due to lower quality).
Fair. I think I failed to address this point entirely.
I do think there’s a nonzero number of people who would not be that good at novel alignment research and would still be good at the tasks mentioned here, but I agree that there isn’t a scalable intervention here, or at least not more so than standard AI alignment research (especially when compared to some approaches like the brute-force mechanistic interp many people are doing).
(I interpreted the OP as saying that you convince AGI researchers who are not (currently) working on safety. I think a good steelman + critique of RRM wouldn’t have much effect on that population, though I think it’s pretty plausible I’m wrong about that because the situation at OpenAI is different from DeepMind.)
Yeah, I also messed up here—I think this would plausibly have little effect on that population. I do think that a good answer to “why does RLHF not work” would help a nonzero amount, though.
As a person at a lab I’m currently voting for less coordination of this sort, not more
Agree that it’s not scalable, but could you share why you’d vote for less?
Agree that it’s not scalable, but could you share why you’d vote for less?
Idk, it’s hard to explain—it’s the usual thing where there’s a gazillion things to do that all seem important and you have to prioritize anyway. (I’m just worried about the opportunity cost, not some other issue.)
I think the biggest part of coordination between non-lab alignment people and lab alignment people is making sure that people know about each other’s research; it mostly feels like the simple method of “share info through personal connections + reading posts and papers” is working pretty well right now. Maybe I’m missing some way in which this could be way better, idk.
My guess is most of the value in coordination work here is either in making posts/papers easier to write or ship, or in discovering new good researchers?
Those weren’t what I thought of when I read “coordination” but I agree those things sound good :)
Another good example would be better communication tech (e.g. the sort of thing that LessWrong / Alignment Forum aims for, although not those in particular because most lab people don’t use it very much).
I feel like most of the barrier in practice to people “coordinating” in the relevant ways is that people don’t know what other people are doing. And a big reason for this is that write-ups are really hard to produce, especially if you have high standards and don’t want to ship.
And yeah, better communication tech in general would be good, but I’m not sure how to start on that (while it’s pretty obvious what a few candidate steps toward making posts/papers easier to write/ship would look like?)
I agree it’s not clear what to do on better communication tech.
I feel like most of the barrier in practice to people “coordinating” in the relevant ways is that people don’t know what other people are doing. And a big reason for this is that write-ups are really hard to produce, especially if you have high standards and don’t want to ship.
Idk, a few years ago I would have agreed with you, but now my impression is that people mostly don’t read things and instead talk to each other for this purpose. I wouldn’t really expect that to change with more writing, unless the writing is a lot better?
(I do think that e.g. mech interp researchers read each other’s mech interp papers, though my impression from the outside is that they also often hear about each other’s results well before they’re published. Similarly for scalable oversight.)
(On my beliefs, which I acknowledge not everyone shares, expecting something better than “mass delusion of incorrect beliefs that implies that AGI is risky” if you do wide-scale outreach now is assuming your way out of reality.)
I’m from the future, January 2024, and you get some Bayes Points for this!
The “educated savvy left-leaning online person” consensus (as far as I can gather) is something like: “AI art is bad, the real danger is capitalism, and the extinction danger is some kind of fake regulatory-capture hype techbro thing which (if we even bother to look at the LW/EA spaces at all) is adjacent to racists and cryptobros”.
Still seems too early to tell whether or not people are getting lots of false beliefs that are still pushing them towards believing-AGI-is-an-X-risk, especially since that case seems to be made (in the largest platform) indirectly in congressional hearings that nobody outside tech/politics actually watches.
But it really doesn’t seem great that my case for wide-scale outreach being good is “maybe if we create a mass delusion of incorrect beliefs that implies that AGI is risky, then we’ll slow down, and the extra years of time will help”. So overall my guess is that this is net negative.
To devil’s steelman some of this: I think there’s still an angle that few have tried in a really public way, namely ignorance and asymmetry. (There is definitely a better term or two for what I’m about to describe, but I forgot it. Probably from Taleb or something.)
A high percentage of voting-eligible people in the US… don’t vote. An even higher percentage vote in only the presidential elections, or only some presidential elections. I’d bet a lot of money that most of these people aren’t working under a Caplan-style non-voting logic, but instead under something like “I’m too busy” or “it doesn’t matter to me / either way / from just my vote”.
Many of these people, being politically disengaged, would not be well-informed about political issues (or even have strong and/or coherent values related to those issues). What I want to see is an empirical study that asks these people “are you aware of this?” and “does that awareness, in turn, factor into you not-voting?”.
I think there’s a world, which we might live in, where lots of non-voters believe something akin to “Why should I vote, if I’m clueless about it? Let the others handle this lmao, just like how the nice smart people somewhere make my bills come in.”
In a relevant sense, I think there’s an epistemically-legitimate and persuasive way to communicate “AGI labs are trying to build something smarter than humans, and you don’t have to be an expert (or have much of a gears-level view of what’s going on) to think this is scary. If our smartest experts still disagree on this, and the mistake-asymmetry is ‘unnecessary slowdown VS human extinction’, then it’s perfectly fine to say ‘shut it down until [someone/some group] figures out what’s going on’”.
To be clear, there’s still a ton of ways to get this wrong, and those who think otherwise are deluding themselves out of reality. I’m claiming that real-human-doable advocacy can get this right, and it’s been mostly left untried.
EXTRA RISK NOTE: Most persuasion, including digital, is one-to-many “broadcast”-style; “going viral” usually just means “a broadcast happened that nobody recognized as a broadcast”, like an algorithm suggesting a video to a lot of people at once. Given this, plus anchoring bias, you should expect and be very paranoid about the “first thing people hear = sets the conversation” effect. (Think of how many people’s opinions are copypasted from the first “classy video essay”, i.e. mass-market John Oliver video, they saw about the subject, or the first Fox News commentary on it.)
Not only does the case for X-risk need to be made first, but it needs to be right (even in a restricted way like my above suggestion) the first time. Actually, that’s another reason why my restricted-version suggestion should be prioritized, since it’s more-explicitly robust to small issues.
(If somebody does this in real life, you need to clearly end on something like “Even if a minor detail like [name a specific X] or [name a specific Y] is wrong, it doesn’t change the underlying danger, because the labs are still working towards Earth’s next intelligent species, and there’s nothing remotely strong about the ‘safety’ currently in place.”)