Beautiful. Thank you for applying your considerable skills to this task.
A few thoughts on directions:
Respectability may be overrated, but credibility is not, and we really don’t want to blow it.
I very much agree with the strategy of moving slowly to conserve credibility.
The policy of stating what you mean is also very good for credibility. It is foreign to most public discourse, in which speaking to persuade rather than inform is the norm. I hope a statement something like the following could be made:
“we come from an intellectual tradition in which speaking to inform rather than persuade is the norm. Exaggerating one’s claims might help in the short term, but it will cost credibility and confuse everyone in the longer term. So we try to say just what we believe, and try to point to our reasons for believing it so that everyone can judge for themselves.”
I’m not sure how useful that would be, but my perception is that many people have some affinity for truth and rationality, so making that foundational claim and then trying to follow it may be appealing and be taken seriously by enough people to matter.
Working within that aspirational goal, I think it’s important to practice epistemic modesty.
From what I’ve seen of the public discourse, overstating one’s certainty of the dangers of AGI and the difficulty of alignment is a substantial drain on the credibility of the movement.
It is both more reasonable and more effective to say “alignment might be very difficult, so we should have a very sound safety case before actually building agentic systems smarter than humans” rather than exaggerating our collective knowledge by saying “if we build it we all die”. Saying that and exaggerating our certainty opens up a line of distracting counterattack in which the “doomers” are mocked for either foolishly overestimating their knowledge, or overstating their case as a deliberate deception.
It should be quite adequate and vastly more credible to say “if we build it without knowing, humanity may well not survive for long. And the people building it will not stop in time to know if it’s safe, if we do not demand they proceed safely. Move fast and break things is an exciting motto, but it isn’t a reasonable way to approach the next stage of evolution.”
And that does not exceed our true collective epistemic uncertainty. I have my own informed and developed opinion about the likely difficulty of alignment. But looking at the range of expert opinions, I feel it is most fair and modest to say we do not yet know.
Proceeding toward potential doom without knowing the risks is a catastrophic error born from excitement (from the optimistic) and apathy (from the cautious).
(I work at MIRI but views are my own)
I don’t think ‘if we build it we all die’ requires that alignment be hard [edit: although it is incompatible with alignment by default]. It just requires that our default trajectory involves building ASI before solving alignment (and, looking at our present-day resource allocation, this seems very likely to be the world we are in, conditional on building ASI at all).
[I want to note that I’m being very intentional when I say “ASI” and “solving alignment” and not “AGI” and “improving the safety situation”]
ASI alignment could be trivial (happens by default), easy, or hard. If it is trivial, then “if we build it, we all die” is false.
Separately, I don’t buy that a misaligned ASI with totally alien goals that fully takes over will certainly kill everyone, due to[1] trade arguments like this one. I also think it’s plausible that such an AI will be at least very slightly kind, such that it is willing to spend a tiny amount of resources keeping humans alive if this is cheap. Thus, “the situation is well described as ‘we all die’ conditional on a misaligned ASI with almost totally alien goals that fully takes over” seems more like 25-50% likely to me (and in some of these scenarios, not literally everyone dies; maybe a tiny fraction of humans survive but have negligible astronomical autonomy and control).
This presupposes that we’re in base reality rather than in a simulation designed for some purpose relating to human civilization. The situation differs substantially in the simulation case.
Good point—what I said isn’t true in the case of alignment by default.
Edited my initial comment to reflect this.
It’s not true if alignment is easy, too, right? My timelines are short, but we do still have a little time to do alignment work. And the orgs are going to do a little alignment work. I wonder if there’s an assumption here that OpenAI and co don’t even believe that alignment is a concern? I don’t think that’s true; I do think they probably dramatically underrate x-risk dangers because of incentive-driven biases, but they do seem to appreciate the basic arguments.
And I expect them to get a whole lot more serious about it once they’re staring a capable agent in the face. It’s one thing to dismiss the dangers of tigers from a distance, another when there’s just a fence between you and it. I think proximity is going to sharpen everyone’s thinking a good bit by inspiring them to spend more time thinking about the dangers.
Your version of events requires a change of heart (for ‘them to get a whole lot more serious’). I’m just looking at the default outcome. Whether alignment is hard or easy (although not if it’s totally trivial), it appears to be progressing substantially more slowly than capabilities (and the parts of it that are advancing are the most capabilities-synergizing, so it’s unclear what the oft-lauded ‘differential advancement of safety’ really looks like).
I consider at least a modest change of heart to be the default.
And I think it’s really hard to say how fast alignment is progressing relative to capabilities. If by “alignment” you mean formal proofs of safety, then definitely we’re not on track. But there’s a real chance that we don’t need those. We are training networks to follow instructions, and it’s possible that this weak type of tool “alignment” can be leveraged into true agent alignment for instruction-following or corrigibility. If so, we will have solved AGI alignment. That would give us superhuman help solving ASI alignment, and the “societal alignment” problem of surviving intent-aligned AGIs with different masters.
This seems like the default for how we’ll try to align AGI. We don’t know if it will work.
When I get MIRI-style thinkers to fully engage with this set of ideas, they tend to say “hm, maybe”. But I haven’t gotten enough engagement to have any confidence. Prosaic-alignment and LLM-focused thinkers usually aren’t engaging with the hard problems of alignment that crop up when we hit fully autonomous AGI entities, like strong optimization’s effects on goal misgeneralization, and reflection- and learning-based alignment shifts. And almost nobody is thinking that far ahead in societal coordination dynamics.
So I’d really like to see agent foundations and prosaic alignment thinking converge on the types of LLM-based AGI agents we seem likely to get in the near future. We just really don’t know if we can align them or not, because we just really haven’t thought about it deeply yet.
Links to all of those ideas in depth can be found in a couple link hops from my recent, brief Intent alignment as a stepping-stone to value alignment.
The people actually building AGI very publicly disagree with the claim that we are not on track to solve alignment before building AGI. So do many genuine experts. For instance, I strongly disagree with Pope and Belrose’s “AI is easy to control”, but it’s sitting right there in public, and it’s hard to claim they’re not actually experts.
And I just don’t see why you’d want to fight that battle.
I’d say it’s probably pointless to use the higher probability; an estimated 50% chance of everyone dying on the current trajectory seems like plenty to alarm people. That’s vaguely what we’d get if we said “some experts think 99%, others think 1%, so we collectively just don’t know”.
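To spell out the arithmetic behind that “vaguely 50%” (a minimal illustration using only the two endpoint figures quoted above, and assuming we take a simple unweighted average; any real aggregation of expert opinion would be weighted and messier):

$$\frac{0.99 + 0.01}{2} = 0.50$$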
Stating MIRI’s collective opinion instead of a reasonable statement of the consensus is unnecessary and costs you credibility.
To put it another way: someone who uses their own estimate instead of stating the range of credible estimates is less trustworthy on average to speak for a broad population. They’re demonstrating a blinkered, insular style of thinking. The public wants a broad view guiding public policy.
And in this case I just don’t see why you’d take that credibility hit.
Edit: having thought about it a little more, I do actually think that some people would accept a 50% chance of survival and say “roll the dice!”. That’s largely based on the wildly exaggerated fears of civilizational collapse from global warming. And I think that, if they expressed those beliefs clearly, the majority of humanity would still say “wait what that’s insane, we have to make progress on alignment before launching AGI”.
I do mean ASI, not AGI. I know Pope + Belrose also mean to include ASI in their analysis, but it’s still helpful to me if we just use ASI here, so I’m not constantly wondering if you’ve switched to thinking about AGI.
Obligatory ‘no really, I am not speaking for MIRI here.’
My impression is that MIRI is not trying to speak for anyone else. Representing the complete scientific consensus is an undue burden to place on an org that has not made that claim about itself. MIRI represents MIRI, and is one component voice of the ‘broad view guiding public policy’, not its totality. No one person or org is in the chair with the lever; we’re all just shouting what we think in directions we expect the diffuse network of decision-makers to be sitting in, with more or less success. It’s true that ‘claiming to represent the consensus’ is a tack one can take to appear authoritative, and not (always) a dishonest move. To my knowledge, this is not MIRI’s strategy. This is the strategy of, e.g., the CAIS letter (although not of CAIS as a whole!), and occasionally AIS orgs cite expert consensus or specific, otherwise-disagreeing experts as having directional agreement with the org (for an extreme case, see Yann LeCun shortening his timelines). This is not the same as attempting to draw authority from the impression that one’s entire aim is simply ‘sharing consensus.’
And then my model of Seth says ‘Well we should have an org whose entire strategy is gathering and sharing expert consensus, and I’m disappointed that this isn’t MIRI, because this is a better strategy,’ or else cites a bunch of recent instances of MIRI claiming to represent scientific consensus (afaik these don’t exist, but it would be nice to know if they do). It is fair for you to think MIRI should be doing a different thing. Imo MIRI’s history points away from it being a good fit to take representing scientific consensus as its primary charge (and this is, afaict, part of why AI Impacts was a separate project).
I think MIRI comms are by and large well sign-posted to indicate ‘MIRI thinks x’ or ‘Mitch thinks y’ or ‘Bengio said z.’ If you think a single org should build influence and advocate for a consensus view then help found one, or encourage someone else to do so. This just isn’t what MIRI is doing.
I thought the point of this post was that MIRI is still developing its comms strategy, and one criterion is preserving credibility. I really hope they’ll do that. It’s not violating rationalist principles to talk about beliefs beyond your own.
You’re half right about what I think. I want to live, so I want MIRI to do a good job of comms. Lots of people are shouting their own opinion. I assumed MIRI wanted to be effective, not just shout along with the chorus.
MIRI wouldn’t have to do a bit of extra work to do what I’m suggesting. They’d just have to note their existing knowledge of the (lack of) expert consensus, instead of just giving their own opinion.
You haven’t really disagreed that that would be more effective.
To put it this way: people (largely correctly) believe that MIRI’s beliefs are a product of one guy, EY. Citing more than one guy’s opinions is way more credible, no matter how expert that guy—and it avoids arguing about who’s more expert.
Oh, I feel fine about saying ‘draft artifacts currently under production by the comms team ever cite someone who is not Eliezer, including experts with a lower p(doom)’ which, based on this comment, is what I take to be the goalpost. This is just regular coalition signaling though and not positioning yourself as, terminally, a neutral observer of consensus.
“You haven’t really disagreed that [claiming to speak for scientific consensus] would be more effective.”
That’s right! I’m really not sure about this. My experience has been that ~every take someone offers to normies in policy is preceded by ‘the science says…’, so maybe the market is kind of saturated here. I’d also worry that precommitting to only argue in line with the consensus might bind you to act against your beliefs (and I think EY et al have valuable inside-view takes that shouldn’t be stymied by the trends of an increasingly-confused and poisonous discourse). That something is a local credibility win (I’m not sure if it is, actually) doesn’t mean it’s got the best nth order effects among all options long-term (including on the dimension of credibility).
I believe that Seth would find messaging that did this more credible. I think ‘we’re really not sure’ is a bad strategy if you really are sure, which MIRI leadership, famously, is.
MIRI leadership is famously sure, and I think very wrong about how sure they ought to be. That’s my concern. It’s obvious to any rationalist that it’s not rational to put >99% credence in something that’s highly theoretical. It’s almost certainly epistemic hubris, if not outright foolishness.
I have immense respect for EY’s intellect. He seems to be the smartest human I’ve engaged with enough to judge their intellect. On this point he is wrong, either obviously or at least seemingly so. I have personally spent at least a hundred hours following his specific logic (and lots more on the background knowledge it’s based on), and I’m personally quite sure he’s overestimating his certainty. His discussions with other experts always end up falling back on differing intuitions. He got there first, but a bunch of us have now put real time into following and extending his logic.
I have a whole theory on how he wound up so wrong, involving massive frustration and underappreciating how biased people are toward short-term thinking and motivated reasoning, but that’s beside the point.
Whether he’s right doesn’t really matter; what matters is that >99.9% doom sounds crazy, and it takes a really complex argument to show that it even could be right, let alone that it actually is.
Since it sounds crazy, leaning on that point is the very best way to harm MIRI’s credibility. And because they are one of the most publicly visible advocates of AGI x-risk caution (and planning to become even higher profile, it seems), it may make the whole thing sound less credible—maybe by a lot.
Please, please don’t do it or encourage others to do it.
I’m actually starting to worry that MIRI could make us worse off if they insist on shouting loudly and leaning on our least credible point. Public discourse isn’t rational, so focusing on the worst point could make the vibes-based public discussion go against what is otherwise a very simple and sane viewpoint: don’t make a smarter species unless you’re pretty sure it won’t turn on you.
Hopefully I needn’t worry, because MIRI has engaged communication experts, and they will resist just adopting EY’s unreasonable doom estimate and bad comms strategy.
To your specific point: “we’re really not sure” is not a bad strategy if “we” means humanity as a whole (if by bad you mean dishonest).
If by bad you mean ineffective: do you seriously think people wouldn’t object to the push for AGI if they thought we were totally unsure?
“One guy who’s thought about this for a long time and some other people he recruited think it’s definitely going to fail” really seems like a way worse argument than “expert opinion is utterly split, so any fool can see we collectively are completely unsure it’s safe”.
By bad I mean dishonest, and by ‘we’ I mean the speaker (in this case, MIRI).
I take myself to have two central claims across this thread:
Your initial comment was strawmanning the ‘if we build [ASI], we all die’ position.
MIRI is likely not a natural fit to consign itself to service as the neutral mouthpiece of scientific consensus.
I do not see where your most recent comment has any surface area with either of these claims.
I do want to offer some reassurance, though:
I do not take “One guy who’s thought about this for a long time and some other people he recruited think it’s definitely going to fail” to be descriptive of the MIRI comms strategy.
I think we’re talking past each other, so we’d better park it and come back to the topic later and more carefully.
I do feel like you’re misrepresenting my position, so I am going to respond and then quit there. You’re welcome to respond; I’ll try to resist carrying on, and move on to more productive things. I apologize for my somewhat argumentative tone. These are things I feel strongly about, since I think MIRI’s communication might matter quite a lot, but that’s not a good reason to get argumentative.
Strawmanning: I’m afraid you’re right that I’m probably exaggerating MIRI’s claims. I don’t think it’s quite a strawman; “if we build it we all die” is very much the tone I get from MIRI comms on LW and X (mostly EY), but I do note that I haven’t seen him use 99.9%+ in some time, so maybe he’s already doing some of what I suggest. And I haven’t surveyed all of MIRI’s official comms. But what we’re discussing is a change in comms strategy.
I have gotten more strident in repeated attempts to make my central point clearer. That’s my fault; you weren’t addressing my actual concern so I kept trying to highlight it. I still am not sure if you’re understanding my main point, but that’s fine; I can try to say it better in future iterations.
This is the first place I can see you suggesting that I’m exaggerating MIRI’s tone, so if it’s your central concern, that’s weird. But again, it’s a valid complaint; I won’t make that characterization in more public places, lest it hurt MIRI’s credibility.
MIRI claiming to accurately represent scientific consensus was never my suggestion; I don’t know where you got that. I clarified that I expect zero additional effort or strong claims, just “different experts believe a lot of different things”.
Honesty: I tried to specify from the first that I’m not suggesting dishonesty by any normal standard. Accurately reporting a (vague) range of others’ opinions is just as honest as reporting your own opinion. Not saying the least convincing part the loudest might be dishonesty by radical honesty standards, but I thought rationalists had more or less agreed that those aren’t a reasonable target. That standard of honesty would kind of conflict with having a “comms strategy” at all.