their >90% doom disagrees with almost everyone else who thinks seriously about AGI risk.
The fact that your next sentence refers to Rohin Shah and Paul Christiano, but no one else, makes me worry that for you, only alignment researchers are serious thinkers about AGI risk. Please consider that anyone whose P(doom) is over 90% is extremely unlikely to become an alignment researcher (or to remain one if their P(doom) became high while they were an alignment researcher), because their model will tend to predict that alignment research is futile or that it actually increases P(doom).
There is a comment here (which I probably cannot find again) by someone who was in AI research in the 1990s, then realized that the AI project is actually quite dangerous and changed careers to something else. I worry that you are not counting people like him as people who have thought seriously about AGI risk.
I shouldn’t have said “almost everyone else”; I should have said “most people who think seriously about AGI risk”.
I can see that implication. I certainly don’t think that only paid alignment researchers have thought seriously about AGI risk.
Your point about self-selection is quite valid.
Depth of thought does count. A person who says “bridges seem like they’d be super dangerous, so I’d never want to try building one”, and so doesn’t become an engineer, does not have a very informed opinion on bridge safety.
There is an interesting interaction between depth of thought and initial opinions. If someone thinks a moderate amount about alignment, concludes it’s super difficult, and so does something else, they will probably cease thinking deeply about alignment—but they could’ve had some valid insights that led them to stop thinking about the topic. Someone who thinks for the same amount of time but from a different starting point, and who concludes “seems like it should be fairly do-able”, might then pursue alignment research and go on to think more deeply. Their different starting points will probably bias their ultimate conclusions—and so will the desire to follow the career path they’ve started on.
So probably we should adjust our estimate of difficulty upward to account for the bias you mention.
But even making an estimate at this point seems premature.
I mention Christiano and Shah because I’ve seen them most visibly try to fully come to grips with the strongest arguments for alignment being very difficult. Ideally, every alignment researcher would do that, and every pause advocate would work just as hard to fully understand the arguments for alignment being achievable. Not everyone will have the time or inclination to do that.
Judging alignment difficulty has to be done by gauging the amount of time-on-task combined with the amount of good-faith consideration of arguments one doesn’t like. That’s the case with everything.
When I try to do that as carefully as I know how, I reach the conclusion that we collectively just don’t know.
Having written that, I have a hard time identifying people who believe alignment is near-impossible who have visibly made an effort to steelman the best arguments that it won’t be that hard. I think that’s understandable; those folks, MIRI and some other individuals, spend a lot of effort trying to correct the thinking of people who are simply over-optimistic because they haven’t thought through the problem far enough yet.
I’d like to write a post called “we should really figure out how hard alignment is”, because I don’t think anyone can reasonably claim to know yet. And without that, we can’t really make strong recommendations for policy and strategy.
I guess that conclusion is enough to say wow, jeez, we should probably not rush toward AGI if we have no real idea how hard it will be to align. I’d much prefer to see that argument than e.g., Max Tegmark saying things along the lines of “we have no idea how to align AGI so it’s a suicide race”. We have lots of ideas at this point, we just don’t know if they will work.