Trivially, any AI smart enough to be truly dangerous is capable of doing our “alignment homework” for us, in the sense of having enough intelligence to solve the problem.
Is this trivial? People at least argue that the capability profile could be sufficiently unfortunate that AIs are extremely dangerous prior to being extremely useful. (As a particularly extreme case, people often argue that AIs will be qualitatively wildly superhuman in dangerous skills (e.g. persuasion) prior to being merely qualitatively human-level at doing AI safety research. See here for some discussion and see here for an example of someone arguing for AIs having qualitatively wildly superhuman abilities prior to being sufficiently useful in general.)
Of course, this could lead to an amusing case where AIs take over the world and then need to employ human safety/alignment researchers to solve their alignment homework for them : ).
Fair enough, “trivial” overstates the case. I do think it is overwhelmingly likely.
That said, I’m not sure how much we actually disagree on this? I was mostly trying to highlight the gap between an AI having a capability and us having enough control to usefully benefit from that capability.
I personally agree that on the default trajectory it’s very likely that, at the point where AIs are quite existentially dangerous (in the absence of serious countermeasures), they are also capable of being very useful (though misalignment might make them hard to use).
However, I think this is a key disagreement I have with more pessimistic people who think that at the point where models become useful, they’re also qualitatively wildly superhumanly dangerous. This implies (assuming some rough notion of continuity) that there were earlier AIs which weren’t very useful but were still dangerous in some ways.
Yeah, there are lots of ways to be useful, and not all require any superhuman capabilities. How much is broadly-effective intelligence vs. targeted capabilities development (seems like more of the former lately), how much is cheap-but-good-enough compared to humans vs. better-than-human along some axis, etc.