Then you additionally need to explain why the AI will do our alignment homework for us while refraining from helping with the capabilities homework.
I think there is an important distinction between “If given substantial investment, would the plan to use the AIs to do alignment research work?” and “Will it work in practice given realistic investment?”.
The cost of the approach where the AIs do alignment research might look like 2 years of delay in median worlds and perhaps considerably more delay with some probability.
This is a substantial cost, but it’s not an insanely high cost.
I feel a bit confused about your comment: I agree with each individual claim, but it seems like perhaps you meant to imply something beyond the individual claims. (Which I either don’t understand or perhaps disagree with.)
Are you saying something like: “Yeah, I think that while this plan would work in theory, I expect it to be hopeless in practice (or unnecessary because the homework wasn’t hard in the first place).”?
If yes, then I agree—but I feel that of the two questions, “would the plan work in theory” is the much less interesting one. (For example, suppose that OpenAI could in theory use AI to solve alignment in 2 years. Then this won’t really matter unless they can refrain from using that same AI to build misaligned superintelligence in 1.5 years. Or suppose the world could solve AI alignment if the US government instituted a 2-year moratorium on AI research—then this won’t really matter unless the US government actually does that.)
I just think these are important concepts to distinguish, because it’s useful to notice the extent to which problems could be solved by a moderate amount of coordination, and which asks could suffice for safety.
I wasn’t particularly trying to make a broader claim, just trying to highlight something that seemed important.
My overall guess is that it’s about 50% likely that people pay costs equivalent to 2 years of delay for existential safety reasons. (Though I’m uncertain overall, and this is possible to influence.) Thus, ensuring that the plan for spending that budget is as good as possible looks quite valuable, and not hopeless overall.
By analogy, note that Google bears substantial costs to improve security (e.g. running 10% slower).
I think that if we could ensure the implementation of our best safety plans that cost only a few years of delay, we’d be in a much better position.