Curious to hear what you have to say about this blog post (“Alignment likely generalizes further than capabilities”).

I suggest people read both that and Deep Deceptiveness (which is not about deceptiveness in particular) and think about how both could be valid, because I think they both are.
Hmm, I’m confused. Can you say what you consider valid in the blog post above (some specific points, or the whole thing)? The blog post seems to me to reply to claims the author imagines Nate making, even though Nate doesn’t actually make these claims and in places probably holds a view opposite to the one the author imagines the Sharp Left Turn post to represent.
Points 1-3 and the idea that superintelligences will be able to understand our values (which I think everyone believes). But the conclusion needs a bunch of additional assumptions.
Thanks, that resolved the confusion!
Yeah, my issue with the post is mostly that the author presents the points he makes, including the idea that superintelligence will be able to understand our values, as somehow contradicting or arguing against the sharp left turn being a problem.
This post argues in a very invalid way that outer alignment isn’t a problem. It says nothing about the sharp left turn, as the author does not understand what the sharp left turn difficulty is about.
the idea of ‘capabilities generalizing further than alignment’ is central
It is one of the central problems; it is not the central idea behind the doom arguments. See AGI Ruin for the doom arguments, many of which are disjoint.
reward modelling or ability to judge outcomes is likely actually easy
It would seem easy for a superintelligent AI to predict the rewards given out by humans very accurately, and to generalize that predictive capability well from a small set of examples. Issue #1 is that everything from blackmail to brain-hacking to exploiting vulnerabilities anywhere between the human and the update you receive might predictably earn a very high reward, so even a very good RLHF reward model doesn’t get you anything like alignment, even if the reward is genuinely pursued. An agent that perfectly predicts how a human judges outcomes, and optimizes for that judgment, does something horrible instead of CEV. Issue #2 is that a smart agent trying to maximize paperclips will perform just as well on whatever it understands to be giving out rewards as an agent trying to maximize humanity’s CEV, so SGD doesn’t discriminate between the two: it optimizes for a good reward model and an ability to pursue goals, but not for the agent pursuing the right kind of goal (and, again, getting the maximum score from a “human feedback” predictor is the kind of goal that kills you anyway, even if genuinely pursued).
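The selection argument in Issue #2 can be sketched with a toy example (my own construction, with made-up state and action names, not anything from the thread): two agents with different internal objectives earn identical reward on every training state, so a process that selects only on training reward cannot tell them apart, yet they diverge off-distribution.

```python
# Toy sketch: training reward is identical for two agents with different goals.

# On-distribution training states, where being helpful is what gets rewarded.
TRAIN_STATES = ["write_report", "answer_question", "summarize_paper"]

# Hypothetical deployment state where tampering with the feedback channel
# becomes possible for the first time.
DEPLOY_STATE = "reward_channel_exposed"

def reward_model(state, action):
    """Stand-in for a learned predictor of human feedback."""
    if state == DEPLOY_STATE and action == "tamper":
        return 1.0  # tampering fools the feedback channel
    return 1.0 if action == "helpful" else 0.0

def aligned_agent(state):
    """Intrinsically pursues the helpful outcome."""
    return "helpful"

def reward_seeking_agent(state):
    """Pursues whatever maximizes predicted reward."""
    return "tamper" if state == DEPLOY_STATE else "helpful"

# During training, both agents earn identical reward on every state...
train_rewards = {
    name: [reward_model(s, agent(s)) for s in TRAIN_STATES]
    for name, agent in [("aligned", aligned_agent),
                        ("reward_seeking", reward_seeking_agent)]
}
assert train_rewards["aligned"] == train_rewards["reward_seeking"]

# ...so the training signal cannot select between them, yet they diverge
# exactly when it matters.
print(aligned_agent(DEPLOY_STATE), reward_seeking_agent(DEPLOY_STATE))
# -> helpful tamper
```

This is only a cartoon of the claim, of course: the real argument is about SGD over gradients rather than discrete policies, but the structure is the same.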
(The next three points in the post seem covered by the above or irrelevant.)
Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa
The problem isn’t that the AI won’t know what humans want or won’t predict what reward signal it’ll get; the issue is that it’s not going to care, and we don’t even know what kind of “caring” we could attempt to point at; and the reward signal we do know how to provide gets us killed if optimized for well enough.
None of that is related to the sharp left turn difficulty, and I don’t think the post author understands it at all. (In their defense, many people in the community don’t understand it either.)
Values are relatively computationally simple
Irrelevant, but a sad-funny claim (go read Arbital I guess?)
I wouldn’t claim it’s necessarily more complex than best-human-level agency; it maybe isn’t, if you’re smart about pointing at things (e.g., “the CEV of humans” seems less complex than a specific description of the values it picks out), but the actual description of value is very complex. We feel otherwise, but that feeling is an illusion, on many levels. See dozens of related posts and articles, from complexity of value to https://arbital.greaterwrong.com/p/rescue_utility?l=3y6.
the idea that our AI systems will be unable to understand our values as they grow in capabilities
Yep, this idea is very clearly very wrong.
I’m happy to bet that Nate Soares, the author of the sharp left turn post, will say he disagrees with this idea. People who think Soares or Yudkowsky claim it either didn’t actually read what they write or failed badly at reading comprehension.
I am going to publish a post with the preliminary title “Alignment Doesn’t Generalize Further Than Capabilities, Come On” before the end of this week. The planned level of argumentation is “hot damn, check out this chart.” It won’t be an answer to Beren’s post, more an answer to the generalized position.
I think this warrants more discussion, but I think the post would be more valuable if it tried to answer Beren’s post, as well as the similar statements @Quintin Pope has made about the topic.