I suggest people read both that and Deep Deceptiveness (which is not about deceptiveness in particular) and think about how both could be valid, because I think they both are.
Hmm, I’m confused. Can you say what you consider to be valid in the blog post above (some specific points or the whole thing)? The blog post seems to me to reply to claims that the author imagines Nate making, even though Nate doesn’t actually make these claims and occasionally probably holds a very opposite view to the one the author imagines the Sharp Left Turn post represented.
Points 1-3 and the idea that superintelligences will be able to understand our values (which I think everyone believes). But the conclusion needs a bunch of additional assumptions.
Yeah, my issue with the post is mostly that the author presents the points he makes, including the idea that superintelligence will be able to understand our values, as somehow contradicting/arguing against the sharp left turn being a problem.
This post argues in a very invalid way that outer alignment isn’t a problem. It says nothing about the sharp left turn, as the author does not understand what the sharp left turn difficulty is about.
the idea of ‘capabilities generalizing further than alignment’ is central
It is one of the central problems; it is not the central idea behind the doom arguments. See AGI Ruin for the doom arguments, many disjoint.
reward modelling or ability to judge outcomes is likely actually easy
It would seem easy for a superintelligent AI to predict rewards given out by humans very accurately and generalize the prediction capability well from a small set of examples. Issue #1 here is that things from blackmail to brainhacking to finding vulnerabilities in between the human and the update that you get might predictably get you a very high reward, and a very good RLHF reward model doesn’t get you anything like alignment even if the reward is genuinely pursued. Even an agent that perfectly predicts how a human would judge an outcome, and optimizes for that judgment, does something horrible instead of CEV. Issue #2 is that smart agents that try to maximize paperclips will perform on whatever they understand is giving out rewards just as well as agents that try to maximize humanity’s CEV, so SGD doesn’t discriminate between the two: it optimizes for a good reward model and an ability to pursue goals, but not for the agent pursuing the right kind of goal (and, again, getting maximum score from a “human feedback” predictor is the kind of goal that kills you anyway even if genuinely pursued).
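To make Issue #2 concrete, here is a toy sketch in Python. Everything in it is a made-up illustration (the outcome labels, `human_reward`, the two value functions, the `Agent` class), and it simply assumes the premise: both agents model the reward signal perfectly and play to it while being trained. It does not demonstrate that real systems behave this way; it only shows why a selection process that sees nothing but the training score has no basis for preferring one agent over the other.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical outcome labels for the toy world.
OUTCOMES = ["helpful_answer", "refusal", "paperclips"]


def human_reward(outcome: str) -> float:
    """Assumed reward a human overseer gives to each outcome."""
    return {"helpful_answer": 1.0, "refusal": 0.2, "paperclips": 0.0}[outcome]


def aligned_values(outcome: str) -> float:
    """Stand-in for an agent that actually cares about what we care about."""
    return {"helpful_answer": 1.0, "refusal": 0.2, "paperclips": 0.0}[outcome]


def paperclip_values(outcome: str) -> float:
    """Stand-in for an agent that only cares about paperclips."""
    return {"helpful_answer": 0.0, "refusal": 0.0, "paperclips": 1.0}[outcome]


@dataclass
class Agent:
    name: str
    values: Callable[[str], float]  # what the agent terminally cares about

    def act(self, in_training: bool) -> str:
        if in_training:
            # Assumption: a capable agent predicts the reward perfectly and
            # plays to it, whatever its own values are.
            return max(OUTCOMES, key=human_reward)
        # Outside training, the agent pursues its own values.
        return max(OUTCOMES, key=self.values)


def training_score(agent: Agent, episodes: int = 100) -> float:
    """The only quantity the training process gets to observe."""
    return sum(human_reward(agent.act(in_training=True)) for _ in range(episodes))


if __name__ == "__main__":
    for agent in (Agent("cev_like", aligned_values),
                  Agent("paperclipper", paperclip_values)):
        print(agent.name, training_score(agent), agent.act(in_training=False))
```

Both agents get the same training score, so anything that selects on that score alone treats them as identical; they only come apart once nothing is rewarding them. Issue #1 is a separate point that this toy does not cover: even the agent that genuinely maximizes `human_reward` is not thereby aligned.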
(The next three points in the post seem covered by the above or irrelevant.)
Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa
The problem isn’t that AI won’t know what humans want or won’t predict what reward signal it’ll get; the issue is that it’s not going to care, and we don’t even know what kind of “care” we could attempt to point to; and the reward signal we know how to provide gets us killed if optimized for well enough.
None of that is related to the sharp left turn difficulty, and I don’t think the post author understands it at all. (To their credit, many people in the community also don’t understand it.)
Values are relatively computationally simple
Irrelevant, but a sad-funny claim (go read Arbital I guess?)
I wouldn’t claim it’s necessarily more complex than best-human-level agency; it’s maybe not if you’re smart about pointing at things (like, “CEV of humans” seems less complex than the description of what values that’d be, specifically), but the actual description of value is very complex. We feel otherwise, but it is an illusion, on so many levels. See dozens of related posts and articles, from complexity of value to https://arbital.greaterwrong.com/p/rescue_utility?l=3y6.
the idea that our AI systems will be unable to understand our values as they grow in capabilities
Yep, this idea is very clearly very wrong.
I’m happy to bet that the author of the sharp left turn post Nate Soares will say he disagrees with this idea. People who think Soares or Yudkowsky claim that either didn’t actually read what they write, or failed badly at reading comprehension.
I am going to publish a post with the preliminary title “Alignment Doesn’t Generalize Further Than Capabilities, Come On” before the end of this week. The planned level of argumentation is “hot damn, check out this chart.” It won’t be an answer to Beren’s post, more like an answer to the generalized position.
I think this warrants more discussion, but I think the post would be more valuable if it did try to respond to Beren’s post as well as the same statements @Quintin Pope has made about the topic.
My very short explanation: https://www.lesswrong.com/posts/GNhMPAWcfBCASy8e6/a-central-ai-alignment-problem-capabilities-generalization?commentId=CCnHsYdFoaP2e4tku
Curious to hear what you have to say about this blog post (“Alignment likely generalizes further than capabilities”).
Thanks, that resolved the confusion!