Suppose one of your friends became 10x more intelligent, or got a superpower where they could choose at will to stop time for everything except themselves and a laptop (that magically still has Internet access). Is this a net positive change to the world, or a net negative one?
I expect it to be net negative. My model is roughly this: humans are not very agentic (able to reliably achieve/optimize for a goal) in absolute terms, even though we may feel especially agentic relative to other systems. Because humans bumble a lot, they don’t tend to have much individual impact; things work out well or poorly on average as the result of many moves that cancel each other out, leaving only a small net gain or loss in valued outcomes. A 10x smarter human would be more agentic, and if they are not exactly right about how to do good, they could more easily do harm that their ineffectiveness would normally buffer.

I build this intuition from, for example, the way dictators often screw things up even when they are well intentioned: they have more power to achieve their goals, and that power amplifies their mistakes and misunderstandings in ways that cause more impact, more variance, and historically worse outcomes than less agentic forms of leadership.

Although this is not a perfect analogy, because someone 10x smarter is not just 10x more powerful/agentic but also 10x better able to think through consequences (which the dictators lacked), I think the orthogonality thesis is robust enough that 10x more intelligence is unlikely to come with a matching ability to think through consequences that perfectly offsets the risks of greater agency.
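The cancellation intuition above can be sketched as a toy Monte Carlo simulation (purely illustrative: the bias and noise parameters, and the idea of summing per-actor contributions, are assumptions of this sketch, not part of the argument itself):

```python
import random
import statistics

def world_outcome(agencies, rng, aim_bias=-0.05, noise=1.0):
    """One run of a toy world. Each actor tries to do good; their realized
    contribution is agency * (aim_bias + noise draw). aim_bias < 0 encodes
    being 'not exactly right about how to do good'; the noise is bumbling.
    The world's valued outcome is the sum of contributions."""
    return sum(a * (aim_bias + rng.gauss(0, noise)) for a in agencies)

def simulate(agencies, trials=5000, seed=0):
    rng = random.Random(seed)
    outcomes = [world_outcome(agencies, rng) for _ in range(trials)]
    return statistics.mean(outcomes), statistics.stdev(outcomes)

baseline = [1.0] * 100           # a world of low-agency bumblers
boosted = [10.0] + [1.0] * 99    # same world, one actor made 10x more agentic

m0, s0 = simulate(baseline)
m1, s1 = simulate(boosted)
# Analytically, expected harm scales with total agency (100 * -0.05 = -5.0
# vs. 109 * -0.05 = -5.45), and outcome variance grows with the sum of
# squared agencies, so the boosted world is both worse on average and
# noticeably higher-variance.
```

Here the slight negative aim bias stands in for “not exactly right about how to do good”; boosting one actor’s agency both worsens the expected outcome and widens the spread of possible outcomes, which is the variance-amplification worry in miniature.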
Wait, I infer alignment from way more than just observed behavior. In the case of my friends, I have a model of how humans work in general, informed both by theory (e.g. evolutionary psychology) and empirical evidence (e.g. reasoning about how I would do X, and projecting it onto them). In the case of AI systems, I would want similar additional information beyond just their behavior, e.g. an understanding of what their training process incentivizes, running counterfactual queries on them early in training when they are still relatively unintelligent and I can understand them, etc.
Exactly, because you can’t infer alignment from observed behavior without normative assumptions. I’m saying even with all that (or especially with all of that), the measurement gap is large and we should expect high deviance from the target that will readily lead to Goodharting.
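The Goodharting worry can be made concrete with a toy selection experiment (a sketch under made-up assumptions: Gaussian true values, Gaussian measurement error, and optimization pressure modeled as picking the best of N candidates):

```python
import random

def goodhart_demo(optimization_power, noise=2.0, trials=2000, seed=0):
    """Pick the best-looking of `optimization_power` candidates according to
    a noisy proxy (proxy = true value + measurement error), then report the
    winner's average true value and the average gap between what the proxy
    promised and what was actually obtained."""
    rng = random.Random(seed)
    true_vals, gaps = [], []
    for _ in range(trials):
        candidates = [rng.gauss(0, 1) for _ in range(optimization_power)]
        proxies = [v + rng.gauss(0, noise) for v in candidates]
        best = max(range(optimization_power), key=lambda i: proxies[i])
        true_vals.append(candidates[best])
        gaps.append(proxies[best] - candidates[best])
    return sum(true_vals) / trials, sum(gaps) / trials

weak_true, weak_gap = goodhart_demo(10)        # mild selection pressure
strong_true, strong_gap = goodhart_demo(1000)  # heavy selection pressure
# The harder the proxy is optimized, the more of the winner's apparent score
# is measurement error: the promised-vs-actual gap grows much faster than
# the true value of what was selected.
```

This is only the mildest (regressional) form of Goodhart: the proxy still tracks some real value, yet the harder it is optimized, the more the winner’s score is dominated by measurement error rather than by the thing actually cared about.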
It’s not obvious to me that modeling the generators of a thing is easier than modeling the thing itself, e.g. it’s much easier for me to model humans than to model evolution.
It’s definitely harder. That’s a reasonable consideration when we’re trying to engineer a system that will be good enough while racing against the clock, and I think it’s quite reasonable, for example, that we’re going to try to tackle value alignment via extensions to narrow value learning approaches first, because those are easier to build. But I also think those approaches will fail, so I’m looking ahead to where I see the limits of our knowledge, toward what we’ll have to do conditioned on this bet I’m making: that value learning approaches similar in kind to those we’re trying now won’t produce aligned AIs.
I’d be interested in specific examples of well-intentioned dictators that screwed things up (though I anticipate my objections will be that 1. they weren’t well-intentioned or 2. they didn’t have the power to actually impose decisions centrally, and had to spend most of their power ensuring that they remained in power).
> I’m saying even with all that (or especially with all of that), the measurement gap is large and we should expect high deviance from the target that will readily lead to Goodharting.
I know you’re saying that, I just don’t see many arguments for it. From my perspective, you are asserting that Goodhart problems are robust, rather than arguing for it. That’s fine, you can just call it an intuition you have, but to the extent you want to change my mind, restating it in different words is not very likely to work.
> It’s definitely harder.
This is an assertion, not an argument.
Do you really believe that you can predict facts about humans better just by reasoning about evolution (and using no information you’ve learned by looking at humans), relative to building a model by looking at humans (and using no information you’ve learned from the theory of evolution)? I suspect you actually mean some other thing, but idk what.
> I’d be interested in specific examples of well-intentioned dictators that screwed things up (though I anticipate my objections will be that 1. they weren’t well-intentioned or 2. they didn’t have the power to actually impose decisions centrally, and had to spend most of their power ensuring that they remained in power).
Some examples of actions taken by dictators that were, I think, well intentioned (meant to further goals that seemed laudable, and not about power grabbing, from the dictator’s point of view) but had net negative outcomes for the people involved and the world:
Joseph Stalin’s collectivization of farms
Tokugawa Iemitsu’s closing off of Japan
Hugo Chávez’s nationalization of many industries
> I know you’re saying that, I just don’t see many arguments for it. From my perspective, you are asserting that Goodhart problems are robust, rather than arguing for it. That’s fine, you can just call it an intuition you have, but to the extent you want to change my mind, restating it in different words is not very likely to work.

> Do you really believe that you can predict facts about humans better just by reasoning about evolution (and using no information you’ve learned by looking at humans), relative to building a model by looking at humans (and using no information you’ve learned from the theory of evolution)? I suspect you actually mean some other thing, but idk what.
No, it’s not my goal that we not look at humans. Rather, I think we’re currently too focused on trying to figure out everything from the kinds of evidence we can easily collect today, and we don’t have detailed enough models to know what other evidence is likely to be relevant. I think understanding whatever is going on with values is hard because there is relevant data further “down the stack,” if you will, from observations of behavior. I think that because of issues like latent preferences: by definition these exist because we didn’t have enough data to infer their existence, but they need not stay hidden if we gather more data about how they are generated, such that we could discover them in advance by looking earlier in the process that generates them.
> Some examples of actions taken by dictators that were, I think, well intentioned (meant to further goals that seemed laudable, and not about power grabbing, from the dictator’s point of view) but had net negative outcomes for the people involved and the world:
What’s your model for why those actions weren’t undone?
To pop back up to the original question—if you think making your friend 10x more intelligent would be net negative, would you make them 10x dumber? Or perhaps it’s only good to make them 2x smarter, but after that more marginal intelligence is bad?
It would be really shocking if we were at the optimal absolute level of intelligence, so I assume that you think we’re at the optimal relative level of intelligence, that is, the best situation is when your friends are about as intelligent as you are. In that case, let’s suppose that we increase/decrease all of your friends and your intelligence by a factor of X. For what range of X would you expect this intervention is net positive?
(I’m aware that intelligence is not one-dimensional, but I feel like this is still a mostly meaningful question.)
Just to be clear about my own position: a well-intentioned superintelligent AI system totally could make mistakes. However, it seems pretty unlikely that they’d be of the existentially catastrophic kind. Also, a given mistake could be net negative, but the AI system overall should be net positive.
> What’s your model for why those actions weren’t undone?
Not quite sure what you’re asking here. In the first two cases the actions eventually were undone, after people got fed up with the situation; the last is recent enough that I don’t consider its not having been undone yet as evidence that people like it, only that they don’t have the power to change it. My view is that these changes stayed in place because the dictators and their successors continued to believe the good outweighed the harm, either when this was clearly contrary to the ground truth but served some narrow purpose viewed as more important, or when the ground truth was too hard to discover at the time and we only believe the change was net harmful through the lens of historical analysis.
> To pop back up to the original question—if you think making your friend 10x more intelligent would be net negative, would you make them 10x dumber? Or perhaps it’s only good to make them 2x smarter, but after that more marginal intelligence is bad?

> It would be really shocking if we were at the optimal absolute level of intelligence, so I assume that you think we’re at the optimal relative level of intelligence, that is, the best situation is when your friends are about as intelligent as you are. In that case, let’s suppose that we increase/decrease all of your friends and your intelligence by a factor of X. For what range of X would you expect this intervention is net positive?
I’m not claiming we’re at some optimal level of intelligence for any particular purpose, only that more intelligence leads to greater agency, which, in the absence of sufficient mechanisms to constrain actions to beneficial ones, results in greater risk of negative outcomes from things like deviance and unilateral action. Thus I do in fact think we’d be safer from ourselves if we were dumber (setting aside existential risks humanity faces from outside threats, like asteroids).
By comparison, chimpanzees may not live what look to us like very happy lives, and they are some factor dumber than us, but they also aren’t at risk of making themselves extinct because one chimp really wanted a lot of bananas.
I’m not sure how much smarter we could all get without putting ourselves at too much risk. I think there’s an anthropic argument that we are below whatever level of intelligence is dangerous to ourselves without greater safeguards, since we wouldn’t exist in universes where greater intelligence had led us to kill ourselves. But I feel I have little evidence for judging how much smarter is safe, given that, for example, being, say, 95th-percentile smart didn’t stop people from building things like atomic weapons or developing dangerous chemical applications. I would expect making my friends smarter to risk similarly bad outcomes. Making them dumber seems safer, especially when I’m in the frame of thinking about AGI.
Man, I do not share that intuition.
I’ve made my case for that here.