In the first paragraph you are saying that “seeking influence” is not something that a system will learn to do if that was not a possible strategy in the training regime. (But couldn’t it appear as an emergent property? Certainly humans were not trained to launch rockets, yet they nevertheless did.)
In the second paragraph you are saying that common sense sometimes allows you to modify the goals you were given (but for this to apply to AI systems, wouldn’t they need to have common sense in the first place, which kind of assumes that the AI is already aligned?)
In the third paragraph it seems to me that you are saying that humans have some goals that have a built-in override mechanism in them—e.g. in general humans have a goal of eating delicious cake, but they will forego this goal in the interest of seeking water if they are about to die of dehydration (but doesn’t this seem to be a consequence of these goals being just instrumental things that proxy the complex thing that humans actually care about?)
I think I am confused because I do not understand your overall point, so the three paragraphs seem to be saying wildly different things to me.
Hey, thanks for the questions! It’s a very confusing topic so I definitely don’t have a fully coherent picture of it myself. But my best attempt at a coherent overall point:
> In the first paragraph you are saying that “seeking influence” is not something that a system will learn to do if that was not a possible strategy in the training regime.
No, I’m saying that giving an agent a goal, in the context of modern machine learning, involves reinforcement in the training regime. It’s not clear to me exactly what goals will result from this, but we can’t just assume that we can “give an AI the final goal of evaluating the Riemann hypothesis” in a way that’s devoid of all context.
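To make that concrete, here is a minimal toy sketch of the kind of setup I have in mind (the line-world environment, reward function, and Q-learning parameters are all illustrative assumptions of mine, not anything from an actual training setup): the only “goal” the agent ever receives is a reward signal over the states and actions it actually encounters during training.

```python
# A toy sketch (illustrative assumptions throughout): in RL, the "goal" is just a
# reward signal applied to whatever the agent actually does during training.
import random

ACTIONS = [-1, +1]      # step left or right on a small line-world
GOAL_STATE = 3
STATES = range(-2, 5)

def reward(state):
    # This function is the *only* place the goal is specified.
    return 1.0 if state == GOAL_STATE else 0.0

# Tabular Q-learning: values only change for (state, action) pairs that get visited.
q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for episode in range(500):
    state = 0
    for _ in range(10):
        # epsilon-greedy choice between exploring and exploiting
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        next_state = min(max(state + action, -2), 4)
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        # The goal enters only through this reward term; strategies that never
        # paid off (or never occurred) in training accumulate no value.
        q[(state, action)] += alpha * (reward(next_state) + gamma * best_next - q[(state, action)])
        state = next_state
        if state == GOAL_STATE:
            break

# Whatever behaviour this policy has was reinforced in this training regime,
# not derived from a context-free statement of a "final goal".
print({s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES})
```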
> you are saying that common sense sometimes allows you to modify the goals you were given (but for this to apply to AI systems, wouldn’t they need to have common sense in the first place, which kind of assumes that the AI is already aligned?)
It may be the case that it’s very hard to train AIs without common sense of some kind, potentially because a) that’s just the default for how minds work: they don’t by default extrapolate to crazy edge cases; and b) common sense is very useful in general. For example, if you train AIs on obeying human instructions, then they will only do well in the training environment if they have a common-sense understanding of what humans mean.
> humans have some goals that have a built-in override mechanism in them
No, it’s more that the goal itself is only defined in a small-scale setting, because the agent doesn’t think in ways which naturally extrapolate small-scale goals to large scales.
Perhaps it’s useful to think about having the goal of getting a coffee. And suppose there is some unusual action you can take to increase the chances that you get the coffee by 1%. For example, you could order ten coffees instead of one coffee, to make sure at least one of them arrives. There are at least two reasons you might not take this unusual action. In some cases it goes against your values—for example, if you want to save money. But even if that’s not true, you might just not think about what you’re doing as “ensure that I have coffee with maximum probability”, but rather just “get a coffee”. This goal is not high-stakes enough for you to actually extrapolate beyond the standard context. And then some people are just like that with all their goals—so why couldn’t an AI be too?
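To put that contrast in toy terms (the probabilities and the “good enough” threshold below are numbers I’m making up purely for illustration): an agent that extrapolates the goal to “ensure coffee with maximum probability” takes the weird ten-coffees action for the extra 1%, while a satisficing agent pursuing plain “get a coffee” stops at the ordinary plan.

```python
# Illustrative numbers only: contrast a probability-maximizing chooser with a
# satisficing one on the coffee example.
P_ONE_COFFEE = 0.98       # assumed chance a single order arrives
P_TEN_COFFEES = 0.99      # assumed chance at least one of ten orders arrives
GOOD_ENOUGH = 0.95        # assumed satisficing threshold

plans = {"order one coffee": P_ONE_COFFEE, "order ten coffees": P_TEN_COFFEES}

def maximizer_choice(plans):
    # Extrapolates the goal to "ensure I have coffee with maximum probability".
    return max(plans, key=plans.get)

def satisficer_choice(plans, threshold):
    # Takes the first ordinary plan that clears the threshold; the extra 1%
    # never comes up, because the goal isn't treated as high-stakes.
    for plan, p in plans.items():
        if p >= threshold:
            return plan

print(maximizer_choice(plans))                # order ten coffees
print(satisficer_choice(plans, GOOD_ENOUGH))  # order one coffee
```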
I think this helped me a lot to understand you a bit better—thank you
Let me try paraphrasing this:
> Humans are our best example of a sort-of-general intelligence. And humans have a lazy, satisficing, ‘small-scale’ kind of reasoning that is mostly only well suited for activities close to their ‘training regime’. Hence AGIs may be the same—and in particular, if AGIs are trained with Reinforcement Learning and heavily rewarded for following human intentions, this may be a likely outcome.
Is that pointing in the direction you intended?