Hey, thanks for the questions! It’s a very confusing topic so I definitely don’t have a fully coherent picture of it myself. But my best attempt at a coherent overall point:
> In the first paragraph you are saying that “seeking influence” is not something that a system will learn to do if that was not a possible strategy in the training regime.
No, I’m saying that giving an agent a goal, in the context of modern machine learning, involves reinforcement in the training regime. It’s not clear to me exactly what goals will result from this, but we can’t just assume that we can “give an AI the final goal of evaluating the Riemann hypothesis” in a way that’s devoid of all context.
> you are saying that common sense sometimes allows you to modify the goals you were given (but for this to apply to AI systems, wouldn’t they need to have common sense in the first place, which kind of assumes that the AI is already aligned?)
It may be the case that it’s very hard to train AIs without common sense of some kind, potentially because a) that’s just the default for how minds work: they don’t by default extrapolate to crazy edge cases; and b) common sense is very useful in general. For example, if you train AIs on obeying human instructions, then they will only do well in the training environment if they have a common-sense understanding of what humans mean.
> humans have some goals that have a built-in override mechanism in them
No, it’s more that the goal itself is only defined in a small-scale setting, because the agent doesn’t think in ways which naturally extrapolate small-scale goals to large scales.
Perhaps it’s useful to think about having the goal of getting a coffee. And suppose there is some unusual action you can take to increase the chances that you get the coffee by 1%. For example, you could order ten coffees instead of one coffee, to make sure at least one of them arrives. There are at least two reasons you might not take this unusual action. In some cases it goes against your values—for example, if you want to save money. But even if that’s not true, you might just not think about what you’re doing as “ensure that I have coffee with maximum probability”, but rather just “get a coffee”. This goal is not high-stakes enough for you to actually extrapolate beyond the standard context. And then some people are just like that with all their goals—so why couldn’t an AI be too?
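To put rough numbers on that 1% (these are made-up figures, purely for illustration): suppose any single coffee order arrives with probability 0.99, and the ten orders succeed or fail independently. Then

$$P(\text{at least one coffee arrives}) = 1 - (1 - 0.99)^{10} = 1 - 10^{-20} \approx 1$$

So the unusual strategy buys you only about one extra percentage point of certainty over just ordering once, and most people never even bother to frame the decision that way.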
I think this helped me a lot to understand you a bit better—thank you
Let me try paraphrasing this:
> Humans are our best example of a sort-of-general intelligence. And humans have a lazy, satisficing, ‘small-scale’ kind of reasoning that is mostly only well suited for activities close to their ‘training regime’. Hence AGIs may also be the same—and in particular, if AGIs are trained with Reinforcement Learning and heavily rewarded for following human intentions, this may be a likely outcome.
Is that pointing in the direction you intended?