Another response is “The AI paralyzes your face into smiling.”
But this is actually a highly nontrivial claim about the internal balance of value and computation which this reinforcement schedule carves into the AI. Insofar as this response implies that an AI will primarily “care about” literally making you smile, that seems like a speculative and unsupported claim that the AI will internalize a single powerful decision-relevant criterion / shard of value, one which also happens to be related to the way humans conceive of the situation (i.e., someone is being made to smile).
Who do you think would make the claim that the AI in this scenario would care about “literally making you smile”, as opposed to some complex, non-human-comprehensible goal somewhat related to humans smiling? E.g. Yudkowsky gives the example of an AI in that situation learning to optimize for “tiny molecular smiley faces”, which is a much weirder generalization than “making you smile”, although I think it’s still less weird than the goal he’d actually expect such a system to learn (which wouldn’t be describable in a single four-word phrase).
I think the AI will very probably have a spread of situationally-activated computations which steer its actions towards historical reward-correlates (e.g., “if near a person, then tell a joke”), and will probably not singularly value, e.g., making people smile or reward.
I think this is what happens with less intelligent systems, and then as systems become more intelligent those correlates end up unified into higher-level abstractions which correspond to large-scale goals. I outline some of the arguments for that position in phase 3 here.
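The contrast being drawn here (a spread of contextually-triggered computations versus a single always-active criterion) can be made concrete with a toy sketch. This is purely illustrative and not a claim about how real trained policies are implemented; the names `Shard`, `shard_agent`, and `single_criterion_agent` are hypothetical, and the triggers are lifted straight from the quoted “if near a person, then tell a joke” example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Shard:
    """A situationally-activated computation: bids for a behavior only when its trigger fires."""
    trigger: Callable[[Dict], bool]  # predicate over the current situation
    action: str                      # behavior the shard bids for when active


# Several shards, each tied to a different historical reward-correlate.
SHARDS: List[Shard] = [
    Shard(trigger=lambda s: s.get("near_person", False), action="tell a joke"),
    Shard(trigger=lambda s: s.get("person_frowning", False), action="offer help"),
    Shard(trigger=lambda s: s.get("task_pending", False), action="finish the task"),
]


def shard_agent(situation: Dict) -> List[str]:
    """Return the bids of whichever shards this situation activates."""
    return [shard.action for shard in SHARDS if shard.trigger(situation)]


def single_criterion_agent(situation: Dict) -> str:
    """Caricature of 'singularly values making you smile': one criterion, active everywhere."""
    return "take whichever action most increases predicted smiling"


if __name__ == "__main__":
    situation = {"near_person": True, "task_pending": True}
    print(shard_agent(situation))             # ['tell a joke', 'finish the task']
    print(single_criterion_agent(situation))  # same single objective in every situation
```

In the toy version, the shard agent’s behavior varies with which triggers fire, whereas the single-criterion agent’s behavior is always downstream of one objective; the disagreement above is about which of these pictures (or what in between) reinforcement learning actually produces.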
Who do you think would make the claim that the AI in this scenario would care about “literally making you smile”, as opposed to some complex, non-human-comprehensible goal somewhat related to humans smiling?
I don’t know? Seems like a representative kind of “potential risk” I’ve read about before, but I’m not going to go dig it up right now. (My post also isn’t primarily about who said what, so I’m confused by your motivation for posting this question?)
I’ve often repeated scenarios like this, or like the paperclip scenario.
My intention was never to state that the specific scenario was plausible or default or expected, but rather, that we do not know how to rule it out, and because of that, something similarly bad (but unexpected and hard to predict) might happen.
The structure of the argument we eventually want is one which could (probabilistically, and of course under some assumptions) rule out this outcome. So to me, pointing it out as a possible outcome is a way of highlighting the inadequacy of our current ability to analyze the situation, not part of a proto-model in which we are conjecturing that we will be able to predict “the AI will make paperclips” or “the AI will literally try to make you smile”.