Quoting Rob Bensinger quoting Eliezer:

So what actually happens, as near as I can figure (predicting the future = hard), is that somebody is trying to teach their research AI to, God knows what, maybe just obey human orders in a safe way, and it seems to be doing that, and a mix of things goes wrong like:
- The preferences not being really readable, because it’s a system of neural nets acting on a world-representation built up by other neural nets; parts of the system are self-modifying, and the self-modifiers are being trained by gradient descent in Tensorflow.
- There’s a bunch of people in the company trying to work on a safer version, but it’s way less powerful than the one that does unrestricted self-modification; they’re really excited when the system seems to be substantially improving multiple components.
- There’s a social and cognitive conflict I find hard to empathize with, because I personally would be running screaming in the other direction two years earlier.
- There’s a lot of false alarms and suggested or attempted misbehavior that the creators all patch successfully; some instrumental strategies pass this filter because they arose in places that were harder to see and less transparent.
- The system at some point seems to finally “get it” and lock in to good behavior, which is the point at which it has a good enough human model to predict what gets the supervised rewards and what the humans don’t want to hear.
- They scale the system further; it goes past the point of real strategic understanding and having a little agent inside plotting.
- The programmers shut down six visibly formulated goals to develop cognitive steganography, and the seventh one slips through.
- Somebody says “slow down”; somebody else observes that China and Russia both managed to steal a copy of the code from six months ago, and while China might proceed cautiously, Russia probably won’t.
- The agent starts to conceal some capability gains; it builds an environmental subagent, and the environmental agent begins self-improving more freely.
- Undefined things happen as a sensory-supervision ML-based architecture shakes out into the convergent shape of expected utility, with a utility function over the environmental model.
- The main result is driven by whatever the self-modifying decision systems happen to see as locally optimal in their supervised system, locally acting on a different domain than the domain of data on which it was trained.
- The light cone is transformed to the optimum of a utility function that grew out of the stable version of a criterion that originally happened to be about a reward signal counter on a GPU, or God knows what.
Perhaps the optimal configuration for utility per unit of matter, under this utility function, happens to be a tiny molecular structure shaped roughly like a paperclip.
That is what a paperclip maximizer is. It does not come from a paperclip factory AI. That would be a silly idea and is a distortion of the original example.
Perhaps the optimal configuration for utility per unit of matter, under this utility function, happens to be a tiny molecular structure shaped roughly like a paperclip.
I think this is very improbable, but thanks for the quote. Not sure if it addresses my question?
Yudkowsky & I would of course agree that that is very improbable. It’s just an example.
The point I was making with this quote is that the question you are asking is a Big Old Unsolved Problem in the literature. If we had any idea what sort of utility function the system would end up with, that would be great, and an improvement over the status quo. Yudkowsky’s point in the quote is that it’s a complicated multi-step process we currently don’t have a clue about; it’s not nearly as simple as “the system will maximize reward.” A much better story would be: “The system will maximize some proxy, which will gradually evolve via SGD to be closer and closer to reward, but at some point it’ll get smart enough to go for reward for instrumental-convergence reasons, and at that point its proxy goal will crystallize.” But even this story is way too simplistic, and it doesn’t tell us much at all about what the proxy will actually look like, because so much depends on the exact order in which various things are learned.
I should have made it just a comment, not an answer.
because so much depends on the exact order in which various things are learned.
I actually doubt that claim in its stronger forms. I think there’s some substantial effect, but e.g. whether a child loves their family doesn’t depend strongly on the precise curriculum at grade school.
Yet whether a child grows up to work on x-risk reduction vs. homeless shelters vs. voting Democrats out of office vs. voting Republicans out of office does often depend on the precise curriculum in high school and college.
(I think we are in agreement here. I’d be interested to hear if you can point to any particular value AGI will probably have, or (weaker) any particular value such that if AGI has it, it doesn’t depend strongly on the curriculum, order in which concepts are learned, etc.)