I would say the metaphor of giving dogs biscuits is actually a better analogy than the one you suggest. Just like how a neural network never “gets reward” in the sense of some tangible, physical thing that is given to it, the (subcomponents of the) dog’s brain never gets the biscuit that the dog was fed. The biscuit goes into the dog’s stomach, not its brain.
The way the dog learns from the biscuit-giving process is that the dog’s tongue and nose send electrical impulses to the dog’s brain, indicating that it just ate something tasty. In some part of the brain, those signals cause the release of chemicals that induce the dog’s brain to rearrange itself in a way that is quite similar in its effects (though not necessarily in its implementation; I don’t know the details well enough) to the gradient descent that trains the NN. In this sense, the metaphor of giving a dog a biscuit is quite apt, in a way that the metaphor of breeding many dogs is not (in particular, in the gradient descent algorithms used in the ML I’m familiar with, there is usually only one network that improves over time, unlike evolutionary algorithms, which simulate many different agents per training step and select for the “fittest”).
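To make the contrast concrete, here is a minimal sketch (on a made-up toy objective, with arbitrary hyperparameters) of the two regimes: gradient descent nudging a single set of weights step by step, versus an evolutionary loop breeding a population and keeping the fittest.

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET = np.array([1.0, -2.0, 0.5])  # toy "ideal weights"

def loss(w):
    # Stand-in for "how badly the network performs".
    return float(np.sum((w - TARGET) ** 2))

# Gradient descent: ONE set of weights, nudged a little after each step,
# analogous to the single dog whose brain is reshaped by each biscuit.
w = np.zeros(3)
for _ in range(200):
    grad = 2 * (w - TARGET)  # analytic gradient of the toy loss
    w -= 0.05 * grad

# Evolutionary search: a POPULATION of candidates; each generation we keep
# the fittest and mutate them, analogous to breeding many dogs.
pop = rng.normal(size=(20, 3))
for _ in range(200):
    fitness = np.array([loss(p) for p in pop])
    parents = pop[np.argsort(fitness)[:5]]  # select the 5 fittest
    children = np.repeat(parents, 4, axis=0)
    pop = children + rng.normal(scale=0.1, size=children.shape)  # mutate
```

Both loops end up near the target, but only the second one ever holds more than one "dog" at a time.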
One way in which what I just said isn’t completely right is that animals have memories of their entire lifetime (or at least a big chunk of it), spanning all the training events they have experienced, and can use those memories to take better actions, while NNs generally have no memory of previous training runs. However, the primary way the biscuit trick works (I believe) is not through the dog’s memories of having “gotten reward”, but through the more immediate process of reward chemicals being released and reshaping the brain at the moment the reward is received, which closely resembles widely used ML techniques.
(This is related to the advice in habit building that one receive reward as close in time, ideally on the order of milliseconds, to the desired behavior)
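A toy model of that immediacy (everything here is a made-up simplification: one "synapse strength", a logistic response, an arbitrary learning rate): the update fires at the moment the reward lands, with no replaying of remembered episodes.

```python
import math
import random

random.seed(1)

# One "synapse strength" for the sit-on-command behaviour.
strength = 0.0
LEARNING_RATE = 0.2

def dog_sits(strength):
    # Probability of sitting grows with the synapse strength (logistic).
    p = 1 / (1 + math.exp(-strength))
    return random.random() < p

for _ in range(100):
    sat = dog_sits(strength)
    # The update happens right when the biscuit arrives, reinforcing
    # whatever the dog just did -- no memory lookup involved.
    if sat:
        strength += LEARNING_RATE * 1.0  # reward of 1 for sitting
```

After a hundred trials the strength has grown and the dog sits almost every time, even though nothing in the loop ever consults a record of past events.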
Fully agree—if the dog were only trying to get biscuits, it wouldn’t continue to sit later in its life when you are no longer rewarding that behavior. Training dogs is actually some mix of the dog consciously expecting a biscuit and raw updating on the actions previously taken.
Hear sit → Get biscuit → feel good
becomes
Hear sit → Feel good → get biscuit → feel good
becomes
Hear sit → feel good
At which point the dog likes sitting; it even reinforces itself, so you can stop giving biscuits and start training something else.
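That "feel good" creeping backwards along the chain is exactly what temporal-difference learning formalizes. A tabular TD(0) sketch (my own toy state names and hyperparameters, not anything from the thread) shows the cue itself coming to predict reward:

```python
# States in the training chain; "feel_good" carries the primary reward.
states = ["hear_sit", "sit", "get_biscuit", "feel_good"]
values = {s: 0.0 for s in states}
ALPHA, GAMMA = 0.5, 0.9

for _ in range(100):
    # Walk the chain once; reward of 1 arrives only at the final state.
    for i, s in enumerate(states[:-1]):
        nxt = states[i + 1]
        reward = 1.0 if nxt == "feel_good" else 0.0
        # TD(0) update: each state's value creeps toward reward plus the
        # value of what follows, so "feel good" leaks back along the chain.
        values[s] += ALPHA * (reward + GAMMA * values[nxt] - values[s])
```

By the end, `values["hear_sit"]` is high: hearing "sit" is itself valuable to the dog, biscuit or no biscuit.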
Yeah. I do think there’s also the aspect that dogs like being obedient to their humans, and so after it has first learned the habit, there continues to be a reward simply from being obedient, even after the biscuit gets taken away.
I don’t think I can agree with the claim that NNs don’t have memory of previous training runs. It depends a bit on the definition of memory, but the weight distribution certainly stores some information about previous episodes, which could be viewed as memory.
I don’t think memory in animals is much different; the neural network is just much more complex. Memories arise from updates to the network’s structure, just as they do in NNs during RL training.
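A minimal illustration of weights-as-memory (a made-up two-input linear model trained by SGD): the training pairs are discarded after each step, yet the final weights still reproduce them.

```python
import numpy as np

# "Training episodes": input/target pairs the network sees in sequence.
episodes = [(np.array([1.0, 0.0]), 1.0),
            (np.array([0.0, 1.0]), -1.0)] * 50

w = np.zeros(2)
for x, y in episodes:
    # Plain SGD on squared error; the data itself is never stored anywhere.
    w += 0.1 * (y - w @ x) * x

# The episodes are gone, but the weights "remember" them: the model
# reproduces the old targets from the old inputs.
recall_a = w @ np.array([1.0, 0.0])  # close to 1.0
recall_b = w @ np.array([0.0, 1.0])  # close to -1.0
```

In that sense the weight vector is a (lossy, distributed) record of everything the model was trained on, which is the point being made about both NNs and brains.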