DanielFilan comments on Reward is not the optimization target

DanielFilan 13 Sep 2022 3:22 UTC
LW: 4 AF: 3
One reason that I doubt this story is that “try new things in case they’re good” is itself the sort of thing that should be reinforced during training on a complicated environment, and would push towards some sort of obfuscated manipulation of humans (similar to how if you read about enough social hacks you’ll probably be a bit scammy even tho you like people and don’t want to scam them). In general, this motivation will push RL agents towards reward-optimal behaviour on the distribution of states they know how to reach and handle.
- TurnTrout 13 Sep 2022 18:47 UTC
  LW: 3 AF: 3
  > similar to how if you read about enough social hacks you’ll probably be a bit scammy even tho you like people and don’t want to scam them
  IDK if this is causally true or just evidentially true. I also don’t know why it would be mechanistically relevant to the heuristic you posit.
  Rather, I think that agents might end up with this heuristic at first, but over time it would get refined into “try new things which [among other criteria] aren’t obviously going to cause bad value drift away from current values.” One reason I expect the refinement in humans is that noticing your values drifted in a bad way is probably a negative reinforcement event, and so enough exploration-caused negative events might cause credit assignment to refine the heuristic into the shape I listed. This would convergently influence agents to not be reward-optimal, even on known-reachable-states. (I’m not super confident in this particular story porting over to AI, but think it’s a plausible outcome.)
  If that kind of heuristic is a major underpinning of what we call “curiosity” in humans, then that would explain why I am, in general, not curious about exploring a life of crime, but am curious about math and art and other activities which won’t cause bad value drift away from my current values.
  - Oliver Sourbut 30 Sep 2022 15:24 UTC
    3 points
    This is a really helpful thread, for me, thank you both.
    
    > in humans… noticing your values drifted in a bad way is probably a negative reinforcement event
    
    Are you hypothesising a shardy explanation for this (like, former, now-dwindled shards get activated for some reason, think ‘what have I done?’; they emit a strong negative reinforcement—maybe they predict low value and some sort of long-horizon temporal-difference credit assignment kicks in...? And squashes/weakens/adjusts the newly drifted shards...? (The horizon is potentially very long?)) Or just that this is a thing in humans in particular somehow?
- cfoster0 13 Sep 2022 5:20 UTC
  1 point
  Hard to say how strongly a decision-heuristic that says “try new things in case they’re good” will measure up against the countervailing “keep doing the things you know are good” (or even a conservative extension to it, like “try new things if they’re sufficiently similar to things you know are good”). The latter would seemingly also be reinforced if it were considered. I do not feel confident reasoning about abstract things like these yet.
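
One toy way to make the competing heuristics in this thread concrete is an epsilon-greedy bandit. The sketch below is only an illustration under stated assumptions, not anything the commenters proposed: `run_bandit`, the arm means, and the `explorable` filter are all hypothetical. Arm 0 stands in for "the things you know are good", arm 2 is a higher-reward option the agent has never tried, and restricting `explorable` is a crude stand-in for a refined heuristic that refuses to try certain new things.

```python
import random


def run_bandit(arm_means, epsilon, explorable=None, steps=2000, seed=0):
    """Epsilon-greedy agent on a Gaussian bandit (illustrative toy only).

    The agent starts out already knowing the value of arm 0. With probability
    `epsilon` it tries a random arm from `explorable`; otherwise it repeats
    the arm with the highest current estimate.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    explorable = list(range(n_arms)) if explorable is None else list(explorable)

    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    counts[0] = 1                  # arm 0 is the known-good behaviour
    estimates[0] = arm_means[0]

    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # "try new things in case they're good"
            arm = rng.choice(explorable)
        else:
            # "keep doing the things you know are good"
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = rng.gauss(arm_means[arm], 0.1)
        counts[arm] += 1
        # Incremental running-mean update of the value estimate.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total / steps, counts


arm_means = [0.5, 0.2, 1.0]  # hypothetical rewards; arm 2 is best but untried

greedy_avg, greedy_counts = run_bandit(arm_means, epsilon=0.0)
explorer_avg, explorer_counts = run_bandit(arm_means, epsilon=0.1)
filtered_avg, filtered_counts = run_bandit(arm_means, epsilon=0.1, explorable=[0, 1])

print("never explores:   ", round(greedy_avg, 3), greedy_counts)
print("explores anything:", round(explorer_avg, 3), explorer_counts)
print("filtered explorer:", round(filtered_avg, 3), filtered_counts)
```

In this toy setup the unrestricted explorer ends up spending most of its time on the highest-reward arm, the pure exploiter never leaves arm 0, and the filtered explorer never discovers arm 2 even though it could reach it. Nothing here models how such heuristics would actually be reinforced against each other during training, which is the open question in the thread.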