similar to how if you read about enough social hacks you’ll probably be a bit scammy even tho you like people and don’t want to scam them
IDK if this is causally true or just evidentially true. I also don’t see why it would be mechanistically relevant to the heuristic you posit.
Rather, I think that agents might end up with this heuristic at first, but over time it would get refined into “try new things which [among other criteria] aren’t obviously going to cause bad value drift away from current values.” One reason I expect the refinement in humans is that noticing your values drifted in a bad way is probably a negative reinforcement event, and so enough exploration-caused negative events might cause credit assignment to refine the heuristic into the shape I listed. This would convergently influence agents to not be reward-optimal, even on known-reachable-states. (I’m not super confident in this particular story porting over to AI, but think it’s a plausible outcome.)
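To make that refinement story concrete, here is a toy sketch (my own illustration, not a model anyone in this thread proposed): a bandit-style “try new things” propensity over a few made-up exploration categories, where noticing exploration-caused value drift delivers a negative reinforcement and credit assignment gradually suppresses the drift-prone categories. All names and numbers (DRIFT_RISK, DRIFT_PENALTY, etc.) are assumptions for illustration only.

```python
# Toy sketch (hypothetical, not from the original comment): a "try new things"
# heuristic that gets refined by credit assignment when exploration leads to
# noticed value drift. All categories and parameters are illustrative guesses.
import random

random.seed(0)

# Categories the agent can explore, with an (unknown to the agent) chance
# that exploring them causes value drift the agent later notices and dislikes.
DRIFT_RISK = {"math": 0.0, "art": 0.05, "social_hacks": 0.6, "crime": 0.8}

# Initial heuristic: uniform propensity to explore anything new.
propensity = {cat: 1.0 for cat in DRIFT_RISK}
LEARNING_RATE = 0.2
DRIFT_PENALTY = -1.0   # negative reinforcement from noticing bad value drift
NOVELTY_REWARD = 0.1   # small positive reinforcement for exploration itself

def sample_category(props):
    """Sample an exploration target proportional to (positive) propensities."""
    cats = list(props)
    weights = [max(props[c], 1e-3) for c in cats]
    return random.choices(cats, weights=weights, k=1)[0]

for step in range(5000):
    cat = sample_category(propensity)
    reinforcement = NOVELTY_REWARD
    if random.random() < DRIFT_RISK[cat]:
        # The agent later notices its values drifted in a bad way;
        # credit assignment traces this back to the exploration choice.
        reinforcement += DRIFT_PENALTY
    propensity[cat] = max(propensity[cat] + LEARNING_RATE * reinforcement, 0.0)

# After enough exploration-caused negative events, the heuristic looks like
# "try new things which aren't obviously going to cause bad value drift".
print({cat: round(p, 2) for cat, p in sorted(propensity.items())})
```

Running it, the propensities for the drift-prone categories collapse toward zero while the safe ones persist, which is the refined heuristic described above.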
If that kind of heuristic is a major underpinning of what we call “curiosity” in humans, then that would explain why I am, in general, not curious about exploring a life of crime, but am curious about math and art and other activities which won’t cause bad value drift away from my current values.
This is a really helpful thread for me; thank you both.
in humans… noticing your values drifted in a bad way is probably a negative reinforcement event
Are you hypothesising a shardy explanation for this? Like: the former, now-dwindled shards get activated for some reason, think ‘what have I done?’, and emit a strong negative reinforcement; maybe they predict low value, and some sort of long-horizon temporal-difference credit assignment kicks in and squashes/weakens/adjusts the newly drifted shards (where the horizon is potentially very long)? Or is it just that this is a thing in humans in particular somehow?
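For concreteness, here is a minimal toy of the kind of long-horizon temporal-difference credit assignment the question gestures at (my own sketch, not something either commenter proposed): a delayed ‘what have I done?’ negative reinforcement, arriving many steps after the drift-causing choice, still gets credited back to that choice via TD(λ) eligibility traces. The horizon, discount, and trace parameters are arbitrary assumptions.

```python
# Minimal toy sketch (illustrative assumptions throughout) of long-horizon
# TD(lambda) credit assignment: a negative reinforcement that arrives many
# steps after the drift-causing choice still adjusts the value of that choice.
HORIZON = 50          # delay between the drift-causing choice and the objection
GAMMA = 0.99          # discount factor
LAMBDA = 0.95         # trace decay; high value lets credit reach far back
ALPHA = 0.5           # learning rate
DRIFT_SIGNAL = -1.0   # negative reinforcement when the old shards object

# State values along a chain: state 0 is "just made the drift-causing choice",
# state HORIZON is "old shards activate and object".
values = [0.0] * (HORIZON + 1)

for episode in range(200):
    traces = [0.0] * (HORIZON + 1)
    for t in range(HORIZON):
        reward = DRIFT_SIGNAL if t == HORIZON - 1 else 0.0
        td_error = reward + GAMMA * values[t + 1] - values[t]
        traces[t] += 1.0  # accumulating eligibility trace for the visited state
        for s in range(HORIZON + 1):
            values[s] += ALPHA * td_error * traces[s]
            traces[s] *= GAMMA * LAMBDA

# State 0 (the drift-causing choice) ends up with negative value, i.e. the
# delayed objection has been credited back to the original decision.
print(round(values[0], 3))
```

The high λ is what lets credit reach that far back; with λ near zero the early choice would barely be touched, which is one way to read the “the horizon is potentially very long?” worry.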