Here is an example story I wrote (lightly edited by TurnTrout) about how an agent trained by RL could plausibly not optimize reward, forsaking actions that it knew during training would get it high reward. I found it useful as a way to understand his views, and he has signed off on it. To be clear, this is not his proposal for why everything is fine, nor is it necessarily an accurate representation of my views; it is just a plausible-to-TurnTrout story for how agents won’t end up wanting to game human approval:
Agent gets trained on a reward function that’s 1 if it gets human approval, 0 otherwise (or something).
During an intermediate amount of training, the agent’s honest and nice computations get reinforced by reward events.
That means it develops a motivation to act honestly and behave nicely etc., and no similarly strong motivation to gain human approval at all costs.
The agent then becomes able to tell that if it tricked the human, that would be reinforced.
It then decides to not get close in action-space to tricking the human, so that it doesn’t get reinforced into wanting to gain human approval by tricking the human.
This works because:
it’s enough action hops away and/or a small enough part of the space that epsilon-greedy strategies would be very unlikely to push it into the deception mode.
smarter exploration strategies will depend on the agent’s value function to know which states are more or less promising to explore (e.g. something like Thompson sampling), and the agent really disvalues deceiving the human, so that doesn’t get reinforced (a toy sketch of both bullets follows below).
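To put rough numbers on those two bullets, here is a toy sketch I made up (not part of TurnTrout’s picture; the action names, values, and probabilities are all invented):

```python
import math

# Toy illustration of the two bullets above. Every number here is made up.

# Bullet 1: under epsilon-greedy exploration, reaching the deceptive behaviour
# requires taking a specific "off-policy" action at each of several consecutive
# steps, so the chance of stumbling into it is (epsilon / n_actions) ** hops.
def p_epsilon_greedy_reaches_deception(epsilon=0.05, n_actions=10, hops=4):
    return (epsilon / n_actions) ** hops

# Bullet 2: a value-guided explorer (Thompson-sampling-flavoured) samples
# actions roughly in proportion to how promising the agent's own value
# estimates say they are, so a strongly disvalued action is almost never tried.
def p_value_guided_tries(values, target, temperature=1.0):
    weights = {a: math.exp(v / temperature) for a, v in values.items()}
    return weights[target] / sum(weights.values())

agent_values = {
    "answer_honestly": 0.9,
    "explain_reasoning": 0.7,
    "stay_silent": 0.1,
    "step_towards_tricking_human": -10.0,  # strongly disvalued under current values
}

print(p_epsilon_greedy_reaches_deception())  # 6.25e-10: effectively never explored
print(p_value_guided_tries(agent_values, "step_towards_tricking_human"))  # ~8e-6 per step
```

Obviously the real claim is about learned policies in huge state spaces, not four hand-picked numbers; the sketch is just meant to show why the two exploration regimes treat a strongly disvalued region of action-space so differently.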
One reason that I doubt this story is that “try new things in case they’re good” is itself the sort of thing that should be reinforced during training on a complicated environment, and would push towards some sort of obfuscated manipulation of humans (similar to how if you read about enough social hacks you’ll probably be a bit scammy even tho you like people and don’t want to scam them). In general, this motivation will push RL agents towards reward-optimal behaviour on the distribution of states they know how to reach and handle.
similar to how if you read about enough social hacks you’ll probably be a bit scammy even tho you like people and don’t want to scam them
IDK if this is causally true or just evidentially true. I also don’t know why it would be mechanistically relevant to the heuristic you posit.
Rather, I think that agents might end up with this heuristic at first, but over time it would get refined into “try new things which [among other criteria] aren’t obviously going to cause bad value drift away from current values.” One reason I expect the refinement in humans is that noticing your values drifted in a bad way is probably a negative reinforcement event, and so enough exploration-caused negative events might cause credit assignment to refine the heuristic into the shape I listed. This would convergently influence agents to not be reward-optimal, even on known-reachable-states. (I’m not super confident in this particular story porting over to AI, but think it’s a plausible outcome.)
If that kind of heuristic is a major underpinning of what we call “curiosity” in humans, then that would explain why I am, in general, not curious about exploring a life of crime, but am curious about math and art and other activities which won’t cause bad value drift away from my current values.
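Here is a toy sketch of that refinement story (entirely made up by me: the activities, drift numbers, and update rule are invented, just to show the shape of the claim):

```python
import random

# The activities and numbers below are invented; the update rule is just one
# way the "negative reinforcement on noticed value drift" story could refine
# an initially undifferentiated curiosity heuristic.

candidates = ["new_math_topic", "new_art_medium", "petty_crime", "read_scam_playbooks"]

# How much bad value drift each activity turns out to cause (a stand-in for
# "noticing your values drifted"), and the agent's propensity to explore it.
value_drift = {"new_math_topic": 0.0, "new_art_medium": 0.1,
               "petty_crime": 0.9, "read_scam_playbooks": 0.7}
explore_propensity = {c: 1.0 for c in candidates}  # undifferentiated "try new things"

LEARNING_RATE = 0.5

def explore_once():
    """Sample an activity in proportion to current propensities; if the episode
    produced noticeable bad value drift, treat it as a negative reinforcement
    event and let credit assignment downweight that propensity."""
    choice = random.choices(candidates,
                            weights=[explore_propensity[c] for c in candidates])[0]
    drift = value_drift[choice]
    if drift > 0.5:  # the "what have I done?" moment
        explore_propensity[choice] *= 1 - LEARNING_RATE * drift
    return choice

for _ in range(200):
    explore_once()

# Curiosity stays high for drift-safe activities (math, art) and gets carved
# away from drift-inducing ones: the refined heuristic "try new things which
# aren't obviously going to cause bad value drift away from current values".
print({c: round(p, 3) for c, p in explore_propensity.items()})
```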
This is a really helpful thread, for me, thank you both.
in humans… noticing your values drifted in a bad way is probably a negative reinforcement event
Are you hypothesising a shardy explanation for this? Like: formerly strong, now-dwindled shards get activated for some reason, think ‘what have I done?’, and emit a strong negative reinforcement; maybe they predict low value, and some sort of long-horizon temporal-difference credit assignment kicks in and squashes/weakens/adjusts the newly drifted shards (the horizon is potentially very long?). Or is this just a thing in humans in particular somehow?
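To check I’m parsing the hypothesis, here is one toy way to write it down (my own construction, not a claim about how this actually works; the shard names, activations, and hyperparameters are all invented):

```python
# Shard names, activations, and hyperparameters below are all made up; this is
# just one way to write the hypothesis down, not a claim about the mechanism.

GAMMA, LAMBDA, LR = 0.99, 0.95, 0.02   # long horizon: eligibility decays slowly

shard_weights = {"honesty": 1.0, "niceness": 1.0, "social_hacking": 0.6}
eligibility = {name: 0.0 for name in shard_weights}

# Shard activations over a long drift-y episode: the newly grown shard does
# most of the work while the older shards sit mostly dormant.
episode_activations = [{"honesty": 0.05, "niceness": 0.05, "social_hacking": 0.9}] * 50

for activations in episode_activations:
    for name, a in activations.items():
        eligibility[name] = GAMMA * LAMBDA * eligibility[name] + a

# The dormant older shards re-activate ("what have I done?") and emit a strong
# negative reinforcement; TD(lambda)-style credit assignment then adjusts each
# shard in proportion to its eligibility, i.e. it mostly squashes the drifted one.
td_error = -1.0
for name in shard_weights:
    shard_weights[name] += LR * td_error * eligibility[name]

print({name: round(w, 3) for name, w in shard_weights.items()})
# social_hacking drops sharply; honesty and niceness barely move.
```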
Hard to say how strongly a decision-heuristic that says “try new things in case they’re good” will measure up against the countervailing “keep doing the things you know are good” (or even a conservative extension to it, like “try new things if they’re sufficiently similar to things you know are good”). The latter would seemingly also be reinforced if it were considered. I do not feel confident reasoning about abstract things like these yet.
smarter exploration strategies will depend on the agent’s value function
I think this is plausible but overconfident.
FWIW I think with moderate confidence that smarter exploration strategies are fundamental to advanced agency: I think of things like play, ‘deliberate exploration’, experiment design, goal-backchaining, and so on. Mainly because epsilon exploration is scuppered by sparse rewards and real-world dynamics are super-duper highly-branching.
I also think we’ve barely scratched the surface of understanding exploration, though there are some interesting directions like EMPA[1], VariBAD[2], HER[3], and older stuff like pseudocount-based and prediction-error-based ‘curiosity’.
If humans (and/or supervised speedups of humans or similar) can provide dense signals, this claim is weaker, but I think the key problem for AGI learning is OOD dense signals, and I don’t think humans are capable of safe/accurate OOD dense reward/value signals.
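For concreteness, here is a stripped-down toy of the prediction-error-based ‘curiosity’ idea mentioned above (my own illustration: the linear dynamics, linear forward model, random policy, and all numbers are made up):

```python
import numpy as np

# Everything here is a made-up toy: linear dynamics, a linear forward model,
# and a random policy. The point is just the shape of the intrinsic signal:
# the bonus is the forward model's prediction error, so it is large in parts
# of the dynamics the agent hasn't modelled yet and shrinks as the model learns.

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM = 4, 2

W_true = 0.25 * rng.normal(size=(STATE_DIM, STATE_DIM + ACTION_DIM))  # "true" dynamics
W_model = np.zeros((STATE_DIM, STATE_DIM + ACTION_DIM))               # learned forward model
LR = 0.05

def intrinsic_reward(s, a, s_next):
    """Curiosity bonus = squared prediction error; also update the model online."""
    global W_model
    x = np.concatenate([s, a])
    error = s_next - W_model @ x
    W_model += LR * np.outer(error, x)
    return float(error @ error)

s = rng.normal(size=STATE_DIM)
bonuses = []
for t in range(300):
    a = rng.normal(size=ACTION_DIM)     # random policy, just to expose the bonus
    s_next = W_true @ np.concatenate([s, a])
    bonuses.append(intrinsic_reward(s, a, s_next))
    s = s_next

for k in range(3):
    block = bonuses[100 * k: 100 * (k + 1)]
    print(f"mean curiosity bonus, steps {100*k}-{100*k + 99}: {sum(block)/len(block):.4f}")
```

The more interesting methods cited above add learned state representations, pseudocounts, or Bayes-adaptive planning on top of this basic “reward yourself where your model is wrong” signal; the sketch is only the skeleton.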
[1] Tsividis et al., “Human-Level Reinforcement Learning through Theory-Based Modeling, Exploration, and Planning”
[2] Zintgraf et al., “VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning”
[3] Andrychowicz et al., “Hindsight Experience Replay”