I think the AI will very probably have a spread of situationally-activated computations which steer its actions towards historical reward-correlates (e.g. if near a person, then tell a joke), and probably not singularly value e.g. making people smile or reward
I agree. My recent write-up is partly an attempt to model this dynamic in a toy causal-graph environment. Most relevantly, this section.
Imagine an environment represented as a causal graph, with some action-nodes a that an agent can set, observation-nodes whose values the agent can read off, and some reward-node r whose value determines how much reinforcement the agent gets. The agent starts with no information about the environment's structure or state. If the reward-node is sufficiently distant from its action-nodes, it'll take time for the agent's world-model to become advanced enough to model it. However, the agent will start trying to develop good policies/heuristics for increasing the reward immediately. Thus, its initial policies will necessarily act on proxies: they'll focus on the values of some intermediate nodes between its action-nodes and the reward-node.
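To make that concrete, here's a minimal sketch of the kind of environment I have in mind, assuming a simple linear chain a → x_p → x_k → x_e → r with additive noise (the node names and the noise model here are my own illustration, not the exact setup from the write-up):

```python
import random

class ChainEnvironment:
    """Toy causal chain: a -> x_p -> x_k -> x_e -> r.

    The agent sets the action node `a` and can read off the intermediate
    (observation) nodes; the reward node `r` sits several causal steps
    away from the action node, so the agent has to learn its way to it.
    """

    def __init__(self, noise: float = 0.1, seed: int = 0):
        self.noise = noise
        self.rng = random.Random(seed)

    def step(self, a: float) -> dict:
        # Each node is a noisy copy of its parent, so the action's
        # influence on r has to propagate through the whole chain.
        x_p = a + self.rng.gauss(0.0, self.noise)
        x_k = x_p + self.rng.gauss(0.0, self.noise)
        x_e = x_k + self.rng.gauss(0.0, self.noise)
        r = x_e
        return {"x_p": x_p, "x_k": x_k, "x_e": x_e, "r": r}

env = ChainEnvironment()
print(env.step(a=1.0))  # the agent initially only "understands" the nodes nearest to a
```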
And these proxies can be quite good. For example:
x_p is a good proxy for controlling the value of r if the x_k, x_e chain doesn't perturb it too much. So an agent that only cares about the environment up to x_p can capture e.g. X% > 90% of the possible maximum reward.
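As a quick numerical illustration of that "X% > 90%" claim (the bounded values, noise level, and proxy policy here are my own toy choices, not numbers from the write-up), a policy that just drives x_p to its maximum already captures most of the attainable reward:

```python
import random

def clip01(v: float) -> float:
    return max(0.0, min(1.0, v))

def rollout(a: float, rng: random.Random, noise: float = 0.1) -> float:
    # Bounded chain x_p -> x_k -> x_e -> r with additive noise at each step.
    x_p = clip01(a)
    x_k = clip01(x_p + rng.gauss(0.0, noise))
    x_e = clip01(x_k + rng.gauss(0.0, noise))
    return x_e  # r = x_e

rng = random.Random(0)
trials = 100_000
# Proxy policy: only cares about maximizing x_p, i.e. always plays a = 1.
mean_r = sum(rollout(1.0, rng) for _ in range(trials)) / trials
max_r = 1.0  # best possible value of r in this bounded chain
print(f"proxy policy captures ~{100 * mean_r / max_r:.1f}% of the max reward")
```

With these numbers the proxy-only policy lands somewhere north of 90% of the maximum, despite never modeling anything past x_p.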
It feels like this shouldn't matter: once the world-model is advanced enough to include r directly, the agent should just recognize r as the source of reinforcement, and optimize it directly.
But suppose the heuristics the agent develops have "friction". That is: once a heuristic has historically performed well enough, the agent is reluctant to replace it with a better but more novel (and therefore untested) one. Or, at least, the less counterfactual reward the new heuristic promises to deliver, the less willing the agent is to switch. So a novel heuristic that performs 10x as well as the current one will be able to win out against even a much older heuristic, but a novel heuristic that only performs 1.1x as well won't.
In this case, the marginally more effective policy will not be able to displace a more established one.
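One way to cash out "friction" (purely my own formalization of the rule of thumb above, with made-up constants): only replace an incumbent heuristic if the challenger's estimated performance beats it by a margin that grows with how established the incumbent is.

```python
from dataclasses import dataclass

@dataclass
class Heuristic:
    name: str
    mean_reward: float     # estimated reward when this heuristic fires
    track_record: int = 0  # how many times it has been reinforced

def should_replace(incumbent: Heuristic, challenger: Heuristic,
                   friction: float = 0.05) -> bool:
    """Replace only if the challenger's edge outweighs the incumbent's
    established track record, scaled by an architecture-dependent
    `friction` constant."""
    required_ratio = 1.0 + friction * incumbent.track_record
    return challenger.mean_reward >= required_ratio * incumbent.mean_reward

old = Heuristic("optimize x_p", mean_reward=0.92, track_record=100)
print(should_replace(old, Heuristic("optimize r", mean_reward=1.0)))   # False: only ~1.1x better
print(should_replace(old, Heuristic("optimize r'", mean_reward=9.2)))  # True: ~10x better
```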
(An alternate view: suppose the agent has two mutually-exclusive heuristics on what to do in a given situation, A and B. A has a good track record, B is new, but the agent is willing to try B out. Suppose it picks A with probability p and B with probability 1−p, with p proportional to how long A's track record is. If the reinforcement B receives is much larger than the reinforcement A receives, then even a rarely-picked B will eventually outpace A. If it's not much larger, however, then A will be able to "keep up" with B by virtue of being picked more often, and eventually outrace B into irrelevancy.)
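And a small simulation of that alternate view (again just a sketch; the specific update rule, where a heuristic's "track record" grows by the reinforcement it receives and selection probability tracks it, is my own assumption):

```python
import random

def race(reward_a: float, reward_b: float, steps: int = 10_000, seed: int = 0):
    """A's and B's track records grow by the reinforcement they receive;
    A is picked with probability proportional to its track record."""
    rng = random.Random(seed)
    record_a, record_b = 100.0, 1.0  # A starts with the much longer track record
    for _ in range(steps):
        p_a = record_a / (record_a + record_b)
        if rng.random() < p_a:
            record_a += reward_a
        else:
            record_b += reward_b
    return record_a, record_b

print(race(reward_a=1.0, reward_b=10.0))  # B's record eventually dwarfs A's
print(race(reward_a=1.0, reward_b=1.1))   # A keeps the lead by being picked more often
```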
Therefore: Yes, the agent will end up optimized for good performance on some proxies of "the human presses the button". What these proxies are depends on the causal structure of the environment, the percentage X of the maximum reward that optimizing for them lets the agent capture, and some "friction" value that depends on the agent's internal architecture.
Major caveat: This only really holds for less-advanced systems: those that are optimized, but do not yet optimize at the strategic level. A hedonist wrapper-mind would have no problem evaluating whether the new heuristic is actually better, testing it out, and implementing it, no matter how novel it is compared to the established one.
Caveat to the caveat: Such strategic thinking will probably appear after the “values” have already been formed, and at that point the agent will do deceptive alignment to preserve them, instead of self-modifying into a reward-maximizer.
Route warning: This doesn’t mean the agent’s proxies will be friendly or even comprehensible to us. In particular, if the reward structure is
Something makes me smile → I smile → I press the button
then it's about as likely (which is to say, not very likely at all) that the agent will end up focusing on "I smile" as on "I press the button", since there's basically just a single causal step between them. Much more likely is that it'll come to value some stuff upstream of "something makes me smile"; possibly very strange stuff.
Note: Using the “antecedent-computation-reinforcer” term really makes all of this clearer, but it’s so unwieldy. Any ideas for coining a better term?