A Short Dialogue on the Meaning of Reward Functions

Produced as part of the SERI ML Alignment Theory Scholars Program—Winter 2022 Cohort.

The following is a short Slack dialogue between Leon Lang, Quintin Pope, and Peli Grietzer that emerged as part of the SERI-MATS stream on shard theory. Alex Turner encouraged us to share it.

To follow the dialogue, it helps to have read Reward is not the optimization target beforehand.

Leon:

I was thinking about what the “limits” of the idea “reward is not the optimization target” are. Playing devil’s advocate: Maybe the reward doesn’t matter at all! We spend all this time thinking about how to design reward functions that somehow capture our values. But maybe the only thing we need to do is have some kind-of-dumb reward, and an environment where the agent can only reach this reward by acting in accordance with our values.

E.g.: We might reward the agent for winning some board game which is (partially) cooperative. But then in the environment, real people only cooperate with the agent if it displays cooperative/friendly/nice behavior, and other behaviors are a losing strategy. Never did the reward itself say anything about being cooperative, but cooperative behavior still emerged from the structure of the environment, and so the agent ends up pretty nice.

Someone might argue against this view: Being nice is, in this environment, only instrumentally valuable for achieving reward, and a smart agent knows this, plays nice to win games and, once it has enough power, will overthrow humanity. This is a version of the problem of deceptive alignment.
However, I would argue (still playing devil’s advocate): The agent’s learning process doesn’t care about whether something is an instrumental value or the actual reward! Friendly behavior will be reinforced just as much, and the trained agent doesn’t necessarily reason about whether something is instrumentally or finally valuable; it just wants to be nice if niceness was reinforced.

Not playing devil’s advocate anymore, I’m a bit skeptical of this type of reasoning. Humans are generally pretty social, but we also know that powerful humans often lose quite a lot of their morality.

Peli:

I like this a lot! I think there’s a novel and true insight here, which I’d maybe paraphrase as ‘the reinforcement schedule is a function of the reward function and environment together, and it’s the reinforcement schedule that matters.’

Quintin:

The way I’d put it is not “Maybe the reward doesn’t matter at all!”, but “the labels you assign your reward function do not matter at all”. You say “Never did the reward itself say anything about being cooperative”, but what physical fact does this correspond to? You have a physical system (the one that implements the reward function) whose actual behavior is to reinforce cognition that leads to cooperative behavior.
If you look at the implementation of the reward function, there isn’t any sort of classifier that checks if the agent’s behavior seems cooperative and only lets the reward event occur if the agent seems to have been sufficiently cooperative. You can take this fact and say that the reward isn’t “about” being cooperative. But the property of the reward being “about” something is a hanging reference without basis in the actual computational behavior of the reward implementation.
E.g., imagine we were to replace the “winning board game” reward with a “win the game AND be nice” reward. However, by assumption, this doesn’t change the distribution of game trajectories × rewards at all, since winning always implies being nice. We’ve switched what the reward is “about” without actually changing anything about its functional behavior.
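To make this concrete, here is a minimal, self-contained Python sketch of that substitution; the toy trajectory representation and its won/nice fields are illustrative assumptions, not part of any real training setup:

```python
def reward_win_only(trajectory: dict) -> float:
    # Rewards winning; says nothing about niceness.
    return 1.0 if trajectory["won"] else 0.0

def reward_win_and_nice(trajectory: dict) -> float:
    # Nominally "about" both winning and niceness.
    return 1.0 if trajectory["won"] and trajectory["nice"] else 0.0

# Suppose the environment only ever produces trajectories in which winning
# implies niceness (people refuse to cooperate with non-nice agents, so
# non-nice agents lose):
reachable_trajectories = [
    {"won": True,  "nice": True},
    {"won": False, "nice": True},
    {"won": False, "nice": False},
]

# The two reward functions then agree on every trajectory the learning
# process can actually encounter, so they produce identical reinforcement
# events despite differing in what they are nominally "about".
assert all(
    reward_win_only(t) == reward_win_and_nice(t)
    for t in reachable_trajectories
)
```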

Another way of putting it: there is no Cartesian boundary between the thing you call the reward function and the thing you call the environment. If your environment constrains events such that only nice behavior gets rewarded, why not call that aspect of environmental dynamics part of the “reward function”? The computations that causally influence which sorts of cognition become more frequent over time need not be concentrated in the thing you have labeled “reward function”.

Peli:

There’s probably still a useful distinction to be made purely from the viewpoint of software modularity, in a context where we’re experimenting with training setups that combine different environments and reward functions.

Quintin:

Yes, it’s sometimes useful to factor reality into more easily managed buckets. However, doing so can lead to confusion when you let causal influence flow from the labels you assign back to the physical systems they label.

Quoting Peli: “the reinforcement schedule is a function of the reward function and environment together, and it’s the reinforcement schedule that matters.”

I’d also note that the system’s exploration policy matters a lot. Maybe doing cocaine is very rewarding, and maybe the system is in an environment where it can easily do cocaine, but if the system decides not to explore doing cocaine, it won’t experience that reinforcement event.

(Of course, the exploration policy is also part of the environment, as is the system being trained.)
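Here is a toy sketch of the exploration point, assuming a two-armed bandit with a purely greedy (zero-exploration) learner; the action names and reward numbers are illustrative only:

```python
rewards = {"work": 1.0, "cocaine": 10.0}   # what the environment would pay out
q_values = {"work": 0.5, "cocaine": 0.0}   # the learner's current estimates
learning_rate = 0.1

def greedy_action(q):
    # A purely greedy policy: always pick the current best estimate,
    # never explore anything else.
    return max(q, key=q.get)

for _ in range(1000):
    action = greedy_action(q_values)
    reward = rewards[action]
    # Standard incremental update, applied only to the action actually taken.
    q_values[action] += learning_rate * (reward - q_values[action])

print(q_values)  # roughly {'work': 1.0, 'cocaine': 0.0}
```

Because the policy never samples “cocaine”, that reinforcement event never occurs and the corresponding value estimate never moves, no matter how large the environment’s reward for it would have been.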