Comparing reward learning/reward tampering formalisms
Contrasting formalisms
Here I’ll contrast the approach we’re using in Pitfalls of Learning a Reward Online (summarised here), with that used by Tom Everitt and Marcus Hutter in the conceptually similar Reward Tampering Problems and Solutions in Reinforcement Learning. In the following, histories are sequences of actions and observations ; thus . The agent’s policy is given by , the environment is given by .
Then the causal graph for the “Pitfalls” approach is, in plate notation (which basically means that, for every value of from to , the graph inside the rectangle is true):
The is the set of reward functions (mapping “complete” histories of length to real numbers), the tells you which reward is correct, conditional on complete histories, and is the final reward.
In order to move to the reward tampering formalism, we’ll have to generalise the and , just a bit. We’ll allow to take partial histories - shorter than - and return a reward. Similarly, we’ll generalise to a conditional distribution on , conditional on all histories , not just on complete histories.
This leads to the following graph:
This graph is now general enough to include the reward tampering formalism.
States, data, and actions
In the reward tampering formalism, “observations” () decompose into two pieces: states () and data (). The idea is that data informs you about the reward function, while states get put into the reward function to get the actual reward.
So we can model this as this causal graph (adapted from graph 10b, page 22; this is a slight generalisation, as I haven’t assumed Markovian conditions):
Inside the rectangle, the histories split into data (), states (), and actions (). The reward function is defined by the data only, while the reward comes from this reward function and from the states only—actions don’t directly affect these (though they can indirectly affect them by deciding what states and data come up, of course). Note that in the reward tampering paper, the authors don’t distinguish explicitly between and , but they seem to do so implicitly.
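As a rough illustration of that split, here is a minimal Python sketch. Everything in it is a toy assumption made for illustration (the `(action, state, data)` triples, the two candidate reward functions, the majority-vote rule); none of it comes from either paper. It only shows the dependency structure: the reward function is chosen from the data history alone, and the reward is that function applied to the state history alone.

```python
from typing import Callable, List, Sequence, Tuple

State = int
Data = str
Action = str
RewardFunction = Callable[[Sequence[State]], float]
History = Sequence[Tuple[Action, State, Data]]


# Hypothetical candidate reward functions, defined on state histories only.
def count_positive(states: Sequence[State]) -> float:
    return float(sum(1 for s in states if s > 0))


def total(states: Sequence[State]) -> float:
    return float(sum(states))


def split_history(history: History) -> Tuple[List[Action], List[State], List[Data]]:
    """Split a history into its action, state and data components."""
    actions = [a for a, _, _ in history]
    states = [s for _, s, _ in history]
    data = [d for _, _, d in history]
    return actions, states, data


def reward_function_from_data(data: Sequence[Data]) -> RewardFunction:
    """Toy rule (an assumption): a strict majority of 'total' entries selects `total`."""
    votes_for_total = sum(1 for d in data if d == "total")
    return total if 2 * votes_for_total > len(data) else count_positive


def reward(history: History) -> float:
    """Reward = (reward function chosen from data) applied to (states)."""
    _actions, states, data = split_history(history)
    return reward_function_from_data(data)(states)
```

In this sketch, changing only the actions in a history leaves the reward unchanged, whereas changing the data can switch which reward function is applied to the same states.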
Finally, is the “user’s reward function”, which the agent is estimating via ; this connects to the data only.
Almost all of the probability distributions at each node are “natural” ones that are easy to understand. For example, there are arrows into (the reward) from (the reward function) and (the states history); the “conditional distribution” of is just “apply to ”. The environment, action, and history naturally provide the next observations (state and data).
Two arrows point to more complicated relations: the arrow from to , and that from to . The two are related; the data is supposed to tell us about the user’s true reward function, while this information informs the choice of .
But the fact that the nodes and the probability distribution have been “designed” this way doesn’t affect the agent. It has a fixed process for estimating from ( stands for the probability function for the reward tampering formalism). It has access to , , and (and their histories) as well as its own policy, but has no direct access to or .
In fact, from the agent’s perspective, is essentially part of , the environment, though focusing on the only.
States and actions in “Pitfalls” formalism
Now, can we put this into the “Pitfalls” formalism? It seems we can, like so:
All conditional probability distributions in this graph are natural.
This graph looks very similar to the “reward tampering” one, with the exception of and , pointing at and respectively.
In fact, play the role of in that, for the probability distribution for the learning process,
Note that in that expression is natural and simple, while is complex; essentially carries the same information as .
The environment of the learning process plays the same role as the combined and from the reward tampering formalism.
So the isomorphism between the two approaches is, informally speaking:
On reward functions conditional on histories, .
.
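One way to picture the "simple versus complex" contrast is the following Python sketch. It assumes a toy setting that appears in neither paper: two named reward functions and a made-up learning process `toy_rho`. The point is only that once a learning process ρ is passed in as an argument, the conditional probability of a reward function given the data has nothing left to decide; all the contingent structure sits inside ρ.

```python
from typing import Callable, Dict, Sequence

Data = str
Distribution = Dict[str, float]  # reward-function name -> probability
LearningProcess = Callable[[Sequence[Data]], Distribution]


def p_lp(reward_name: str, data: Sequence[Data], rho: LearningProcess) -> float:
    """Probability of a reward function given data and rho: just apply rho."""
    return rho(data).get(reward_name, 0.0)


def toy_rho(data: Sequence[Data]) -> Distribution:
    """Hypothetical learning process: a Laplace-smoothed count of 'a' entries."""
    n_a = sum(1 for d in data if d == "a")
    p = (n_a + 1) / (len(data) + 2)
    return {"R_a": p, "R_b": 1.0 - p}
```

Here `p_lp` is a one-liner for any choice of ρ, while a function that estimates the reward function from data without being handed ρ would have to build rules like those in `toy_rho` into itself.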
Uninfluenceable similarities
If we make the processes uninfluenceable (a concept that exists for both formalisms), the causal graphs look even more similar:
Here the pair , for the learning process, play exactly the same role as the pair[1] , for reward tampering: determining reward functions and observations.
[1] There is an equivalence between the pairs, but not between the individual elements; thus carries more information than , while carries less information than .
It would be nice to draw out this distinction in more detail. One guess:
Uninfluenceability seems similar to requiring zero individual treatment effect of D on R.
Riggability (from the paper) would then correspond to zero average treatment effect of D on R.
Stuart, by “$P_{rt}(R \mid D_{1:j})$ is complex”, are you referring to their using $R = R(\cdot, E[\Theta^R_* \mid D_{1:j}])$ as the estimated reward function?
Also, what did you think of their argument that their agents have no incentive to manipulate their beliefs because they evaluate future trajectories based on their current beliefs about how likely they are? Does that suffice to implement eq. (1) from your motivated value selection paper?
I mean that defining $P_{rt}$ can be done in many different ways, and hence has a lot of contingent structure. In contrast, in $P_{lp}(R \mid D_{1:j}, \rho)$, the $\rho$ is a complex distribution on $R$, conditional on $D_{1:j}$; hence $P_{lp}$ itself is trivial and just encodes “apply $\rho$ to $R$ and $D_{1:j}$ in the obvious way”.