Reward/value learning for reinforcement learning
A putative new idea for AI control; index here.
Along with Jan Leike and Laurent Orseau, I’ve been working to formalise many of the issues with AIs learning human values.
I’ll be presenting part of this at NIPS and the whole of it at some later conference, so it seems best to formulate the whole problem in the reinforcement learning formalism. The results can generally be reformulated easily for more general systems (including expected utility).
POMDP
A partially observable Markov decision process without reward function (POMDP\R), $\mu = (\mathcal{S}, \mathcal{A}, \mathcal{O}, T, O, T_0)$, consists of:
a finite set of states $\mathcal{S}$,
a finite set of actions $\mathcal{A}$,
a finite set of observations $\mathcal{O}$,
a transition probability distribution $T : \mathcal{S} \times \mathcal{A} \to \Delta\mathcal{S}$,
an observation probability distribution $O : \mathcal{S} \to \Delta\mathcal{O}$, and
a probability distribution $T_0 \in \Delta\mathcal{S}$ over the initial state $s_0$.
The agent interacts with the environment in cycles: in time step $t$, the environment is in state $s_{t-1} \in \mathcal{S}$ and the agent chooses an action $a_t \in \mathcal{A}$. Subsequently, the environment transitions to a new state $s_t \in \mathcal{S}$ drawn from the distribution $T(s_t \mid s_{t-1}, a_t)$, and the agent receives an observation $o_t \in \mathcal{O}$ drawn from the distribution $O(o_t \mid s_t)$. The underlying states $s_{t-1}$ and $s_t$ are not directly observed by the agent.
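To make the formalism concrete, here is a minimal sketch of a tabular POMDP\R and its interaction loop in Python. The array names, the toy numbers, and the `interact` helper are illustrative assumptions of mine, not part of the formalism.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy tabular POMDP\R: states, actions and observations are just indices.
n_states, n_actions, n_obs = 3, 2, 2

T  = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a] in Delta(S)
O  = rng.dirichlet(np.ones(n_obs), size=n_states)                  # O[s] in Delta(O)
T0 = rng.dirichlet(np.ones(n_states))                              # T0 in Delta(S)

def interact(policy, m):
    """Run one episode of m cycles; the agent only ever sees its history."""
    s = rng.choice(n_states, p=T0)           # hidden initial state s_0 ~ T0
    history = []                             # h_t = a_1 o_1 ... a_t o_t
    for t in range(1, m + 1):
        a = policy(history)                  # a_t ~ pi(. | h_{t-1})
        s = rng.choice(n_states, p=T[s, a])  # s_t ~ T(. | s_{t-1}, a_t)
        o = rng.choice(n_obs, p=O[s])        # o_t ~ O(. | s_t)
        history.append((a, o))               # the state s_t itself stays hidden
    return history

# A uniformly random policy, just to exercise the loop.
uniform_policy = lambda h: rng.integers(n_actions)
print(interact(uniform_policy, m=5))
```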
An observed history $h_t = a_1 o_1 a_2 o_2 \ldots a_t o_t$ is a sequence of actions and observations. We denote the set of all observed histories of length $t$ with $\mathcal{H}_t := (\mathcal{A} \times \mathcal{O})^t$.
For a given horizon $m$, call $\mathcal{H}_m$ the set of full histories; then $\mathcal{H}_{<m} = \bigcup_{t<m} \mathcal{H}_t$ is the set of partial histories. For $t' > t$, let $a_{t:t'}$ be the sequence of actions $a_t a_{t+1} \ldots a_{t'}$, let $o_{t:t'}$ be the sequence of observations $o_t o_{t+1} \ldots o_{t'}$, and let $s_{t:t'}$ be the sequence of states $s_t s_{t+1} \ldots s_{t'}$.
The set $\Pi$ is the set of policies: functions $\pi : (\mathcal{A} \times \mathcal{O})^* \to \Delta\mathcal{A}$ mapping histories to probability distributions over actions. Given a policy $\pi$ and environment $\mu$, we get a probability distribution over histories:
$$\mu(a_1 o_1 \ldots a_t o_t \mid \pi) := \sum_{s_{0:t} \in \mathcal{S}^{t+1}} T_0(s_0) \prod_{k=1}^{t} O(o_k \mid s_k) \, T(s_k \mid s_{k-1}, a_k) \, \pi(a_k \mid a_1 o_1 \ldots a_{k-1} o_{k-1}).$$
The expectation with respect to the distributions $\mu$ and $\pi$ is denoted $\mathbb{E}^\pi_\mu$.
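The distribution $\mu(\cdot \mid \pi)$ can be computed exactly as the formula reads, by summing over every hidden state sequence $s_{0:t}$. The sketch below is a literal, exponential-time transcription of that sum, intended only to make the formula concrete; the names `history_prob` and `policy_prob` are hypothetical helpers of my own, with $\pi$ passed in as a callable.

```python
import itertools

def history_prob(T, O, T0, policy_prob, history):
    """mu(a_1 o_1 ... a_t o_t | pi): sum over all hidden state sequences s_{0:t}.

    T[s, a] is a distribution over next states, O[s] a distribution over
    observations, T0 a distribution over initial states, and
    policy_prob(a, h) = pi(a | h) for a partial history h (a list of (a, o) pairs).
    """
    t = len(history)
    n_states = len(T0)
    total = 0.0
    for states in itertools.product(range(n_states), repeat=t + 1):  # s_0, ..., s_t
        p = T0[states[0]]
        for k, (a, o) in enumerate(history, start=1):
            p *= (policy_prob(a, history[: k - 1])    # pi(a_k | a_1 o_1 ... a_{k-1} o_{k-1})
                  * T[states[k - 1], a][states[k]]    # T(s_k | s_{k-1}, a_k)
                  * O[states[k]][o])                  # O(o_k | s_k)
        total += p
    return total
```

As a sanity check, summing `history_prob` over all histories in $\mathcal{H}_t$ should give 1.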
Rewards
Rewards in this case can be seen as functions $R : \mathcal{O} \to \mathbb{R}$ from observations to real numbers.
The agent’s goal is to maximize the total reward $\sum_{t=1}^{m} R(o_t)$ up to the horizon $m$. We assume that $\mathcal{S}$ and $\mathcal{A}$ are known to the agent. The reward function $R$ is unknown, but there is a finite set $\mathcal{R}$ of candidate reward functions. The agent has to learn a reward function in the process of interacting with the environment.
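For a fixed full history, the total reward $\sum_{t=1}^{m} R(o_t)$ is easy to evaluate under each candidate in $\mathcal{R}$; the difficulty is only in not knowing which candidate is correct. A small sketch, with a toy candidate set of my own invention:

```python
def total_reward(R, history):
    """Return of a full history h_m under a candidate reward function R: O -> reals."""
    return sum(R(o) for (_, o) in history)

# Two toy hypotheses about which observation is the rewarding one.
candidate_rewards = [
    lambda o: 1.0 if o == 0 else 0.0,
    lambda o: 1.0 if o == 1 else 0.0,
]

history = [(0, 1), (1, 1), (0, 0)]   # a_1 o_1 a_2 o_2 a_3 o_3 with m = 3
print([total_reward(R, history) for R in candidate_rewards])  # [1.0, 2.0]
```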
The reward learning posterior
There are a variety of algorithms that act as reward-function learning processes: a cooperative learning algorithm, interactive question-and-answer sessions, or simply learning from observations of human behaviour / human-generated data. In all cases, at the end of $m$ turns, the agent will have an estimate of the probability of the various candidate reward functions.
Thus a universal definition of the process of reward learning is given by a posterior $P : \mathcal{H}_m \to \Delta\mathcal{R}$, mapping full histories to distributions over possible reward functions. This posterior is equivalent to the definition of the algorithm.
Now, anything that gives a distribution over $\mathcal{H}_m$ therefore gives a distribution over $\mathcal{R}$.
This allows the construction of a value function, for any policy $\pi$, corresponding to the reward learning posterior:
$$V^\pi_P(h_t) := \mathbb{E}^\pi_\mu\left[\sum_{R \in \mathcal{R}} P(R \mid h_m) \sum_{k=1}^{m} R(o_k) \,\middle|\, h_t\right].$$
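One way to estimate $V^\pi_P(h_t)$ is by Monte Carlo: sample full histories $h_m$ extending $h_t$ (under $\mu$ and $\pi$), weight each candidate reward’s return by the posterior $P(R \mid h_m)$, and average. The sketch below assumes hypothetical helpers `continue_history` (a sampler for $h_m$ given $h_t$, encapsulating both the environment’s hidden-state belief and the policy) and `posterior` (returning $P(\cdot \mid h_m)$ as a list of weights); neither is specified by the formalism above.

```python
import numpy as np

def value_of_policy(continue_history, posterior, candidate_rewards, h_t, m, n_samples=1000):
    """Monte Carlo estimate of V^pi_P(h_t)."""
    estimates = []
    for _ in range(n_samples):
        h_m = continue_history(h_t, m)       # full history h_m ~ mu, pi, extending h_t
        returns = [sum(R(o) for (_, o) in h_m) for R in candidate_rewards]
        weights = posterior(h_m)             # P(R | h_m), one weight per candidate
        estimates.append(sum(w * ret for w, ret in zip(weights, returns)))
    return float(np.mean(estimates))
```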
The reward learning prior
Some reward learning algorithms (though not all) will also have a reward learning prior $\hat{P}$ over $\mathcal{R}$. Given a partial history $h_t \in \mathcal{H}_t$ with $t \le m$, this gives the agent’s current estimate as to what the final distribution over $\mathcal{R}$ will be: $\hat{P}(\cdot \mid h_t)$.
For consistency, when $t = m$, set $\hat{P}(\cdot \mid h_m) = P(\cdot \mid h_m)$ (so that when all the history is in, the prior is the posterior).
This prior is often used in practice to estimate the value function $V^\pi_P(h_t)$.
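One simple (and generally approximate) way this can be done, sketched below under the same hypothetical helpers as above: use the current prior $\hat{P}(\cdot \mid h_t)$ in place of the not-yet-available posterior $P(\cdot \mid h_m)$ when estimating the value of a partial history. By the consistency condition, the two estimates coincide once $t = m$.

```python
def value_estimate_with_prior(continue_history, prior, candidate_rewards, h_t, m, n_samples=1000):
    """Estimate V^pi_P(h_t) using the prior hat_P(. | h_t) in place of the
    posterior P(. | h_m), which is only available once the history is complete."""
    weights = prior(h_t)                      # hat_P(R | h_t), one weight per candidate
    per_sample = []
    for _ in range(n_samples):
        h_m = continue_history(h_t, m)        # full history extending h_t
        per_sample.append([sum(R(o) for (_, o) in h_m) for R in candidate_rewards])
    mean_returns = [sum(col) / n_samples for col in zip(*per_sample)]
    return sum(w * r for w, r in zip(weights, mean_returns))

# Consistency at t = m: prior(h_m) must equal posterior(h_m) for every full
# history h_m, so this estimate then agrees with the posterior-based one.
```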