Let W be the set of worlds, U = {u : W→R} the set of all utility functions, O the set of human observations, and A the set of human actions. Let C = {c : U×O→A} be the set of bounded optimization algorithms, so that an individual c∈C is a function from (utility, observation) pairs to actions. Examples of c include AIXI-tl with specific time and length limits, and existing deep RL models. C encodes the AI's ideas about what kind of bounded agent we might be. There are various conditions of approximate correctness on C.
Let O∗ and A∗ be the AI's observation and action spaces.
The AI is only interacting with one human, and has a prior Π : O×A×C×U×W×O∗×A∗→R, where W stands for the rest of the world. Note that parameters not given are summed over: Π(o, c, u, w, o∗, a∗) = ∑a∈A Π(o, a, c, u, w, o∗, a∗).
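As a toy illustration, here is how Π and the sum-over-unmentioned-parameters convention might look when every space is finite. This is a minimal sketch, not part of the proposal: all spaces, names, and probabilities below are hypothetical.

```python
# Minimal sketch (all spaces and names hypothetical): Pi as a finite table
# from (o, a, c, u, w, o_star, a_star) tuples to probabilities.
from itertools import product
from collections import defaultdict

O = ["saw_red", "saw_blue"]        # human observations
A = ["press", "wait"]              # human actions
C = ["greedy", "noisy_greedy"]     # bounded optimization algorithms (as labels)
U = ["likes_red", "likes_blue"]    # utility functions (as labels)
W = ["world0", "world1"]           # worlds
O_STAR = ["obs0", "obs1"]          # AI observations
A_STAR = ["act0", "act1"]          # AI actions

# A uniform joint prior, standing in for a real prior over the product space.
support = list(product(O, A, C, U, W, O_STAR, A_STAR))
Pi = {point: 1.0 / len(support) for point in support}

def marginal_without_a(Pi):
    """Pi(o, c, u, w, o*, a*) = sum over a of Pi(o, a, c, u, w, o*, a*)."""
    out = defaultdict(float)
    for (o, a, c, u, w, o_star, a_star), p in Pi.items():
        out[(o, c, u, w, o_star, a_star)] += p
    return dict(out)
```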
The AI performs Bayesian updates on Π as normal. On gathering an observation o′, it conditions on it:
Πnew ∝ Π on the hypotheses where o∗ is consistent with o′, and Πnew = 0 otherwise.
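A minimal sketch of this conditioning step, assuming Π is stored as a finite table keyed by (o, a, c, u, w, o∗, a∗) as in the sketch above; the `consistent` predicate is a hypothetical stand-in for whatever "o∗ matches o′" means.

```python
# Minimal sketch (hypothetical names): condition Pi on an AI observation
# o_prime by zeroing out inconsistent hypotheses and renormalizing.
def condition_on_observation(Pi, o_prime, consistent):
    """Pi_new is proportional to Pi where o_star is consistent with o_prime,
    and zero elsewhere. Hypotheses are (o, a, c, u, w, o_star, a_star) tuples;
    consistent(o_star, o_prime) encodes whatever "matches" means here."""
    kept = {point: p for point, p in Pi.items() if consistent(point[5], o_prime)}
    total = sum(kept.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under Pi")
    return {point: p / total for point, p in kept.items()}

# e.g. with exact matching:
# posterior = condition_on_observation(Pi, "obs0",
#                                      lambda o_star, o_prime: o_star == o_prime)
```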
The AI then chooses argmaxa∗∈A∗ ∑w∈W EΠ[u(w)] × P(w | a∗), where P(w | a∗) is the probability of world w given that the AI takes action a∗.
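A minimal sketch of this action-selection step under the same toy representation; `utility_value` and `world_prob` are hypothetical stand-ins for u(w) and P(w | a∗).

```python
# Minimal sketch (hypothetical names): score each AI action a_star by
# sum over w of E_Pi[u(w)] * P(w | a_star) and take the argmax.
def choose_action(Pi, A_STAR, W, utility_value, world_prob):
    """
    Pi: posterior table mapping (o, a, c, u, w, o_star, a_star) -> probability.
    utility_value(u, w): the value the utility function u assigns to world w.
    world_prob(w, a_star): probability of world w if the AI takes a_star.
    """
    total = sum(Pi.values())

    def expected_u(w):
        # E_Pi[u(w)]: average of u(w) under Pi's marginal over u.
        return sum(p * utility_value(point[3], w) for point, p in Pi.items()) / total

    def score(a_star):
        return sum(expected_u(w) * world_prob(w, a_star) for w in W)

    return max(A_STAR, key=score)
```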
Of course, a lot of the magic here is happening in Π, but the hope is that you can find a prior that favours fast and approximately correct optimization algorithms in C over slow or totally defective ones, and that favours simplicity in each term.
Basically, the human's utility function is
uh(w) = ∑o∈O, c∈C, u∈U Π(o, a(o), c, u) × u(w)
where O is the set of all things the human could have seen, a(o) is the action the human's actual policy takes on observation o, and Π concentrates on the c∈C that are simple, stochastic, bounded maximization algorithms.
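A minimal sketch of computing uh(w) from a table of the marginal Π(o, a, c, u); `human_policy` and `utility_value` are hypothetical stand-ins for a(o) and u(w).

```python
# Minimal sketch (hypothetical names): u_h(w) as a Pi-weighted mixture of the
# candidate utility functions, counted only where the hypothesised action
# agrees with the action the human's actual policy takes.
def u_h(w, Pi_human, human_policy, utility_value):
    """
    Pi_human: marginal of Pi mapping (o, a, c, u) -> probability.
    human_policy(o): the action the human's actual policy takes on o, i.e. a(o).
    utility_value(u, w): the value u assigns to world w.
    """
    return sum(p * utility_value(u, w)
               for (o, a, c, u), p in Pi_human.items()
               if a == human_policy(o))
```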
If you don't find it very clear what I'm doing, that's OK. I'm not very clear on what I'm doing either. This is a bit of a pointer in the rough direction.
A lot of magic is happening in the prior over utility functions and optimization algorithms; removing that magic is the open problem.
(I’m pessimistic about making progress on that problem, and instead try to define value by using the human policy to guide a process of deliberation rather than trying to infer some underlying latent structure.)
I agree you should model the human as some kind of cognitively bounded agent. The question is how.