Perhaps you should model humans as some kind of cognitively bounded agent. An example algorithm would be AIXI-tl. Have the AI assume that humans are an AIXI-tl with an unknown utility function, and try to optimize that function. This means that your AI assumes that we get sufficiently trivial ethical problems right, and have no clue about sufficiently hard ones.
A person is given a direct choice: shoot their own foot off, or don't. They choose not to. The AI reasons that our utility function values having feet.
A person is asked whether the Nth digit of pi is even (with N large), with their foot being shot off if they get it wrong. They get it wrong. The AI reasons that the human didn't have enough computing power to solve that problem. Contrast this with an AI that assumes humans always behave optimally, which will deduce that humans like having their foot shot off when asked maths questions.
In practice you might want to use some other type of cognitively bounded algorithm, as AIXI-tl probably makes different kinds of mistakes from humans. This simple model at least demonstrates that decisions in more understandable situations are a stronger indicator of goals.
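A minimal sketch of that point, under the toy assumption that the human acts on its goal only when the problem fits within a compute budget (the goal names and numbers here are illustrative, not anything from the comment):

```python
# Toy Bayesian goal inference for a compute-bounded human.
# Assumption: on problems within the compute budget the human acts on its
# goal almost surely; on harder problems its choice is close to a coin flip.

def likelihood(goal, chose_foot_kept, difficulty, compute_budget):
    """P(observed choice | goal) for a bounded agent."""
    p_act_on_goal = 0.99 if difficulty <= compute_budget else 0.5
    if goal == "values_feet":
        return p_act_on_goal if chose_foot_kept else 1 - p_act_on_goal
    return (1 - p_act_on_goal) if chose_foot_kept else p_act_on_goal

def posterior(observations, compute_budget):
    """observations: list of (chose_foot_kept, difficulty) pairs."""
    post = {"values_feet": 0.5, "wants_foot_shot_off": 0.5}
    for chose_foot_kept, difficulty in observations:
        for goal in post:
            post[goal] *= likelihood(goal, chose_foot_kept, difficulty, compute_budget)
    z = sum(post.values())
    return {goal: p / z for goal, p in post.items()}

# The easy, direct choice is strong evidence about the goal...
print(posterior([(True, 1)], compute_budget=10))        # ~0.99 "values_feet"
# ...while losing the foot on the hard digit-of-pi question is not.
print(posterior([(False, 10**6)], compute_budget=10))   # stays at ~0.5
```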
If you want me to spell this out formally, with an agent that has priors over all the t, l limitations that "humans" might have, and all the goals they might have (low-complexity goals favoured), I can do that.
Let $W$ be the set of worlds, $U \subseteq \{W \to \mathbb{R}\}$ the set of all utility functions, $O$ the set of human observations, and $A$ the set of human actions. Let $C \subseteq \{U \times O \to A\}$ be the set of bounded optimization algorithms, so that an individual $c \in C$ is a function from (utility, observation) pairs to actions. Examples of $c$ include AIXI-tl with specific time and length limits, and existing deep RL models. $C$ represents the AI's idea of what kind of bounded agent we might be. There are various conditions of approximate correctness on $C$.
Let $O^*$ and $A^*$ be the AI's observation and action spaces.
The AI is only interacting with one human, and has a prior $\Pi : O \times A \times C \times U \times W \times O^* \times A^* \to \mathbb{R}$, where $W$ stands for the rest of the world. Parameters not given are summed over:
$$\Pi(o,c,u,w,o^*,a^*) = \sum_{a \in A} \Pi(o,a,c,u,w,o^*,a^*)$$
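For concreteness, a minimal sketch of $\Pi$ as a table over small finite sets, with marginalisation by summing out unspecified parameters as above (all helper names here are illustrative):

```python
from itertools import product

def make_prior(O, A, C, U, W, O_star, A_star, weight):
    """Build Pi as a normalised table over (o, a, c, u, w, o*, a*) tuples.
    `weight` assigns an unnormalised mass to each tuple."""
    Pi = {tup: weight(*tup) for tup in product(O, A, C, U, W, O_star, A_star)}
    z = sum(Pi.values())
    return {tup: p / z for tup, p in Pi.items()}

def marginal(Pi, keep):
    """Sum out every component not listed in `keep`, e.g. keep=(0, 2, 3, 4, 5, 6)
    sums over the human action a, giving Pi(o, c, u, w, o*, a*)."""
    out = {}
    for tup, p in Pi.items():
        key = tuple(tup[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out
```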
The AI performs Bayesian updates on $\Pi$ as normal. On gathering part of an observation $o'$:
$$\Pi_{\text{new}} \propto \begin{cases} \Pi & \text{if } o^* \Rightarrow o' \\ 0 & \text{otherwise} \end{cases}$$
If $A^*$ is the AI's action space, it chooses
$$\arg\max_{a^* \in A^*} \sum_{w \in W} \mathbb{E}_\Pi[u(w)] \times P(w)$$
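A rough sketch of the update and the action choice in the same toy table representation; the `consistent` predicate stands in for the $o^* \Rightarrow o'$ condition, and conditioning on $a^*$ is one reading of the argmax above:

```python
def update(Pi, o_prime, consistent):
    """Pi_new ∝ Pi on tuples whose o* is consistent with the partial
    observation o'; everything else gets probability zero."""
    post = {tup: p for tup, p in Pi.items() if consistent(tup[5], o_prime)}
    z = sum(post.values())
    return {tup: p / z for tup, p in post.items()}

def choose_action(Pi, A_star):
    """Pick the AI action a* maximising the expected utility of the world
    under Pi conditioned on that action."""
    def expected_utility(a_star):
        rows = [(p, u, w) for (o, a, c, u, w, o_s, a_s), p in Pi.items()
                if a_s == a_star]
        z = sum(p for p, _, _ in rows)
        return sum(p * u(w) for p, u, w in rows) / z if z > 0 else float("-inf")
    return max(A_star, key=expected_utility)
```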
Of course, a lot of the magic here is happening in $\Pi$, but if you can find a prior that favours fast and approximately correct optimization algorithms $c \in C$ over slow or totally defective ones, and favours simplicity in each term, then the rest should follow.
Basically, the human's utility function is
$$u_h(w) = \sum_{o \in O,\, c \in C,\, u \in U} \Pi(o, a(o), c, u) \times u(w)$$
where $O$ is the set of all things the human could have seen, $a(o)$ is whatever policy the human implements, and $\Pi$ focuses on $c \in C$ that are simple, stochastic, bounded maximization algorithms.
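One hedged way to instantiate this: weight each candidate utility $u$ by a simplicity-favouring prior over $(c, u)$ times how likely the bounded optimiser $c$, maximising $u$, is to reproduce the human's observed actions. The `complexity` and `c_acts` helpers below are assumptions, not anything pinned down above:

```python
import math

def make_Pi_weight(complexity, c_acts):
    """A simplicity-favouring prior over (c, u), times the probability that
    bounded optimiser c, maximising u after seeing o, takes action a.
    c_acts(c, u, o, a) should return P(a | c, u, o)."""
    def Pi_weight(o, a, c, u):
        prior = math.exp(-complexity(c)) * math.exp(-complexity(u))
        return prior * c_acts(c, u, o, a)
    return Pi_weight

def human_utility(w, O, C, U, policy, Pi_weight):
    """u_h(w) = sum over (o, c, u) of Pi(o, policy(o), c, u) * u(w)."""
    return sum(Pi_weight(o, policy(o), c, u) * u(w)
               for o in O for c in C for u in U)
```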
If you don't find it very clear what I'm doing, that's OK. I'm not very clear on what I'm doing either. This is a bit of a pointer in the rough direction.
A lot of magic is happening in the prior over utility functions and optimization algorithms; removing that magic is the open problem.
(I’m pessimistic about making progress on that problem, and instead try to define value by using the human policy to guide a process of deliberation rather than trying to infer some underlying latent structure.)
I think this is important, but I’d take it further.
In addition to computational limits for the class of decisions where you need to compute in order to decide, there are clearly some heuristics being used by humans that give implicitly incoherent values. In those cases, you might want to apply the idea of computational limits as well. This would allow you to say that the reason they picked X rather than Y at time 1 (for time 2), but Y rather than X at time 2, reflects the cost of thinking about what their future self will want.
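As a toy illustration of that last point (the framing and numbers are illustrative, not the commenter's model): an agent that only pays the cost of simulating its future preferences when the stakes justify it will look inconsistent between time 1 and time 2.

```python
def choose_for_time2(at_time, value_X, value_Y, expected_gain_from_thinking,
                     thinking_cost):
    """Pick X or Y for consumption at time 2."""
    if at_time == 1 and expected_gain_from_thinking < thinking_cost:
        # Not worth simulating the future self now, so fall back on a cheap
        # heuristic: stick with the familiar option X.
        return "X"
    # At time 2 (or whenever thinking is worth it) the real values get compared.
    return "X" if value_X >= value_Y else "Y"

print(choose_for_time2(1, 1.0, 2.0, expected_gain_from_thinking=0.3, thinking_cost=1.0))  # X
print(choose_for_time2(2, 1.0, 2.0, expected_gain_from_thinking=0.3, thinking_cost=1.0))  # Y
```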
I agree you should model the human as some kind of cognitively bounded agent. The question is how.