[Question] What is wrong with this “utility switch button problem” approach?
Suppose you have two positive utility functions A and B, mathematically considered to be random variables dependent on the choice of policy.
Let K be a random variable that is 1 if the button is pressed and 0 otherwise, again dependent on the choice of policy. I am assuming there is a particular time at which the button must be pressed, so that this is a single bit: you have one chance to switch this AI between the two utilities, and whatever you pick in that instant is baked in forevermore.
Let a = E(A(1−K)) and b = E(BK) be the expected partial utilities.
Now consider all policies. In particular, consider the Pareto frontier between a and b. Through the introduction of randomness (stochastic mixtures of policies), we can make that Pareto frontier continuous.
We want to pick some policy from that frontier.
There will be some policy h_a that throws all resources at maximizing a, and some (usually different) policy h_b that throws all available resources at maximizing b.
Let q be a tradeoff rate, and let u = a(1−q) + bq. The Pareto frontier can now be defined as the set of policies that maximize u for varying q.
h_a is the policy that is optimal when q = 0, which has u(h_a) = a(h_a). There, du/dE(K) < 0, since at h_a all the value comes from A, which is only realized when the button is not pressed.
Likewise, at h_b with q = 1 we have du/dE(K) > 0.
So somewhere in the middle, by the continuity granted by stochastic mixes of policies, there must be at least one point where du/dE(K) = 0. Use that policy, or stochastic mixture of policies.
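To make this concrete, here is a minimal sketch in Python under illustrative assumptions of my own (a one-parameter resource split x, a fixed press probability p, and square-root branch utilities; none of these specifics come from the argument above). It traces the frontier policy for each q and shows the button-gradient du/dE(K) moving from negative at q = 0 to positive at q = 1, crossing zero in between (at q = 0.5 in this symmetric toy).

```python
import numpy as np

# Minimal sketch of the construction above, under illustrative assumptions
# of my own: a one-parameter policy family where x in [0, 1] is the fraction
# of resources spent preparing for the "button not pressed" branch, a fixed
# press probability p = E(K), and square-root branch utilities.
A_bar = lambda x: np.sqrt(x)        # utility A achieved if the button is NOT pressed
B_bar = lambda x: np.sqrt(1.0 - x)  # utility B achieved if the button IS pressed
p = 0.5                             # E(K), taken as policy-independent here

xs = np.linspace(0.0, 1.0, 1001)

def frontier_point(q):
    """Pareto-frontier policy for tradeoff rate q: maximize u = (1-q)a + qb over x."""
    a = (1.0 - p) * A_bar(xs)       # a = E(A(1-K))
    b = p * B_bar(xs)               # b = E(BK)
    u = (1.0 - q) * a + q * b
    x = xs[np.argmax(u)]
    # du/dE(K): sensitivity of u to the press probability, holding the rest
    # of the policy (the resource split x) fixed.
    du_dEK = -(1.0 - q) * A_bar(x) + q * B_bar(x)
    return x, du_dEK

# The button-gradient is negative at q = 0 (policy h_a), positive at q = 1
# (policy h_b), and crosses zero for some intermediate q.
for q in np.linspace(0.0, 1.0, 11):
    x, g = frontier_point(q)
    print(f"q={q:.1f}  x*={x:.2f}  du/dE(K)={g:+.3f}")
```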
This agent will never pay ϵ of utility to set the value of the button one way or the other, because the chosen policy is one for which E(u|K=0) = E(u|K=1) (I think). Hopefully this means that any policy that pays to change the button does worse than the policy that doesn't pay but otherwise does everything else the same.
Put another way: if you pay to flip the button, then you must care about the state of the button. If the chance the button is pressed changes, you are getting more or less for your ϵ of utility, so the gradient can't be 0.
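As a first-order sanity check of that claim (my own restatement, under the assumption that paying shifts the press probability by some δ while leaving everything else about the policy unchanged): at the chosen policy,

$$\Delta u \approx \delta \cdot \frac{du}{dE(K)} - \epsilon = 0 - \epsilon < 0,$$

so the otherwise identical policy that keeps its ϵ is, to first order, strictly better.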
And this is all described as a choice of policy, a framework which automatically includes the actions of all subagents.
So, how does this fail? Who can break this agent and get it to do crazy things?