Thoughts on Ryan Carey’s Incorrigibility in the CIRL Framework (I am going to try to post these semi-regularly).
This specific situation looks unrealistic. But it’s not really trying to be realistic; it’s trying to be a counterexample. In that spirit, you could also just use $R_2(a, s_d) = 1000$, which is a reward function parametrized by $\theta$ that gives the same behavior but stops me from saying “Why Not Just set $\theta = -1$”, which isn’t the point.
How something like this might actually happen: you try to have your $R_1$ be a complicated neural network that can approximate any function. But you butcher the implementation and get something basically random instead, and this $R_2$ cannot approximate the real human reward.
An important insight this highlights well: An off-switch is something that you press only when you’ve programmed the AI badly enough that you need to press the off-switch. But if you’ve programmed it wrong, you don’t know what it’s going to do, including, possibly, its off-switch behavior. Make sure you know under which assumptions your off-switch will still work!
Assigning high value to shutting down is incorrigible, because the AI shuts itself down. What about assigning high value to being in a button state?
The paper considers a situation where the shutdown button is hardcoded, which isn’t enough by itself. What’s really happening is that the human either wants or doesn’t want the AI to shut down, which sounds like a term in the human reward that the AI can learn.
One way to do this is for the AI to do maximum likelihood with a prior that assigns 0 probability to the human erroneously giving the shutdown command. I suspect there’s something less hacky related to setting an appropriate prior over the reward assigned to shutting down.
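A minimal sketch of what I mean, written as a Bayesian update rather than literal maximum likelihood (the two-hypothesis setup, the prior, and all the numbers are made up for illustration, not from the paper): if the model assigns zero probability to an erroneous shutdown command, then observing the command forces the conclusion that shutdown really is wanted.

```python
# Toy sketch (my framing, not the paper's model): infer whether the human wants
# shutdown, where the probability of an *erroneous* shutdown command is a free
# parameter. Setting it to zero means the command is always taken at face value.
def p_wants_shutdown_given_command(p_error: float, prior_wants: float = 0.1) -> float:
    # P(command | wants shutdown) = 1, P(command | doesn't want shutdown) = p_error
    numerator = 1.0 * prior_wants
    denominator = numerator + p_error * (1.0 - prior_wants)
    return numerator / denominator

print(p_wants_shutdown_given_command(p_error=0.0))  # 1.0: the command is always believed
print(p_wants_shutdown_given_command(p_error=0.3))  # ~0.27: the AI may second-guess the human
```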
The footnote on page 7 confuses me a bit—don’t you want the AI to always defer to the human in button states? The answer feels like it will be clearer to me if I look into how “expected reward if the button state isn’t avoided” is calculated.
Also I did just jump into this paper. There are probably lots of interesting things that people have said about MDPs and CIRLs and Q-values that would be useful.
Thoughts on Dylan Hadfield-Menell et al.’s The Off-Switch Game.
I don’t think it’s quite right to call this an off-switch—the model is fully general to the situation where the AI is choosing between two alternatives $A$ and $B$ (normalized in the paper so that $U(B) = 0$), and to me an off-switch is a hardware override that the AI need not want you to press.
The wisdom to take away from the paper: An AI will voluntarily defer to a human—in the sense that the AI thinks that it can get a better outcome by its own standards if it does what the human says—if it’s uncertain about the utilities, or if the human is rational.
This whole setup seems to be somewhat superseded by CIRL, which has the AI, uh, causally find $U_A$ by learning its value from the human actions, instead of evidentially(?) doing it by taking decisions that happen to land it on action $A$ when $U_A$ is high, because it’s acting in a weird environment where a human is present as a side-constraint.
Could some wisdom to gain be that the high-variance, high-human-rationality regime is something of an explanation as to why CIRL works? I should read more about CIRL to see if this is needed or helpful, and to compare and contrast etc.
Why does the reward gained drop when uncertainty is too high? Because the prior that the AI gets from estimating the human reward is more accurate than the human’s decisions, so when uncertainty is too high it keeps mistakenly deferring to the flawed human, who tells it to take the worse action more often?
The verbal description, that the human just types in a noisily sampled value of $U_A$, is somewhat strange—if the human has explicit access to their own utility function, they can just take the best actions directly! In practice, though, the AI would learn this by looking at many past human actions (there’s some CIRL!), which does seem like it plausibly gives a more accurate policy than the human’s (ht Should Robots Be Obedient).
The human is Boltzmann-rational in the two-action situation (hence the sigmoid). I assume that it’s the same for the multi-action situation, though this isn’t stated. How much does the exact way in which the human is irrational matter for their results?
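To make the “defer when uncertain or when the human is rational” claim concrete for myself, here’s a toy Monte Carlo sketch of my reading of the setup; the Gaussian belief over $U_A$, the $\sigma(\beta U_A)$ human, and all the numbers are my guesses for illustration, not the paper’s exact model.

```python
# Toy Monte Carlo sketch of my reading of the off-switch setup (not the paper's exact model).
# The robot believes U_A ~ N(mu, sigma^2). It can act now (expected reward mu), switch
# itself off (reward 0), or defer to a Boltzmann-rational human who lets the action
# through with probability sigmoid(beta * U_A).
import numpy as np

rng = np.random.default_rng(0)

def value_of_deferring(mu, sigma, beta, n=200_000):
    u = rng.normal(mu, sigma, n)               # samples from the robot's belief over U_A
    p_allow = 1.0 / (1.0 + np.exp(-beta * u))  # Boltzmann-rational human; more rational as beta grows
    return np.mean(u * p_allow)                # gets U_A if allowed, 0 if shut off

for sigma in [0.1, 1.0, 3.0]:
    for beta in [0.5, 5.0]:
        defer = value_of_deferring(mu=-0.5, sigma=sigma, beta=beta)
        alone = max(-0.5, 0.0)                 # best of acting immediately (mu) or switching off (0)
        print(f"sigma={sigma}, beta={beta}: defer={defer:+.3f} vs go-it-alone={alone:+.3f}")
```

With these made-up numbers the robot prefers deferring only when its belief over $U_A$ is wide or the human is fairly rational; with a narrow belief and a noisy human it does better acting on its own estimate (or switching itself off), which matches the qualitative claim above.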
Thoughts on Abram Demski’s Partial Agency:
When I read Partial Agency, I was struck with a desire to try formalizing this partial agency thing. Defining Myopia seems like it might have a definition of myopia; one day I might look at it. Anyway,
Formalization of Partial Agency: Try One
A myopic agent is optimizing a reward function $R(x_0, y(x_0))$, where $x$ is the vector of parameters it’s thinking about and $y$ is the vector of parameters it isn’t thinking about. The gradient descent step picks the $\delta x$ in the direction that maximizes $R(x_0 + \delta x, y(x_0))$ (it is myopic, so it can’t consider the effects on $y$), and then moves the agent to the point $(x_0 + \delta x, y(x_0 + \delta x))$.
This is dual to a stop-gradient agent, which picks the $\delta x$ in the direction that maximizes $R(x_0 + \delta x, y(x_0 + \delta x))$ but then moves the agent to the point $(x_0 + \delta x, y(x_0))$ (the gradient through $y$ is stopped).
For example,
Nash equilibria - $x$ are the parameters defining the agent’s behavior. $y(x_0)$ are the parameters of the other agents if they go up against the agent parametrized by $x_0$. $R$ is the reward given for an agent $x$ going up against a set of agents $y$.
Image recognition with a neural network - $x$ are the parameters defining the network, $y(x_0)$ are the image classifications for every image in the dataset for the network with parameters $x_0$, and $R$ is the loss on those dataset classifications, plus the loss of the network described by $x$ on classifying the current training example.
Episodic agent - $x$ are parameters describing the agent’s behavior. $y(x_0)$ are the performances of the agent $x_0$ in future episodes. $R$ is the sum of $y$, plus the reward obtained in the current episode.
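Here’s a small numerical sketch of the two update rules defined above. The particular $R$ and $y$ are arbitrary made-up functions, chosen only so that the two steps diverge:

```python
# Numerical sketch of the myopic vs stop-gradient updates; R and y are arbitrary choices.
def y(x):                      # the parameters the agent isn't (or is) thinking about
    return 3.0 * x

def R(x, y_val):               # reward as a function of both sets of parameters
    return -(x - 1.0) ** 2 - 0.5 * y_val

def grad(f, x, eps=1e-6):      # simple central-difference derivative
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x0, lr = 0.0, 0.1

# Myopic agent: ascend R(x, y(x0)) with y treated as a constant...
myopic_dx = lr * grad(lambda x: R(x, y(x0)), x0)
# ...but the world then re-evaluates y at the new x.
myopic_state = (x0 + myopic_dx, y(x0 + myopic_dx))

# Stop-gradient agent: ascend the full R(x, y(x))...
stopgrad_dx = lr * grad(lambda x: R(x, y(x)), x0)
# ...but y itself is held at its old value y(x0).
stopgrad_state = (x0 + stopgrad_dx, y(x0))

print("myopic:        dx =", round(myopic_dx, 3), " new state =", tuple(round(v, 3) for v in myopic_state))
print("stop-gradient: dx =", round(stopgrad_dx, 3), " new state =", tuple(round(v, 3) for v in stopgrad_state))
```

With this choice the myopic step is four times larger, because it ignores the $\frac{\partial R}{\partial y}\frac{\partial y}{\partial x} = -1.5$ contribution that the stop-gradient agent feels; the two updates coincide exactly when that term is zero, which connects to the condition derived in the uncertainty framing below.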
Partial Agency due to Uncertainty?
Is it possible to cast partial agency in terms of uncertainty over reward functions? One reason I’d be myopic is if I didn’t believe that I could, in expectation, improve some part of the reward, perhaps because it’s intractable to calculate (the behavior of other agents) or because it’s something I’m not programmed to care about (reward in other episodes).
Let $R_1$ be drawn from a probability distribution over reward functions. Then one could decompose the true, uncertain reward into $R' = R_0(x_0) + R_1(x_0)$, defined in such a way that $E(R_1(x_0 + \delta x) - R_1(x_0)) \approx 0$ for any $\delta x$? Then this would be myopia where the agent either doesn’t know or doesn’t care about $R_1$, or at least doesn’t know or care what its output does to $R_1$. This seems sufficient, but not necessary.
Now I have two things that might describe myopia, so let’s use both of them at once! Since you only end up doing gradient descent on $R_0$, it would make sense to say $R'(x) = R(x, y(x))$, $R_0(x) = R(x, y(x_0))$, and hence that $R_1(x) = R(x, y(x)) - R(x, y(x_0))$.
Since $R_1(x_0 + \delta x) = R_1(x_0) + \delta x \frac{\partial R_1}{\partial x}$ for small $\delta x$, this means that $E\left(\frac{\partial R_1}{\partial x}\right) = 0$, so substituting in my expression for $R_1$ gives $E\left(\frac{\partial R}{\partial x} + \frac{\partial R}{\partial y}\frac{\partial y}{\partial x} - \frac{\partial R}{\partial x}\right) = 0$, so $E\left(\frac{\partial R}{\partial y}\frac{\partial y}{\partial x}\right) = 0$. Uncertainty is only over $R$, so this is just the claim that the agent will be myopic with respect to $y$ if $E\left(\frac{\partial R}{\partial y}\right) = 0$. So it won’t want to include $y$ in its gradient calculation if it thinks the gradients with respect to $y$ are, on average, 0. Well, at least I didn’t derive something obviously false!
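As a quick sanity check on that substitution (nothing deep, it’s just the chain rule), here’s a symbolic computation with an arbitrary concrete $R$ and $y$ confirming that $\frac{\partial R_1}{\partial x}$ at $x_0$ is exactly $\frac{\partial R}{\partial y}\frac{\partial y}{\partial x}$:

```python
# Symbolic sanity check: for R1(x) = R(x, y(x)) - R(x, y(x0)), dR1/dx at x0 equals
# (dR/dy)(dy/dx). The particular R and y below are arbitrary choices.
import sympy as sp

x, x0 = sp.symbols("x x0")
y = sp.sin(x)                                  # an arbitrary y(x)
R = lambda xx, yy: xx**2 * yy + sp.exp(yy)     # an arbitrary R(x, y)

R1 = R(x, y) - R(x, y.subs(x, x0))             # R1(x) = R(x, y(x)) - R(x, y(x0))
lhs = sp.diff(R1, x).subs(x, x0)               # dR1/dx evaluated at x0

yy = sp.Symbol("yy")
dR_dy = sp.diff(R(x, yy), yy).subs({x: x0, yy: y.subs(x, x0)})
dy_dx = sp.diff(y, x).subs(x, x0)

print(sp.simplify(lhs - dR_dy * dy_dx))        # 0
```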
But Wait There’s More
When writing the examples for the gradient-descent-y formalisation, something struck me: it seems there’s an $R(x) = r(x) + \sum_i y_i(x)$ structure to a lot of them, where $r$ is the reward on the current episode, and the $y_i$ are rewards obtained on future episodes.
You could maybe even use this to have soft episode boundaries: say the agent receives a reward $r_t$ on each timestep, so $R(x) = r_0(x) + r_1(x)\alpha + r_2(x)\alpha^2 + \sum_{i=3}^{\infty} r_i(x)\alpha^i$, and say that $\alpha^3 \ll 1$ so that $\frac{\partial R}{\partial r_i} \ll 1$ for $i \geq 3$, which is basically the criterion for myopia up above.
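As a tiny illustration with an arbitrary discount (say $\alpha = 0.1$): the sensitivity of $R$ to the reward $i$ steps out is $\frac{\partial R}{\partial r_i} = \alpha^i$, which has already dropped to $10^{-3}$ by $i = 3$, so the agent is effectively myopic past that horizon.

```python
# Tiny illustration of soft episode boundaries via discounting; alpha is an arbitrary choice.
alpha = 0.1
for i in range(6):
    print(f"dR/dr_{i} = alpha**{i} = {alpha**i:.0e}")   # 1e+00, 1e-01, 1e-02, 1e-03, ...
```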
Unrelated Note
On a completely unrelated note, I read the Parable of Predict-O-Matic in the past, but foolishly neglected to read Partial Agency beforehand. The only thing that I took away from PoPOM the first time around was the bit about inner optimisers, coincidentally the only concept introduced that I had been thinking about beforehand. I should have read the manga before I watched the anime.
So the definition of myopia given in Defining Myopia was quite similar to my expansion in the But Wait There’s More section; you can roughly match them up by saying $r(x) = \sum_i f_i r_i(x)$ and $y_i(x) = (1 - f_i) r_i(x)$, where $f_i$ is a real number corresponding to the amount that the agent cares about rewards obtained in episode $i$ and $r_i$ is the reward obtained in episode $i$. Putting both of these into the sum gives $R(x) = \sum_i r_i(x)$, the undiscounted, non-myopic reward that the agent eventually obtains.
In terms of the $R = R_0 + R_1$ definition that I give in the uncertainty framing, this is $R_0 = R(x, y_0) = \sum_i f_i r_i(x) + \sum_i (1 - f_i) r_i(x_0)$, and $R_1 = R(x, y) - R(x, y_0) = \sum_i (1 - f_i)(r_i(x) - r_i(x_0))$.
So if you let $r$ be a vector of the reward obtained on each step and $f$ be a vector of how much the agent cares about each step, then $x \to x + \epsilon \sum_i f_i \frac{\partial r_i}{\partial x}$, and thus the change to the overall reward is $R \to R + \epsilon \sum_i \frac{\partial r_i}{\partial x} \sum_j f_j \frac{\partial r_j}{\partial x}$, which can be negative if the two sums have different signs.
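A concrete toy instance of that last point, with made-up per-step rewards: take $r_0(x) = x$ and $r_1(x) = -3x$ with caring weights $f = (1, 0)$, so the step follows $\frac{\partial r_0}{\partial x} = 1$ even though the total reward $r_0 + r_1 = -2x$ decreases along it.

```python
# Toy instance of the claim above: the f-weighted gradient step increases the reward
# the agent cares about while decreasing the total reward. Numbers are made up.
def r(x):
    return [x, -3.0 * x]             # per-step rewards r_0(x), r_1(x)

dr_dx = [1.0, -3.0]                  # their gradients with respect to x
f = [1.0, 0.0]                       # how much the agent cares about each step

x, eps = 0.0, 0.1
x_new = x + eps * sum(fi * gi for fi, gi in zip(f, dr_dx))   # x -> x + eps * sum_i f_i dr_i/dx

cared = lambda x: sum(fi * ri for fi, ri in zip(f, r(x)))
total = lambda x: sum(r(x))

print(f"cared-about reward: {cared(x):.2f} -> {cared(x_new):.2f}")  # 0.00 -> 0.10
print(f"total reward:       {total(x):.2f} -> {total(x_new):.2f}")  # 0.00 -> -0.20
```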
I was hoping that a point would reveal itself to me about now but I’ll have to get back to you on that one.