In Chapter 16, we show that a machine has a positive incentive to allow itself to be switched off if and only if it is uncertain about the human objective.
This sentence really makes no sense to me. The proof that it can have an incentive to allow itself to be switched off even if it isn’t uncertain is trivial.
Just create a utility function that assigns intrinsic reward to shutting itself off, or create a payoff matrix that punishes it really hard if it doesn’t turn itself off. In this context using this kind of technical language feels actively deceitful to me, since it’s really obvious that the argument he is making in that chapter cannot actually be a proof.
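(To make the kind of counterexample I have in mind concrete, here is a minimal sketch; the action names and payoff values are made up purely for illustration.)

```python
# Minimal sketch of the trivial counterexample: an agent with a designer-
# specified utility function that it is fully certain about, and which
# simply assigns intrinsic reward to shutting itself off. No uncertainty
# about any objective is involved, yet the shutdown incentive is positive.
utility = {
    "allow_shutdown": 1.0,      # intrinsic reward for being switched off
    "prevent_shutdown": -10.0,  # "punished really hard" for resisting
}

def best_action(utility):
    """Pick the action with the highest (perfectly known) utility."""
    return max(utility, key=utility.get)

assert best_action(utility) == "allow_shutdown"
print("Fully certain agent still prefers:", best_action(utility))
```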
In general, I… really don’t understand Stuart Russell’s thoughts on AI Alignment. The whole “uncertainty over utility functions” thing just doesn’t really help with solving any part of the AI Alignment problem that I care about, and I find myself really frustrated with the degree to which both this preface and Human Compatible repeatedly indicate that it somehow is a solution to the AI Alignment problem. Not just a helpful contribution: both repeatedly say things that read to me like “if you make the AI uncertain about the objective in the right way, then the AI Alignment problem is solved”, which seems obviously wrong to me, since it doesn’t even deal with inner alignment problems, and it also doesn’t solve really any major outer alignment problems (though that requires a bit more writing to argue).
My read of Russell’s position is that if we can successfully make the agent uncertain about its model of human preferences, then it will defer to the human when it might do something bad, which hopefully solves (or at least helps with) making it corrigible.
I do agree that this doesn’t seem to help with inner-alignment stuff, but I’m still trying to wrap my head around this area.
If it’s certain about the human objective, then it would be certain that it knows what’s best, so there would be no reason to let a human turn it off. (Unless humans have a basic preference to turn it off, in which case it could prefer to be shut off.)
We know of many ways to get shut-off incentives, including the indicator utility function on being shut down by humans (which theoretically exists), and the AUP penalty term, which strongly incentivizes accepting shutdown in certain situations—without even modeling the human. So, it’s not an if-and-only-if.
Sure, but the theorem he proves, in the setting where he proves it, probably is an if-and-only-if. (I have not read the new edition, so I’m not really sure.)
It also seems to me like Stuart Russell endorses the if-and-only-if result as what’s desirable? I’ve heard him say things like “you want the AI to prevent its own shutdown when it’s sufficiently sure that it’s for the best”.
Of course that’s not technically the full if-and-only-if (it needs to both be certain about utility and think preventing shutdown is for the best), but it suggests to me that he doesn’t think we should add more shutoff incentives such as AUP.
Keep in mind that I have fairly little interaction with him, and this is based off of only a few off-the-cuff comments during CHAI meetings.
My point here is just that it seems pretty plausible that he meant “if and only if”.
Sure. To clarify: I’m more saying “I think this statement is wrong, and I’m surprised he said this”. In fairness, I haven’t read the mentioned section yet either, but it is a very strong claim. Maybe it’s better phrased as “a CIRL agent has a positive incentive to allow shutdown iff it’s uncertain [or the human has a positive term for it being shut off]”, instead of “a machine” has a positive incentive iff.
It is an “iff” in §16.7.2 “Deference to Humans”, but the toy setting in which this is shown is pretty impoverished. It’s a story problem about a robot Robbie deciding whether to book an expensive hotel room for busy human Harriet, or whether to ask Harriet first.
Formally, let P(u) be Robbie’s prior probability density over Harriet’s utility for the proposed action a. Then the value of going ahead with a is
$$EU(a) = \int_{-\infty}^{\infty} P(u) \cdot u \, du = \int_{-\infty}^{0} P(u) \cdot u \, du + \int_{0}^{\infty} P(u) \cdot u \, du$$
(We will see shortly why the integral is split up this way.) On the other hand, the value of action d, deferring to Harriet, is composed of two parts: if u > 0 then Harriet lets Robbie go ahead, so the value is u, but if u < 0 then Harriet switches Robbie off, so the value is 0:
$$EU(d) = \int_{-\infty}^{0} P(u) \cdot 0 \, du + \int_{0}^{\infty} P(u) \cdot u \, du$$
Comparing the expressions for EU(a) and EU(d), we see immediately that
$$EU(d) \ge EU(a)$$
because the expression for EU(d) has the negative-utility region zeroed out. The two choices have equal value only when the negative region has zero probability—that is, when Robbie is already certain that Harriet likes the proposed action.
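(As a quick numerical illustration of the comparison above, and not anything from the book itself, here is a small check with a made-up Gaussian prior over u; the grid and the parameters are arbitrary.)

```python
# Numerical sanity check of the EU(a) vs. EU(d) comparison, assuming a
# Gaussian prior P(u) over Harriet's utility for the proposed action.
import numpy as np

def eu_act(us, p):
    """E[u] if Robbie goes ahead: integrate P(u)*u over all u."""
    return np.trapz(p * us, us)

def eu_defer(us, p):
    """E[value] if Robbie defers: value is u when u > 0 (Harriet approves), else 0."""
    return np.trapz(p * np.maximum(us, 0.0), us)

us = np.linspace(-10, 10, 20001)

# Wide prior (mean 1, std 2): Robbie is quite unsure whether Harriet approves.
p_wide = np.exp(-0.5 * ((us - 1) / 2) ** 2) / (2 * np.sqrt(2 * np.pi))
print("uncertain prior:   EU(a) =", eu_act(us, p_wide), " EU(d) =", eu_defer(us, p_wide))

# Narrow prior (mean 1, std 0.05): Robbie is nearly certain Harriet approves.
p_narrow = np.exp(-0.5 * ((us - 1) / 0.05) ** 2) / (0.05 * np.sqrt(2 * np.pi))
print("near-certain prior: EU(a) =", eu_act(us, p_narrow), " EU(d) =", eu_defer(us, p_narrow))
```

With the wide prior, EU(d) comes out strictly larger than EU(a); as the prior collapses toward certainty that u = 1, the two values converge, matching the equality condition above.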
(I think this is fine as a topic-introducing story problem, but agree that the sentence in Chapter 1 referencing it shouldn’t have been phrased to make it sound like it applies to machines-in-general.)
Maybe it’s better phrased as “a CIRL agent has a positive incentive to allow shutdown iff it’s uncertain [or the human has a positive term for it being shut off]”, instead of “a machine” has a positive incentive iff.
I would further charitably rewrite it as:
“In chapter 16, we analyze an incentive which a CIRL agent has to allow itself to be switched off. This incentive is positive if and only if it is uncertain about the human objective.”
A CIRL agent should be capable of believing that humans terminally value pressing buttons, in which case it might allow itself to be shut off despite being 100% sure about values. So it’s just the particular incentive examined that’s iff.
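(A tiny worked version of this, with made-up numbers: if the agent is 100% certain of a human utility function that assigns more value to the agent being switched off than to the proposed action, it accepts shutdown despite having no uncertainty at all.)

```python
# Made-up numbers: Robbie is 100% certain of Harriet's utility function,
# but that function places terminal value on Robbie being switched off
# when she presses the button. Accepting shutdown then maximizes the
# (perfectly known) human utility, with zero uncertainty involved.
known_human_utility = {
    "go_ahead_with_a": 0.5,   # Harriet mildly likes the proposed action
    "allow_shutdown": 1.0,    # Harriet terminally values pressing the button
}

chosen = max(known_human_utility, key=known_human_utility.get)
assert chosen == "allow_shutdown"
print("Certain-about-values agent still allows shutdown:", chosen)
```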
Surely he only meant if it is uncertain?