We know of many ways to get shut-off incentives, including the indicator utility function on being shut down by humans (which theoretically exists), and the AUP penalty term, which strongly incentivizes accepting shutdown in certain situations—without even modeling the human. So, it’s not an if-and-only-if.
Sure, but the theorem he proves in the setting where he proves it probably is if and only if. (I have not read the new edition, so, not really sure.)
It also seems to me like Stuart Russell endorses the if-and-only-if result as what’s desirable? I’ve heard him say things like “you want the AI to prevent its own shutdown when it’s sufficiently sure that it’s for the best”.
Of course that’s not technically the full if-and-only-if (it needs to both be certain about utility and think preventing shutdown is for the best), but it suggests to me that he doesn’t think we should add more shutoff incentives such as AUP.
Keep in mind that I have fairly little interaction with him, and this is based off of only a few off-the-cuff comments during CHAI meetings.
My point here is just that it seems pretty plausible that he meant “if and only if”.
Sure. To clarify: I’m more saying “I think this statement is wrong, and I’m surprised he said this”. In fairness, I haven’t read the mentioned section yet either, but it is a very strong claim. Maybe it’s better phrased as “a CIRL agent has a positive incentive to allow shutdown iff it’s uncertain [or the human has a positive term for it being shut off]”, instead of “a machine” has a positive incentive iff.
It is an “iff” in §16.7.2 “Deference to Humans”, but the toy setting in which this is shown is pretty impoverished. It’s a story problem about a robot Robbie deciding whether to book an expensive hotel room for busy human Harriet, or whether to ask Harriet first.
Formally, let P(u) be Robbie’s prior probability density over Harriet’s utility for the proposed action a. Then the value of going ahead with a is
$$EU(a) = \int_{-\infty}^{\infty} P(u) \cdot u \, du = \int_{-\infty}^{0} P(u) \cdot u \, du + \int_{0}^{\infty} P(u) \cdot u \, du$$
(We will see shortly why the integral is split up this way.) On the other hand, the value of action d, deferring to Harriet, is composed of two parts: if u > 0 then Harriet lets Robbie go ahead, so the value is u, but if u < 0 then Harriet switches Robbie off, so the value is 0:
$$EU(d) = \int_{-\infty}^{0} P(u) \cdot 0 \, du + \int_{0}^{\infty} P(u) \cdot u \, du$$
Comparing the expressions for EU(a) and EU(d), we see immediately that
$$EU(d) \ge EU(a)$$
because the expression for EU(d) has the negative-utility region zeroed out. The two choices have equal value only when the negative region has zero probability—that is, when Robbie is already certain that Harriet likes the proposed action.
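To make the comparison concrete, here is a minimal numerical sketch of the toy model (my own illustration, not from the book): it assumes a Gaussian prior for P(u) and approximates the two integrals on a grid. Deferring beats going ahead exactly to the extent that the prior puts mass on u < 0.

```python
import numpy as np
from scipy.stats import norm

# Grid over Harriet's possible utilities u, for approximating the integrals.
GRID = np.linspace(-50, 50, 200_001)

def eu_go_ahead(mean, std):
    """EU(a): Robbie acts without asking, so it collects u over the whole prior."""
    p = norm.pdf(GRID, loc=mean, scale=std)
    return np.trapz(p * GRID, GRID)

def eu_defer(mean, std):
    """EU(d): Harriet switches Robbie off whenever u < 0, zeroing out that region."""
    p = norm.pdf(GRID, loc=mean, scale=std)
    return np.trapz(p * np.maximum(GRID, 0.0), GRID)

# Uncertain prior with noticeable mass on u < 0: deferring is strictly better.
print(eu_go_ahead(1.0, 2.0), eu_defer(1.0, 2.0))   # ~1.00 vs ~1.40

# Nearly certain prior with negligible mass on u < 0: the two values coincide.
print(eu_go_ahead(1.0, 0.1), eu_defer(1.0, 0.1))   # both ~1.00
```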
(I think this is fine as a topic-introducing story problem, but agree that the sentence in Chapter 1 referencing it shouldn’t have been phrased to make it sound like it applies to machines-in-general.)
> Maybe it’s better phrased as “a CIRL agent has a positive incentive to allow shutdown iff it’s uncertain [or the human has a positive term for it being shut off]”, instead of “a machine” has a positive incentive iff.
I would further charitably rewrite it as:
“In chapter 16, we analyze an incentive which a CIRL agent has to allow itself to be switched off. This incentive is positive if and only if it is uncertain about the human objective.”
A CIRL agent should be capable of believing that humans terminally value pressing buttons, in which case it might allow itself to be shut off despite being 100% sure about values. So it’s just the particular incentive examined that’s iff.
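One way to make that concrete in the toy model above (my own extension, not something shown in the book): suppose being switched off is worth some b > 0 to Harriet, and Robbie is certain that her utility for the proposed action is u. Then Harriet switches Robbie off whenever u < b, so

$$EU(d) = \max(u, b), \qquad EU(a) = u,$$

and EU(d) ≥ EU(a) holds even under full certainty, strictly whenever u < b. So the shutdown incentive need not hinge on uncertainty once the human has a positive term for shutdown itself.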