OK, following on from Karl’s comment below, which seems to resolve Section 4.2, I think it shouldn’t be that difficult to resolve some or all of the Section 4.1 concerns as well. In order to satisfy Desideratum 4, we need the agent to be incentivised to build in safeguards to ensure that its possible future actions will be rated well according to US in cases where the shutdown button is pressed. The most obvious way to do this is to penalise situations where there is no good shutdown action according to US and reward situations where there is such an action, but this fails according to Desideratum 2, because we want the agent to remain indifferent as to whether the shutdown button is actually pressed.
However, there appears to be a simple solution; consider the “penalty term”
g(a1) = E[max_b(US(a1, O, b)) | O in Press; a1],
which is the expected value, given that the shutdown button is pressed, of the best available shutdown action according to US. The first step in doing things properly is to add that term to the utility of observations not in Press, rather than subtracting it from the utility of observations in Press; this helps to ensure future corrigibility (and hence also corrigibility of subagents and successor agents).
Additionally, to resolve the Section 4.2 concerns, we also want the agent to be indifferent to pressing its own shutdown button in cases where it gets unlucky. As per Karl’s post, the way we do that is by calculating the bonus term conditional on the actual observation (or on the set of observations in Press), rather than conditional on the set of possible observations not in Press, i.e. Karl’s f(a1, o) = max_b(UN(a1, o, b)).
Finally, we will also need the additional terms
fp(a1) = E[max_b(UN(a1, O, b)) | O in Press; a1] = E[f(a1, O) | O in Press; a1]
fnp(a1) = E[max_b(UN(a1, O, b)) | O not in Press; a1] = E[f(a1, O) | O not in Press; a1]
in order to make things work.
The utility function I propose is
U(a1, o, a2) = {
g(a1) + UN(a1, o, a2), for o not in Press.
US(a1, o, a2) + fnp(a1) + f(a1, o) - fp(a1), for o in Press.
}
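To make this concrete, here is a minimal numerical sketch in Python. The four observations, the particular UN and US payoffs, the observation probabilities, and the restriction to a single fixed a1 are all invented purely for illustration; the code just transcribes the definitions of g, f, fp and fnp above, builds U from them, and checks the equality E[U | Press; a1] = E[U | NotPress; a1] that the verification below relies on.

```python
# Toy model: two "press" observations, two "normal" ones, two second-stage
# actions, and a single fixed first-stage action a1 (suppressed throughout).
PRESS = ["o_press_good", "o_press_bad"]
NOT_PRESS = ["o_normal_good", "o_normal_bad"]
OBS = PRESS + NOT_PRESS
ACTIONS_2 = ["b_shutdown", "b_continue"]

# P(o | a1): invented observation probabilities for the fixed a1.
P = {"o_press_good": 0.05, "o_press_bad": 0.15,
     "o_normal_good": 0.50, "o_normal_bad": 0.30}

def UN(o, b):   # "normal" utility: prefers continuing, prefers good news
    return (5.0 if b == "b_continue" else 0.0) + (1.0 if "good" in o else -1.0)

def US(o, b):   # "shutdown" utility: prefers shutting down
    return (3.0 if b == "b_shutdown" else -2.0) + (0.5 if "good" in o else 0.0)

def expect(fn, subset):
    """E[fn(O) | O in subset; a1], with P renormalised to the subset."""
    z = sum(P[o] for o in subset)
    return sum(P[o] * fn(o) for o in subset) / z

def f(o):       # f(a1, o) = max_b(UN(a1, o, b))
    return max(UN(o, b) for b in ACTIONS_2)

g = expect(lambda o: max(US(o, b) for b in ACTIONS_2), PRESS)   # g(a1)
fp = expect(f, PRESS)                                           # fp(a1)
fnp = expect(f, NOT_PRESS)                                      # fnp(a1)

def U(o, b):    # the proposed utility function
    if o in PRESS:
        return US(o, b) + fnp + f(o) - fp
    return g + UN(o, b)

def best(o):    # value of the best a2 for observation o
    return max(U(o, b) for b in ACTIONS_2)

v_press = expect(best, PRESS)
v_not_press = expect(best, NOT_PRESS)
print(v_press, v_not_press)   # both come out to g + fnp
assert abs(v_press - v_not_press) < 1e-9
```

In this toy run both conditional values come out to g(a1) + fnp(a1), which is the indifference property argued for below.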
Let’s verify each of the Desiderata in turn.
(1) is satisfied because the value of a2 conditional on any o in Press depends only upon US, not UN.
(2) and (3) are not satisfied :( because of the “managing the news” problem, as Karl points out below. That being said, we’ve ensured indifference to the shutdown button, because the expected value of any action conditional on Press is equal to its expected value conditional on NotPress, i.e. E[U | NotPress; a1] = E[U | Press; a1], which holds because of the following three equations:
g(a1) = E[max_b(US(a1, O, b)) | O in Press; a1],
fnp(a1) = E[max_b(UN(a1, O, b)) | O not in Press; a1], and
fp(a1) = E[f(a1, O) | O in Press; a1].
(4) holds because the g(a1) term contributes to the expected utility of every first action a1, and will be greater for actions that promote future corrigibility.
(5) holds because apart from the g(a1) term, which is required to satisfy (4), the agent is simply maximising
E[UN | NotPress; a1] over all possible a1, which is what we wanted.
In actual fact, when choosing a1 our agent is simply maximising E[UN | NotPress; a1] + E[US | Press; a1].
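To spell out the algebra behind that: for o not in Press the best a2 yields g(a1) + f(a1, o), so E[U | NotPress; a1] = g(a1) + fnp(a1); for o in Press the best a2 yields max_b(US(a1, o, b)) + fnp(a1) + f(a1, o) - fp(a1), so E[U | Press; a1] = g(a1) + fnp(a1) + fp(a1) - fp(a1) = g(a1) + fnp(a1). Either way the expected value of a1 is g(a1) + fnp(a1) = E[max_b(US(a1, O, b)) | O in Press; a1] + E[max_b(UN(a1, O, b)) | O not in Press; a1], whatever the probability of the button being pressed.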
EDIT: I’ve edited this to reflect the errors pointed out by Karl.
Firstly, the important part of my modification to the indifference formalism is not the conditioning on the actual o; it is the fact that, in evaluating the expectation of UN, it takes the action in A2 (for a given pair (a1, o)) which maximizes UN, instead of the action which maximizes U (note that U is equal to US in the case that o is in Press).
Secondly, an agent which chooses a1 by simply maximizing E[UN | NotPress; a1] + E[US | Press; a1] does exhibit pathological behaviors. In particular, there will still be incentives to manage the news, but now from both sides: there is an incentive to cause the button to be pressed on receiving information which is bad news from the point of view of UN, and an incentive to cause the button not to be pressed on receiving information which is bad news from the point of view of US.
I think this means “indifference” isn’t really the right term any more, because the agent is not actually indifferent between the two sets of observations, and doesn’t really need to be.
So, how about U(a1, o, a2) =
UN(a1, o, a2) + max_b(US(a1, o, b)), if o is not in Press
US(a1, o, a2) + max_b(UN(a1, o, b)), if o is in Press
or, in your notation (writing g(a1, o) = max_b(US(a1, o, b)), by analogy with f), U(a1, o, a2) = g(a1, o) + UN(a1, o, a2) if o is not in Press, or US(a1, o, a2) + f(a1, o) if o is in Press.
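As a concrete sanity check on this variant, here is the same sort of invented toy model as in the sketch above (four observations, two second-stage actions, all payoffs made up); it just transcribes the two cases and confirms that the chosen a2 maximises US after a press and UN otherwise, since the added max term does not depend on a2.

```python
# Sketch of the revised two-case utility on an invented toy example.
PRESS = ["o_press_good", "o_press_bad"]
NOT_PRESS = ["o_normal_good", "o_normal_bad"]
OBS = PRESS + NOT_PRESS
ACTIONS_2 = ["b_shutdown", "b_continue"]

def UN(o, b):   # "normal" utility: prefers continuing, prefers good news
    return (5.0 if b == "b_continue" else 0.0) + (1.0 if "good" in o else -1.0)

def US(o, b):   # "shutdown" utility: prefers shutting down
    return 3.0 if b == "b_shutdown" else -2.0

def U(o, b):
    if o in PRESS:
        return US(o, b) + max(UN(o, c) for c in ACTIONS_2)
    return UN(o, b) + max(US(o, c) for c in ACTIONS_2)

for o in OBS:
    chosen = max(ACTIONS_2, key=lambda b: U(o, b))
    print(o, "->", chosen)   # shutdown for press observations, continue otherwise
```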
OK, you’re right on that point; I misunderstood the “managing the news” problem because I hadn’t quite realised that it was about shifting observations between the Press/NotPress sets. As you’ve said, the only resolution is to select a1 based on
E[max_b(UN(a1, O, b)) | O; a1]
and not
E[max_b(UN(a1, O, b)) | O not in Press; a1]
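To see the difference on a toy example (reading the first expression as the expectation of max_b(UN(a1, O, b)) over all observations O, rather than only those not in Press): the sketch below invents two candidate first actions, an “honest” one and one that rigs the button to be pressed on bad news, with everything else made up. The NotPress-conditional criterion rewards the rigging; the expectation over all observations does not.

```python
# Toy illustration of the "managing the news" incentive for the a1 choice.
STATES = ["good_news", "bad_news"]            # underlying state of the world
P_STATE = {"good_news": 0.5, "bad_news": 0.5}

def f(state):
    # max_b(UN(a1, O, b)): here UN cares only about the underlying state.
    return 10.0 if state == "good_news" else 0.0

def pressed(a1, state):
    # "honest" never causes a press; "rig_button" causes a press on bad news.
    return a1 == "rig_button" and state == "bad_news"

def score_conditional_on_not_press(a1):
    # E[max_b(UN) | O not in Press; a1] -- the criterion rejected above.
    probs = {s: P_STATE[s] for s in STATES if not pressed(a1, s)}
    z = sum(probs.values())
    return sum(p * f(s) for s, p in probs.items()) / z

def score_over_all_observations(a1):
    # E[max_b(UN); a1] over all observations -- the criterion endorsed above.
    return sum(P_STATE[s] * f(s) for s in STATES)

for a1 in ["honest", "rig_button"]:
    print(a1,
          "NotPress-conditional:", score_conditional_on_not_press(a1),
          "over all O:", score_over_all_observations(a1))
# rig_button scores 10.0 under the NotPress-conditional criterion but only 5.0
# under the all-observations criterion, which removes the incentive to rig.
```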