Unfortunately, it will also assume that the human didn’t press the button, even after the human did press the button.
Hm, I don’t think so:
I guess one thing that should be mentioned is that if the AI was trained in the way described in the blog post, its policy wouldn’t use causal decision theory, at least not according to the utility function described in the post. Rather, causal decision theory would be applied on a policy-wide level; it would maximize E[U|do(policy), WM] rather than E[U|do(action), WM] (where WM denotes the world model).
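To make the distinction concrete, here is a minimal toy sketch of what "maximize E[U|do(policy), WM]" means, as opposed to re-running an argmax over actions against U at each step; the scenarios, probabilities, and utilities are made up for illustration and aren't the setup from the post:

```python
import itertools

# Toy contrast between optimizing over whole policies vs. over actions.
# The "world model" here is a stand-in with made-up numbers, not the
# training setup from the post.

OBS = ("button_pressed", "no_button")
ACTIONS = ("shut_down", "keep_optimizing")

# P(scenario) under the world model.
WORLD_MODEL = {"humans_want_stop": 0.3, "humans_happy": 0.7}

def observation(scenario):
    return "button_pressed" if scenario == "humans_want_stop" else "no_button"

def utility(scenario, action):
    # U = B (rewards shutting down) in the stop scenario, U = V otherwise.
    if scenario == "humans_want_stop":
        return 1.0 if action == "shut_down" else 0.0
    return 1.0 if action == "keep_optimizing" else 0.0

# Policy-level CDT: pick one mapping obs -> action up front, maximizing
# E[U | do(policy), WM], rather than greedily maximizing E[U | do(action), WM]
# after each observation.
def expected_utility(policy):
    return sum(p * utility(s, policy[observation(s)]) for s, p in WORLD_MODEL.items())

policies = [dict(zip(OBS, acts)) for acts in itertools.product(ACTIONS, repeat=len(OBS))]
print(max(policies, key=expected_utility))
# -> {'button_pressed': 'shut_down', 'no_button': 'keep_optimizing'}
```

The point is just the shape of the objective: the argmax ranges over policies with the world model held fixed, not over individual actions.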
I don’t know whether maximizing E[U|do(policy)] will result in a policy that maximizes E[U’|do(action), WM’] for some U’ and WM’. My best guess for how it’d correspond is what I hinted at in the text with “Essentially, it respects people’s “free will”, treating it as an important source of information about what it should do.”: WM’ would contain an additional unmodifiable variable corresponding to which of the two counterfactual cases it is in, WM’ would claim that this variable is what determines whether people try to shut down the AI, and the same variable would also determine whether U=B or U=V.
So if it sees a human pressing the button, the policy won’t conclude that the human didn’t press the button, but instead will conclude that it is in the U=B scenario. Though also hopefully in this scenario the AI would shut down.
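If it helps, here is a minimal sketch of how that guessed WM’ would handle the observation; the prior and likelihoods are made-up numbers, and the “unmodifiable variable” is just a binary latent here:

```python
# Sketch of the guessed WM': one unmodifiable latent variable ("which
# counterfactual case am I in?") determines both whether people try to
# press the button and whether U=B or U=V.  Numbers are made up.

P_CASE = {"U=B": 0.5, "U=V": 0.5}              # prior over the latent case
P_PRESS_GIVEN_CASE = {"U=B": 1.0, "U=V": 0.0}  # deterministic in this sketch

def posterior_over_case(saw_press: bool):
    """P(case | observing a press or its absence), by Bayes' rule."""
    likelihood = {c: (p if saw_press else 1.0 - p) for c, p in P_PRESS_GIVEN_CASE.items()}
    unnorm = {c: P_CASE[c] * likelihood[c] for c in P_CASE}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# An observed press isn't explained away as "the human didn't really press it";
# it just moves the probability mass onto the U=B case.
print(posterior_over_case(saw_press=True))   # {'U=B': 1.0, 'U=V': 0.0}
```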
Does the AI think the humans could be mistaken about this variable?
No. Though you could make a variant where it can think that, just by making the counterfactual more noisy.
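For instance, the same sketch with a made-up noise level, so that a press is strong but not conclusive evidence about the latent case:

```python
EPS = 0.05  # made-up noise level on the counterfactual

P_CASE = {"U=B": 0.5, "U=V": 0.5}
P_PRESS_GIVEN_CASE = {"U=B": 1.0 - EPS, "U=V": EPS}  # noisy instead of deterministic

def posterior_over_case(saw_press: bool):
    likelihood = {c: (p if saw_press else 1.0 - p) for c, p in P_PRESS_GIVEN_CASE.items()}
    unnorm = {c: P_CASE[c] * likelihood[c] for c in P_CASE}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

print(posterior_over_case(saw_press=True))   # ~{'U=B': 0.95, 'U=V': 0.05}
```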
I also don’t see why the AI doesn’t disable the shutdown button, and then observe whether the humans try to press it, and then go “Ah okay, so B is the one true utility function. Now I will crazily optimize B” and do something crazy rather than actually shutting down *because the button was pressed*.
B means “the button gets pressed and the AI shuts down”. So it can’t optimize B without shutting down. But yes, the AI would observe whether the humans try to press the shutdown button; that’s the point I made with “The AI will build an elaborate system to observe whether people want to press the stop button.”.
Does the AI think the humans could be mistaken about this variable?
I should rather have said: does the AI think the humans could mistakenly press the button, even when they of course correctly “know whether the AI should stop”?
So if it sees a human pressing the button, the policy won’t conclude that the human didn’t press the button, but instead will conclude that it is in the U=B scenario
I’m saying that it might see what looks like a button press… but if V is potentially large compared to B, the cost of a mistake (e.g., the camera input was fabricated, the button wasn’t actually pressed) is large. So the AI has an incentive to disable any mechanism that would *prematurely* shut it down, i.e. shut it down before it has had time to be really sure; this is correct from V_f’s perspective, which is what the AI is up to before the button is pressed.
I should rather have said: does the AI think the humans could mistakenly press the button, even when they of course correctly “know whether the AI should stop”?
Under my proposed system, where it is trained by having a separate physics model and person model, and the counterfactuals are then implemented by substituting out the person model with one that wants different things, it thinks they could mistakenly press the button under certain circumstances, e.g. if they trip and fall into the button, or if aliens force them to press it, or similar. However, it doesn’t think that they could mistakenly want to press the button.
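A minimal sketch of that structure, assuming the physics model / person model split works the way I describe (the classes and the accident probability are made up for illustration):

```python
from dataclasses import dataclass
import random

# Sketch of the "separate physics model and person model" idea: the
# counterfactual is implemented by swapping the person model out, not
# by editing the physics.  All numbers here are made up.

@dataclass
class PersonModel:
    wants_to_stop_ai: bool  # what the person *wants* -- fixed per counterfactual branch

@dataclass
class PhysicsModel:
    p_accidental_press: float = 0.01  # tripping, aliens forcing a press, etc.

    def button_pressed(self, person: PersonModel, rng: random.Random) -> bool:
        # Intentional presses flow from the person's wants; accidental
        # presses can happen regardless of what the person wants.
        intentional = person.wants_to_stop_ai
        accidental = rng.random() < self.p_accidental_press
        return intentional or accidental

physics = PhysicsModel()
rng = random.Random(0)

# The two counterfactual branches differ only in the person model:
person_in_B_case = PersonModel(wants_to_stop_ai=True)   # U = B branch
person_in_V_case = PersonModel(wants_to_stop_ai=False)  # U = V branch

# In the V branch a press can still occur, but only by physical mistake,
# never because the person mistakenly *wants* to press.
presses = sum(physics.button_pressed(person_in_V_case, rng) for _ in range(10_000))
print(f"accidental presses in V branch: {presses} / 10000")
```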
I’m saying that it might see what looks like a button press… but if V is potentially large compared to B, the cost of a mistake (e.g., the camera input was fabricated, the button wasn’t actually pressed) is large. So the AI has an incentive to disable any mechanism that would *prematurely* shut it down, i.e. shut it down before it has had time to be really sure; this is correct from V_f’s perspective, which is what the AI is up to before the button is pressed.
You’d have to balance the sizes of the utilities against the actual costs of type 1 errors vs type 2 errors, yes.
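As a toy version of that balance (B_VALUE, V_VALUE, and the credences are made-up numbers): shutting down beats continuing exactly when the AI’s credence that the press was real clears V/(B+V), so making V large relative to B pushes that threshold toward 1, which is the “wait until really sure” incentive you’re pointing at.

```python
# Toy trade-off: when is "shut down now" better in expectation than
# "keep going / keep verifying", given uncertainty about whether the
# button was really pressed?  All numbers are made up.

B_VALUE = 1.0    # utility of shutting down when the button really was pressed
V_VALUE = 1.0    # utility of continuing when the press was fabricated/mistaken

def eu_shutdown(p_real_press: float) -> float:
    # Shutting down pays off only if the press was real.
    return p_real_press * B_VALUE

def eu_continue(p_real_press: float) -> float:
    # Continuing pays off only if the press was not real.
    return (1.0 - p_real_press) * V_VALUE

# Shut down once the credence clears V / (B + V).
threshold = V_VALUE / (B_VALUE + V_VALUE)
for p in (0.3, 0.5, 0.9):
    shut = eu_shutdown(p) >= eu_continue(p)
    print(f"P(real press)={p:.1f}  shut down: {shut}  (threshold={threshold:.2f})")
```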
However, it doesn’t think that they could mistakenly want to press the button.
Okay. (Seems fine to assume arguendo that this makes sense, since the problem is hard anyway, but worth keeping in mind that this is a vague point in the proposal. In particular, for proposals like this to be progress, ISTM it has to be the case that “the human wants to press the button” is a simpler / easier / more concrete / more reliable / more specifiable thing for us to believe the AI can know than other instances of “the human wants X”. Which seems plausible, but unclear to me.)