I’m no longer sure the problem makes sense. Imagine an AI whose on-goal is to make money for you, and whose off-goal is to do nothing in particular. Imagine you turn it on, and it influences the government to pay a monthly stipend to people running money-making AIs, including you. By that action, is the AI making money for you in a legitimate way? Or is it bribing you to keep it running and avoid pressing the shutdown button? How do you even answer a question like that?
If we had great mechinterp, I’d answer the question by looking into the mind of the AI and seeing whether or not it considered the “this will reduce the probability of the shutdown button being pressed” possibility in its reasoning (or some similar thing), and if so, whether it considered it a pro, a con, or a neutral side-effect.
Then it seems to me that judging the agent’s purity of intentions is also a deep problem. At least for humans it is. For example, a revolutionary may only want to overthrow the unjust hierarchy, but then succeed and end up in power. So they didn’t consciously try to gain power; maybe evolution just gave them behaviors that happen to gain power, without “the desire for power” being explicitly encoded anywhere in the agent.
I think this is not so big of a problem, if we have the assumed level of mechinterp.
This assumes concepts like “shutdown button” are in the AI’s ontology. I’m not sure how much we understand about what ontology AIs are likely to end up with.
How would those questions apply to the “trammelling” example from part 2 of the post, where the AI keeps the overall probability of outcome B the same, but intentionally changes which worlds get outcome B in order to indirectly trade A1 outcomes for A2 outcomes?
Good point. I revise it to “if so, whether it considered it a pro, a con, an important thing to trammel, or none of the above.”
Come to think of it, why is trammelling so bad? If it keeps the probability of the button being pressed the same, why exactly do we care? Is it because our ability to influence the button is diminished?
That’s my understanding of why it’s bad, yes. The point of the button is that we want to be able to choose whether it gets pressed or not. If the AI presses it in a bunch of worlds where we don’t want it pressed, and stops it from being pressed in a bunch of worlds where we do want it pressed, both of those are bad. The fact that the AI is trading equal probability mass in both directions doesn’t make it any less bad from our perspective.
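To make that last point concrete, here’s a toy numeric sketch (my own illustration, not from the post; the four worlds and their probabilities are made up for the example). Two policies give the button the same overall probability of being pressed, but the trammelled one presses it in exactly the worlds where we don’t want it pressed:

```python
# Toy illustration of why trading "equal probability mass in both
# directions" can still be bad. Worlds and probabilities are hypothetical.

worlds = {
    # world: (prior probability, do WE want the button pressed here?)
    "w1": (0.25, True),
    "w2": (0.25, True),
    "w3": (0.25, False),
    "w4": (0.25, False),
}

# Baseline: the button gets pressed exactly in the worlds where we want it.
baseline = {"w1": True, "w2": True, "w3": False, "w4": False}

# A trammelling policy: the AI swaps w2 and w3, so the button is pressed
# in w3 (where we don't want it) and not in w2 (where we do). The total
# probability of a press is unchanged.
trammelled = {"w1": True, "w2": False, "w3": True, "w4": False}

def p_pressed(policy):
    # Total probability mass on worlds where the button gets pressed.
    return sum(p for w, (p, _) in worlds.items() if policy[w])

def p_mismatch(policy):
    # Probability mass on worlds where the press/no-press outcome
    # disagrees with what we want.
    return sum(p for w, (p, want) in worlds.items() if policy[w] != want)

for name, policy in [("baseline", baseline), ("trammelled", trammelled)]:
    print(f"{name}: P(pressed)={p_pressed(policy):.2f}, "
          f"P(outcome we don't want)={p_mismatch(policy):.2f}")
# baseline:   P(pressed)=0.50, P(outcome we don't want)=0.00
# trammelled: P(pressed)=0.50, P(outcome we don't want)=0.50
```

Same P(pressed) in both cases, but the trammelled policy puts half the probability mass on outcomes we actively don’t want.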