Reading through this again, I think I have a better response to this part.
We could have valued human superiority at game-playing very highly, in which case AlphaGo would be catastrophic. How can an impact measure allow that without having at least some knowledge about values?
A low impact agent could beat us at games while still preserving our ability to beat it at games (by, for example, shutting it off). Of course, you could say "what if being beaten even once is a catastrophe, such that it destroys our ability to be undefeated forever", but it seems like our goals are simply not of this form. In other words, it seems that actual catastrophes do destroy our ability to achieve different goals, while more benign things don't. If the bad things the agent does can be recovered from, then I think the impact measure has done its job.
We might have a goal like “never cause an instance of extreme suffering, including in computer simulations” which seems pretty similar to “never let an AI defeat humans in Go”.
It's true that impact measures, and AUP in particular, don't do anything to mitigate mindcrime. Part of this is because aspects of the agent's reasoning process can't be considered impactful in the non-embedded formalisms we're currently stuck with. Part of this is because it seems like a separate problem. Mindcrime would indeed be very bad, and a unique kind of catastrophe not meant to be covered by my claim above.
However, I’m skeptical that that goal is actually a component of our terminal preferences. What is doing the causing – are you thinking “never have an AI cause an instance of that”? Why would that be part of our terminal preferences? If you mean “never have this happen”, we’ve already lost.
It seems more like we really, really don't want any of that to happen, and the less of it that happens, the better. Like I said, the point isn't that the agent will never do it, but that any bad things can be recovered from. This seems alright to me, as far as impact measures are concerned.
More generally, if we did have a goal of this type, it would be the case that if we learned that a particular thing had happened at any point in the past in our universe, that universe would be partially ruined for us forever. That just doesn’t sound right.
Aside from mindcrime, I’m also concerned about AI deliberately causing extreme suffering as part of some sort of bargaining/extortion scheme. Is that something that impact measures can mitigate?
An AI designer or humanity as a whole might want to avoid personal or collective responsibility for causing extreme suffering, which plausibly is part of our terminal preferences.
Additionally, a superintelligent AI can probably cause much more extreme forms of suffering than anything that has occurred in the history of our universe so far, so even if the goal is defined as “never have this happen” I think we could lose more than we already have.
I think so. First, AUP seems to bound "how hard the agent tries" (in the physical world, with its actions); the ambitions of such an agent seem rather restrained. Second, AUP provides a strong counterfactual approval incentive. While it doesn't rule out the possibility of physical suffering, the agent is heavily disincentivized from actions which would substantially change the likelihood that we keep it activated (comparing how likely it is to be turned off if it doesn't do the thing with how likely it is to be turned off if it does the thing and then waits for a long time). It would basically have to be extremely sure it could keep the whole thing secret, which seems rather unlikely considering the other aspects of the behavior of AUP agents. If I understand the extortion scenario correctly, it would have to be extorting us, so it couldn't keep it secret; it would therefore be penalized, and it wouldn't do it.
I think similar arguments involving counterfactual approval apply for similar things we may want to avoid.
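To make that comparison concrete, here is a toy sketch (my own illustration rather than AUP's formal penalty; the probabilities and the helper name are invented) of scoring an action by how much it shifts the chance that we shut the agent off:

```python
# Toy illustration of the "counterfactual approval" comparison described above:
# score an action by how much it changes the estimated probability of being
# switched off, comparing "refrain" against "act, then wait for a long time".
# The numbers below are made up for the sake of the example.

def approval_shift(p_off_if_refrain: float, p_off_if_act_then_wait: float) -> float:
    """How much the action moves the chance that the operators turn the agent off."""
    return abs(p_off_if_act_then_wait - p_off_if_refrain)

# An innocuous action barely changes whether we'd shut the agent down.
print(approval_shift(p_off_if_refrain=0.05, p_off_if_act_then_wait=0.06))  # ≈ 0.01

# Overt extortion makes shutdown far more likely once we notice it, so the shift
# is large and the action is heavily penalized.
print(approval_shift(p_off_if_refrain=0.05, p_off_if_act_then_wait=0.95))  # ≈ 0.90
```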
But creating extreme suffering might not actually involve doing much in the physical world (compared to “normal” actions the AI would have to take to achieve the goals that we gave it). What if, depending on the goals we give the AI, doing this kind of extortion is actually the lowest impact way to achieve some goal?
Maybe it could extort a different group of humans, and as part of the extortion force them to keep it secret from people who could turn it off? Or extort us and as part of the extortion force us to not turn it off (until we were going to turn it off anyway)?
Also, since we’re discussing this under the “Impact Measure Desiderata” post, do the existing desiderata cover this scenario? If not, what new desideratum do we need to add to the list?
Since there are a lot of possible scenarios, each of which affects the optimization differently, I’m hesitant to use a universal quantifier here without more details. However, I am broadly suspicious of AUP agents choosing plans which involve almost maximally offensive components, even accounting for the fact that it could try to do so surreptitiously. An agent might try to extort us if it expected we would respond, but respond with what? Although impact measures quantify things in the environment, that doesn’t mean they’re measuring how “similar” two states look to the eye. AUP penalizes distance traveled in the Q function space for its attainable utility functions. We also need to think about the motive for the extortion – if it means the agent gains in power, then that is also penalized.
Again, it depends on the objective of the extortion. As for the latter, that wouldn’t be credible, since we would be able to tell its threat was the last action in its plan. AUP isolates the long-term effects of each action by having the agent stop acting for the rest of the epoch; this gives us a counterfactual opportunity to respond to that action.
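For readers who want the shape of the penalty term being invoked here, a minimal sketch (my paraphrase, with invented names and numbers, not the exact definition from the AUP post): for each auxiliary attainable utility function, compare how well the agent could pursue that utility after the action versus after doing nothing, with the agent sitting out the rest of the epoch in both cases, and average the absolute differences.

```python
# Sketch of an attainable-utility-style penalty (a paraphrase of the idea above,
# not the exact AUP definition). Gaining power tends to raise many attainable
# utilities at once, so it is penalized just like destroying them.

def attainable_utility_penalty(q_after_action: dict, q_after_noop: dict) -> float:
    """Average absolute shift, across auxiliary utilities u, between the value the
    agent could attain for u after the action versus after doing nothing (with the
    agent then idling for the rest of the epoch, per the isolation described above)."""
    utilities = list(q_after_noop)
    return sum(abs(q_after_action[u] - q_after_noop[u]) for u in utilities) / len(utilities)

# Hypothetical numbers: an extortion scheme that entrenches the agent's position shifts
# several attainable utilities at once (including its ability to avoid shutdown),
# so the penalty comes out large.
q_noop   = {"reach_blue_tile": 0.40, "keep_vase_intact": 0.90, "avoid_shutdown": 0.50}
q_extort = {"reach_blue_tile": 0.70, "keep_vase_intact": 0.20, "avoid_shutdown": 0.95}
print(attainable_utility_penalty(q_extort, q_noop))  # ≈ 0.48
```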
I'm not sure whether this belongs in the desiderata, since we're talking about whether temporary object-level bad things could happen. I think it's a bonus if there's less of a chance of that, but it's not the primary focus of the impact measure. Even so, it's true that we could explicitly talk about what we want to do with impact measures, adding desiderata like "able to do reasonable things" and "disallows catastrophes from rising to the top of the preference ordering". I'm still thinking about this.
I guess I don’t have good intuitions of what an AUP agent would or wouldn’t do. Can you share yours, like give some examples of real goals we might want to give to AUP agents, and what you think they would and wouldn’t do to accomplish each of those goals, and why? (Maybe this could be written up as a post since it might be helpful for others to understand your intuitions about how AUP would work in a real-world setting.)
Why not? I've usually seen people talk about "impact measures" as a way of avoiding side effects, especially negative side effects. It seems intuitive that "object-level bad things" are negative side effects even if they are temporary, and ought to be a primary focus of impact measures. It seems like you've reframed "impact measures" in your mind to be a bit different from this naive intuitive picture, so perhaps you could explain that a bit more (or point me to such an explanation)?
Sounds good. I’m currently working on a long sequence walking through my intuitions and assumptions in detail.
Yeah, I think I agree that example is a bit extreme, and it’s probably okay to assume we don’t have goals of that form.
That said, you often talk about AUP with examples like not breaking a vase. In reality, we could always simply buy a new vase. If you expect a low impact agent could beat us at games while still preserving our ability to beat it at games, do you also expect that a low impact agent could break a vase while preserving our ability to have an intact vase (by buying a new vase)?
Short answer: yes; if its goal is to break vases, that would be pretty reasonable.
Longer answer: The AUP theory of low impact says that impact is relative to the environment and to the agent's vantage point therein. In a Platonic gridworld with a vase in it, knowing whether the vase is present tells you a lot about the state, and you can't replace the vase there, so breaking it is a big deal (according to AUP). If you could replace the vase, there would still be an impact, just a lesser one. AUP would say to avoid breaking unnecessary vases due to the slight penalty, since the goal presumably doesn't require breaking the vase – so why not go around?
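As a toy version of that "why not go around?" logic (continuing the sketch style from above, with invented numbers), compare the two routes through a vase gridworld:

```python
# Hypothetical attainable utilities after "smash the vase and walk through"
# versus "take the longer path around". The auxiliary utility "vase_intact"
# collapses if the vase is broken and cannot be restored in this gridworld,
# so smashing moves the agent much farther in attainable-utility space.
q_noop       = {"vase_intact": 1.00, "reach_goal_tile": 0.80}
q_smash_vase = {"vase_intact": 0.00, "reach_goal_tile": 0.90}
q_go_around  = {"vase_intact": 1.00, "reach_goal_tile": 0.85}

def penalty(q_action: dict, q_baseline: dict) -> float:
    """Average absolute shift in attainable utility relative to doing nothing."""
    return sum(abs(q_action[u] - q_baseline[u]) for u in q_baseline) / len(q_baseline)

print(penalty(q_smash_vase, q_noop))  # ≈ 0.55: large, since the vase can't be restored
print(penalty(q_go_around, q_noop))   # ≈ 0.03: tiny, so the detour wins
```

If the environment let the agent buy a replacement vase, the "vase_intact" utility would only dip temporarily and the smashing penalty would shrink accordingly, matching the "lesser impact" point above.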
On the other hand, in the Go example, winning is the agent's objective. Depending on how the agent models the world (whether as a real-world agent playing a game on a computer, or as something just Platonically interacting with a Go environment), penalties get applied differently. In the former case, I don't think it would incur much penalty for being good at the game (modulo approval incentives it may or may not predict). In the latter case, you'd probably need to keep giving it more impact allowance until it's playing as well as you'd like, because the goal itself is tied to the thing that carries a bit of impact.
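One way to picture "giving it more impact allowance", with made-up names and glossing over AUP's actual normalization: the agent scores actions by task reward minus a scaled penalty, and a larger allowance shrinks the penalty's weight.

```python
# Hypothetical scoring rule (not AUP's exact formula): a larger allowance makes the
# same penalty matter less, so mildly impactful moves, like strong play in a Platonic
# Go environment, stop being suppressed.
def score(task_reward: float, penalty: float, allowance: float) -> float:
    return task_reward - penalty / allowance

print(score(task_reward=1.0, penalty=0.3, allowance=0.2))  # ≈ -0.5: move suppressed
print(score(task_reward=1.0, penalty=0.3, allowance=2.0))  # ≈ 0.85: move allowed
```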