These seem like generally good desiderata, though I don’t know how to formalize them to the point that we can actually check with reasonable certainty whether a proposed impact measure meets these desiderata.
I have one additional desideratum for impact measures. The impact measure alone should disallow all extinction scenarios, while still allowing the AI system to do most of the things we use AI for today. This is rather weak; really, I’d want AI to do more tasks than are done today. However, even in this weak form, I doubt that we can satisfy this desideratum if we must also be agnostic to values, goals, representations and environments. We could have valued human superiority at game-playing very highly, in which case AlphaGo would be catastrophic. How can an impact measure allow that without having at least some knowledge about values?
Reading through this again, I think I have a better response to this part.
We could have valued human superiority at game-playing very highly, in which case AlphaGo would be catastrophic. How can an impact measure allow that without having at least some knowledge about values?
A low impact agent could beat us at games while still preserving our ability to beat it at games (by, for example, shutting it off). Of course, you could say “what if being beaten even once is a catastrophe, such that it destroys our ability to be undefeated forever”, but it seems like our goals are simply not of this form. In other words, it seems that actual catastrophes do destroy our ability to achieve different goals, while more benign things don’t. If the bad things the agent does can be recovered from, then I think the impact measure has done its job.
Of course, you could say “what if being beaten even once is a catastrophe, such that it destroys our ability to be undefeated forever”, but it seems like our goals are simply not of this form.
We might have a goal like “never cause an instance of extreme suffering, including in computer simulations” which seems pretty similar to “never let an AI defeat humans in Go”.
It’s true that impact measures, and AUP in particular, don’t do anything to mitigate mindcrime. Part of this is because aspects of the agent’s reasoning process can’t be considered impactful in the non-embedded formalisms we’re currently stuck with. Part of this is because it seems like a separate problem. Mindcrime would indeed be very bad, and a unique kind of catastrophe not meant to be covered by my claim above.
However, I’m skeptical that that goal is actually a component of our terminal preferences. What is doing the causing – are you thinking “never have an AI cause an instance of that”? Why would that be part of our terminal preferences? If you mean “never have this happen”, we’ve already lost.
It seems more like we really, really don’t want any of that to happen, and the less happens, the better. Like I said, the point isn’t that the agent will never do it, but that any bad things can be recovered from. This seems alright to me, as far as impact measures are concerned.
More generally, if we did have a goal of this type, it would be the case that if we learned that a particular thing had happened at any point in the past in our universe, that universe would be partially ruined for us forever. That just doesn’t sound right.
Mindcrime would indeed be very bad, and a unique kind of catastrophe not meant to be covered by my claim above.
Aside from mindcrime, I’m also concerned about AI deliberately causing extreme suffering as part of some sort of bargaining/extortion scheme. Is that something that impact measures can mitigate?
However, I’m skeptical that that goal is actually a component of our terminal preferences. What is doing the causing – are you thinking “never have an AI cause an instance of that”? Why would that be part of our terminal preferences?
An AI designer or humanity as a whole might want to avoid personal or collective responsibility for causing extreme suffering, which plausibly is part of our terminal preferences.
If you mean “never have this happen”, we’ve already lost.
Additionally, a superintelligent AI can probably cause much more extreme forms of suffering than anything that has occurred in the history of our universe so far, so even if the goal is defined as “never have this happen” I think we could lose more than we already have.
I think so. First, AUP seems to bound “how hard the agent tries” (in the physical world with its actions); the ambitions of such an agent seem rather restrained. Second, AUP provides a strong counterfactual approval incentive. While it doesn’t rule out the possibility of physical suffering, the agent is heavily disincentivized from actions which would substantially change the likelihood we keep it activated (comparing how likely it is to be turned off if it doesn’t do the thing, with the likelihood if it does the thing and then waits for a long time). It would basically have to be extremely sure it could keep it secret, which seems rather unlikely considering the other aspects of the behavior of AUP agents. If I understand the extortion scenario correctly, it would have to be extorting us, so it couldn’t keep it secret, so it would be penalized and it wouldn’t do it.
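The counterfactual approval incentive described here can be sketched as a penalty on how much an action shifts the probability that we shut the agent down, relative to inaction. This is a toy illustration only: the function name, the weighting, and the probabilities are made up, and this is not AUP’s actual formula.

```python
def penalized_return(goal_value, p_shutdown_noop, p_shutdown_act, weight=10.0):
    """Toy sketch of a counterfactual-approval-style penalty: charge the
    agent for changing how likely shutdown is, compared to doing nothing."""
    return goal_value - weight * abs(p_shutdown_act - p_shutdown_noop)

# An extortion-like action that raises the chance of shutdown from 1% to 95%
# is heavily penalized, even if it scores well on the agent's goal:
extortion = penalized_return(goal_value=1.0, p_shutdown_noop=0.01, p_shutdown_act=0.95)

# A benign action that barely changes the shutdown probability is not:
benign = penalized_return(goal_value=0.8, p_shutdown_noop=0.01, p_shutdown_act=0.02)

print(extortion < 0 < benign)  # True: the extortion plan loses
```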
I think similar arguments involving counterfactual approval apply for similar things we may want to avoid.
First, AUP seems to bound “how hard the agent tries” (in the physical world with its actions); the ambitions of such an agent seem rather restrained.
But creating extreme suffering might not actually involve doing much in the physical world (compared to “normal” actions the AI would have to take to achieve the goals that we gave it). What if, depending on the goals we give the AI, doing this kind of extortion is actually the lowest impact way to achieve some goal?
If I understand the extortion scenario correctly, it would have to be extorting us, so it couldn’t keep it secret, so it would be penalized and it wouldn’t do it.
Maybe it could extort a different group of humans, and as part of the extortion force them to keep it secret from people who could turn it off? Or extort us and as part of the extortion force us to not turn it off (until we were going to turn it off anyway)?
Also, since we’re discussing this under the “Impact Measure Desiderata” post, do the existing desiderata cover this scenario? If not, what new desideratum do we need to add to the list?
But creating extreme suffering might not actually involve doing much in the physical world (compared to “normal” actions the AI would have to take to achieve the goals that we gave it). What if, depending on the goals we give the AI, doing this kind of extortion is actually the lowest impact way to achieve some goal?
Since there are a lot of possible scenarios, each of which affects the optimization differently, I’m hesitant to use a universal quantifier here without more details. However, I am broadly suspicious of AUP agents choosing plans which involve almost maximally offensive components, even accounting for the fact that it could try to do so surreptitiously. An agent might try to extort us if it expected we would respond, but respond with what? Although impact measures quantify things in the environment, that doesn’t mean they’re measuring how “similar” two states look to the eye. AUP penalizes distance traveled in the Q function space for its attainable utility functions. We also need to think about the motive for the extortion – if it means the agent gains in power, then that is also penalized.
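The phrase “distance traveled in the Q function space” can be made a bit more concrete with a toy sketch. The function name and Q-tables below are my own invention for illustration; the real definition lives in the AUP write-up.

```python
def aup_penalty(aux_q_tables, state, action, noop="noop"):
    """Toy sketch: the penalty is the total change in attainable utility,
    i.e. how far the action moves the agent in Q-value space, summed over
    a set of auxiliary utility functions."""
    return sum(abs(q[(state, action)] - q[(state, noop)]) for q in aux_q_tables)

# Two auxiliary utilities; the action helps one and hurts the other,
# and both shifts count toward the penalty:
q1 = {("s", "a"): 0.9, ("s", "noop"): 0.5}
q2 = {("s", "a"): 0.2, ("s", "noop"): 0.4}
penalty = aup_penalty([q1, q2], "s", "a")  # about 0.4 + 0.2
```

Note that nothing here measures how “similar” two states look to the eye, which is the point being made above: the penalty tracks shifts in the agent’s ability to achieve the auxiliary goals, not surface resemblance.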
Maybe it could extort a different group of humans, and as part of the extortion force them to keep it secret from people who could turn it off? Or extort us and as part of the extortion force us to not turn it off (until we were going to turn it off anyway)?
Again, it depends on the objective of the extortion. As for the latter, that wouldn’t be credible, since we would be able to tell its threat was the last action in its plan. AUP isolates the long-term effects of each action by having the agent stop acting for the rest of the epoch; this gives us a counterfactual opportunity to respond to that action.
I’m not sure whether this belongs in the desiderata, since we’re talking about whether temporary object level bad things could happen. I think it’s a bonus to think that there is less of a chance of that, but not the primary focus of the impact measure. Even so, it’s true that we could explicitly talk about what we want to do with impact measures, adding desiderata like “able to do reasonable things” and “disallows catastrophes from rising to the top of the preference ordering”. I’m still thinking about this.
However, I am broadly suspicious of AUP agents choosing plans which involve almost maximally offensive components, even accounting for the fact that it could try to do so surreptitiously.
I guess I don’t have good intuitions of what an AUP agent would or wouldn’t do. Can you share yours, like give some examples of real goals we might want to give to AUP agents, and what you think they would and wouldn’t do to accomplish each of those goals, and why? (Maybe this could be written up as a post since it might be helpful for others to understand your intuitions about how AUP would work in a real-world setting.)
I’m not sure whether this belongs in the desiderata, since we’re talking about whether temporary object level bad things could happen. I think it’s a bonus to think that there is less of a chance of that, but not the primary focus of the impact measure.
Why not? I’ve usually seen people talk about “impact measures” as a way of avoiding side effects, especially negative side effects. It seems intuitive that “object level bad things” are negative side effects even if they are temporary, and ought to be a primary focus of impact measures. It seems like you’ve reframed “impact measures” in your mind to be a bit different from this naive intuitive picture, so perhaps you could explain that a bit more (or point me to such an explanation)?
Yeah, I think I agree that example is a bit extreme, and it’s probably okay to assume we don’t have goals of that form.
That said, you often talk about AUP with examples like not breaking a vase. In reality, we could always simply buy a new vase. If you expect a low impact agent could beat us at games while still preserving our ability to beat it at games, do you also expect that a low impact agent could break a vase while preserving our ability to have an intact vase (by buying a new vase)?
Short answer: yes; if its goal is to break vases, that would be pretty reasonable.
Longer answer: The AUP theory of low impact says that impact is relative to the environment and to the agent’s vantage point therein. In Platonic gridworlds like this:
knowing whether a vase is present tells you a lot about the state, and you can’t replace the vase here, so breaking it is a big deal (according to AUP). If you could replace the vase, there would still be a lesser impact. AUP would say to avoid breaking unnecessary vases due to the slight penalty, since the goal presumably doesn’t require breaking the vase – so why not go around?
On the other hand, in the Go example, winning is the agent’s objective. Depending on how the agent models the world (as a real-world agent playing a game on a computer, or whether it thinks it’s just Platonically interacting with a Go environment), penalties get applied differently. In the former case, I don’t think it would incur much penalty for being good at a game (modulo approval incentives it may or may not predict). In the latter case, you’d probably need to keep giving it more impact allowance until it’s playing as well as you’d like. This is because the goal is related to the thing which has a bit of impact.
I don’t know how to formalize them to the point that we can actually check with reasonable certainty whether a proposed impact measure meets these desiderata.
The -agnostics, the -sensitives, ‘apparently rational’, ‘shutdown-safe’, and ‘knowably-low impact’ should be obvious from the design. ‘Natural kind’ and ‘plausibly efficient’ are indeed subjective. ‘Corrigible’, ‘no offsetting’, ‘clinginess / scapegoating avoidance’, ‘dynamic consistency’, and ‘robust’ can all be proven (although presently, ‘robust’ can only be proven insofar as “this weird behavior won’t happen if a normal agent wouldn’t do it”, which plausibly seems sufficient for weird behaviors incentivized by the impact measure).
The impact measure alone should disallow all extinction scenarios, while still allowing the AI system to do most of the things we use AI for today.
(I’m going to assume you mean the weaker thing that doesn’t literally involve precluding every possible bad outcome)
I don’t see why an impact measure fulfilling the criteria I listed wouldn’t meet what I think you have in mind. Your example with Go is not value-agnostic, and arguably has minuscule objective impact on its own.
(I’m going to assume you mean the weaker thing that doesn’t literally involve precluding every possible bad outcome)
I’m confused. I think under the strongly superintelligent AI model (which seems to be the model you’re using), if there’s misalignment then the AI is strongly optimizing against any security precautions we’ve taken, so if we don’t preclude every possible bad outcome, the AI will find the one we missed. I grant that we’re probably not going to be able to prove that it precludes every possible bad outcome, if that’s what you’re worried about, but that still should be our desideratum. I’m also happy to consider other threat models besides strongly superintelligent AI, but that doesn’t seem to be what you’re considering.
Your example with Go is not value-agnostic, and arguably has miniscule objective impact on its own.
That’s my point. It could have been the case that we cared about AIs not beating us at Go, and if building AlphaGo does have minuscule objective impact, then that would have happened, causing a catastrophe. In that world, I wouldn’t be surprised if we had arguments about why such a thing was clearly a high-impact action. (Another way of putting this is that I think either “impact” is a value-laden concept, or “impact” will fail to prevent some catastrophe, or “impact” prevents the AI from doing anything useful.)
I don’t see why an impact measure fulfilling the criteria I listed wouldn’t meet what I think you have in mind.
Suppose your utility function has a maximum value of 1, and the inaction policy always gets utility 0. Consider the impact penalty that always assigns a penalty of 2, except for the inaction policy where the penalty is 0. The agent will provably follow the inaction policy. This impact penalty satisfies all of the desiderata, except “natural kind”. If you want to make it continuous to satisfy goal-agnosticism, then make the impact penalty 2 + <insert favorite impact penalty here>. Arguably it doesn’t satisfy “scope-sensitivity” and “irreversibility-sensitivity”. I’m counting those as satisfied because this penalty will never allow the agent to take a higher-impact action, or a more-irreversible action, which I think was the point of those desiderata.
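As a sanity check, the degenerate penalty described here can be written out in a few lines. This is a toy setup, not any proposed measure: utility is bounded by 1, inaction gets utility 0 and penalty 0, and everything else gets penalty 2.

```python
# Toy version of the degenerate impact measure described above.
utilities = {"inaction": 0.0, "act": 1.0}  # utility bounded by 1

def penalty(policy):
    # Penalty of 2 for everything except the inaction policy.
    return 0.0 if policy == "inaction" else 2.0

def penalized_value(policy):
    return utilities[policy] - penalty(policy)

# Since utility - penalty <= 1 - 2 < 0 for every non-inaction policy,
# the agent provably prefers inaction:
best = max(utilities, key=penalized_value)
print(best)  # inaction
```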
This is a bad impact measure, because it makes the AI unable to do anything. We should probably have a desideratum that outlaws this, and it should probably be of the form “Our AI is able to do things”, and that’s what I was trying to get at above. (And I do think that AUP might have this problem.)
I’m confused. I think under the strongly superintelligent AI model (which seems to be the model you’re using), if there’s misalignment then the AI is strongly optimizing against any security precautions we’ve taken, so if we don’t preclude every possible bad outcome, the AI will find the one we missed.
We’re on the same page basically; I thought you meant literally proving that activating the agent won’t cause a vacuum collapse.
In that world, I wouldn’t be surprised if we had arguments about why such a thing was clearly a high-impact action.
This criticism of impact measures doesn’t seem falsifiable? Or maybe I misunderstand.
Another way of putting this is that I think either “impact” is a value-laden concept, or “impact” will fail to prevent some catastrophe, or “impact” prevents the AI from doing anything useful.
If we want to argue this, we should first come up with a terrible x-scenario which is low objective impact. I have yet to see one, although they could exist. The evidence so far points towards “objective impact is sufficient”.
This is a bad impact measure
And it doesn’t satisfy the desiderata, as you note?
And I do think that AUP might have this problem.
People have yet to point out a goal AUP cannot maximize in a low-impact way. Instead, certain methods of reaching certain goals are disallowed. These are distinct flaws, with the latter only turning into the former (as I understand it) if no such method exists for any given goal.
If we want to argue this, we should first come up with a terrible x-scenario which is low objective impact. I have yet to see one, although they could exist. The evidence so far points towards “objective impact is sufficient”.
We’d like to build AI systems that help us resolve the tricky situation that we’re in. That help design and enforce agreements to avoid technological risks, build better-aligned AI, negotiate with other actors, predict and manage the impacts of AI, improve our institutions and policy, etc.
I think the default “terrible” scenario is one where increasingly powerful AI makes the world change faster and faster, and makes our situation more and more complex, with humans having less and less of a handle on what is going on or how to steer it in a positive direction. Where we must rely on AI to get anywhere at all, and thereby give up the ability to choose where we are going.
That may ultimately culminate with a catastrophic bang, but if it does it’s not going to be because we wanted the AI to have a small impact and it had a large impact. It’s probably going to be because we have a very limited idea what is going on, but we don’t feel like we have the breathing room to step back and chill out (at least not for long) because we don’t believe that everyone else is going to give us time.
If I’m trying to build an AI to help us navigate an increasingly complex and rapidly-changing world, what does “low impact” mean? In what sense do the terrible situations involve higher objective impact than the intended behaviors?
(And realistically I doubt we’ll fail at alignment with a bang—it’s more likely that the world will just drift off the rails over the course of a few months or years. The intuition that we wouldn’t let things go off the rails gradually seems like the same kind of wishful thinking that predicts war or slow-rolling environmental disasters should never happen.)
It seems like “low objective impact” is what we need once we are in the unstable situation where we have the technology to build an AI that would quickly and radically transform the world, but we have all decided not to and so are primarily concerned about radically transforming the world by accident. I think that’s a coherent situation to think about and plan for, but we shouldn’t mistake it for the mainline. (I personally think it is quite unlikely, and it would definitely be unprecedented, though you could still think it’s the best hope if you were very pessimistic about what I consider “mainline” alignment.)
If I’m trying to build an AI to help us navigate an increasingly complex and rapidly-changing world, what does “low impact” mean? In what sense do the terrible situations involve higher objective impact than the intended behaviors?
Solving low impact seems like it would allow us to ensure that each low impact agent won’t push the world in a given direction by more than some bounded, (presumably) small amount. If we’re thinking of my new measure in particular, it would also help ensure that we won’t be surprised by the capability gain of any single agent, which might help even if we aren’t expecting the spontaneous arrival of a singleton. A good formulation of low impact would have the property that the interactions of multiple such agents don’t amount to more than the sum of the constituent impact budgets. In this sense, I think it’s sensible to see measuring and restricting objective impact (implicitly thinking of my approach here) as helpful for slowing down the situation.
I also think that, depending on the specific formulation, a low impact solution would enable a substantial reduction in the problems which we need to solve ourselves. That is, I think solving low impact might make useful technical oracles possible. It might be the case that we only need a portion of the agent foundations agenda + low impact in order to build these oracles, which we could then use to help us solve value alignment/corrigibility/etc.
I am also aware that using these oracles would not (naively) be low impact; I plan to outline how we could maybe get around this in a robust manner as soon as I am able.
If we want to argue this, we should first come up with a terrible x-scenario which is low objective impact.
Here is a scenario that I think low impact methods can’t address. Let’s say the AI has the opportunity to easily (without affecting its ability to do its task) create a “free” copy or a sub-agent on the Internet. That sub-agent can be tasked with pursuing the goal of restricting whatever measure of impact the main agent operates under. The sub-agent very carefully eats the universe, while ensuring that according to the impact measure there is no effect of its activity. It has no other purpose. Any resources it needs, it can gather where they would’ve been wasted without its intervention (like reaching distant stars before other agents and capturing some of their output that wouldn’t otherwise be captured).
This functions like “neutral magic” in the environment: everything becomes weird, with agents lurking in the shadows, but anything that was accounted for in the impact measure is no different than it would be without this magic. Of course, this is very bad for Goodhart reasons, since no impact measure is perfect. (This is like the example with an explosion in the desert, but everywhere, and harder to notice exactly where it would become noticeable.)
A general technique to avoid this might be a way of always keeping the agent busy, computing hashes or something, even when it’s done with its task (maybe as a lexicographically lower goal), so that it doesn’t do something like that because it can. This also looks like an impact measure.
(This is related to how impact measure is close to being a goal (transformation of goals), so the failure modes of misalignment apply to it as well. I think there is a useful distinction between goals and corrigibility, which might be reproduced for something like low impact, as a third kind of thing that an agent might pursue, which is neither a goal nor a corrigibility.)
The sub-agent in this scenario won’t be helping the main agent with achieving any goals. It only hides. Its nature precludes any usefulness. It’s no more useful than its absence. But for the main agent, it’s as easy to set up as its absence. And there might be reasons for this balance to be broken in favor of creating the sub-agent.
That isn’t how AUP works. Unleashing this agent would change resource availability and increase or decrease the power of an arbitrary maximizer from that vantage point.
Suppose that an arbitrary maximizer could not co-opt this new agent: its ability to achieve goals is decreased compared to if it hadn’t activated the agent. On the other hand, if it can co-opt it, its ability is increased. This is not to mention the large amount of resources that would be used by such an expansive sub-agent, nor the fact that intent verification seemingly would not allow such a sub-agent to be built.
I discuss this kind of thing in several places in the comments, if you’re interested.
Unleashing this agent would change resource availability and increase or decrease the power of an arbitrary maximizer from that vantage point.
It’ll replenish the resources it takes, help any maximizer it impedes so as to exactly cancel out the impediment etc.
Suppose that an arbitrary maximizer could not co-opt this new agent - its ability to achieve goals is decreased compared to if it hadn’t activated the agent.
To the extent its existence could pose a problem for another agent (according to the measure, which can’t really talk about goals of agents-in-general), it’ll surrender its resources without a fight or help with the other agent’s goals. It will destroy any surplus in its resources before it can be captured. A change in variance resulting from unpredictability of such outcomes can be similarly targeted.
In general, these seem like the details that neutral magic can work out as their possibility occurs to it, so the difficulty is shifted from the main agent to the trusted ability of its sub-agent to do the rest. Worst case scenario, it can self-destruct early, but there’s probably enough leeway to avoid that outcome.
It’s Rice’s theorem, though really more about conceptual ambiguity. We can talk about particular notions of agents or goals, but it’s never fully general, unless we by construction ensure that unexpected things can’t occur. And even then it’s not what we would have wanted the notions of agents or goals to be, because it’s not clear what that is.
Intent verification doesn’t seem to capture things that smuggle in a tiny bit of helpfulness when these things are actually required to deliver that helpfulness, especially after other routes to improving the outcome have been exhausted (this is what the paragraph about hashes in the first comment was about). So the neutral magic could be helpful a tiny, bounded amount. This is one of the ways the balance between doing nothing and releasing a sub-agent could be broken.
Intent verification doesn’t seem to capture things that smuggle in a tiny bit of helpfulness when these things are actually required to deliver that helpfulness
Are you talking about granular actions, or coarse actions? The acceptable way to do IV for the latter is an open question, although I point out preliminary approaches.
I was talking about what I understand the purpose/design of intent verification to be, not specifically the formalizations you described. (I don’t think it’s particularly useful to work out the details without a general plan or expectation of important technical surprises.)
If you decompose the creation of such an agent, some of those actions are wasted effort in the eyes of a pure u_A maximizer (“don’t help me too much”). So, the logic goes, they really aren’t related to u_A, but rather to skirting the impact measure, and should therefore be penalized.
It could as easily be “do this one slightly helpful thing”, an addition on top of doing nothing. It doesn’t seem like there is an essential distinction between such different framings of the same outcome that intent verification can capture.
Whether these granular actions exist is also an open question I listed.
I don’t see why some version of IV won’t be able to get past this, however. There seems to be a simple class of things the agent does to get around an impact measure that it wouldn’t do if it were just trying to pursue a goal to the maximum extent. It might be true that the things the agent does to get around it are also slightly helpful for the goal, but probably not as helpful as the most helpful action.
I worry there might be leaks in logical time that let the agent choose an action that takes into account that an impactful action will be denied. For example, a sub-agent could be built so that it’s a maximizer that’s not constrained by an impact measure. The sub-agent then notices that to maximize its goal, it must constrain its impact, or else the main agent won’t be allowed to create it. And so it will so constrain its impact and will be allowed to be created, as a low-impact and maximally useful action of the main agent. It’s sort of a daemon, but with respect to impact measure and not goals, which additionally does respect the impact measure and only circumvents it once in order to get created.
That’s a really interesting point. I’d like to think about this more, but one preliminary intuition I have against this (and any general successor creation by AUP, really) being the best action is that making new agents aligned with your goals is instrumentally convergent. This could add a frictional cost so that the AUP agent would be better off just doing the job itself. Perhaps we could also stop this via approval incentives, which might tip the scales enough?
These seem like generally good desiderata, though I don’t know how to formalize them to the point that we can actually check with reasonable certainty whether a proposed impact measure meets these desiderata.
I have one additional desideratum from impact measures. The impact measure alone should disallow all extinction scenarios, while still allowing the AI system to do most of the things we use AI for today. This is rather weak, really I’d want AI do more tasks than are done today. However, even in this weak form, I doubt that we can satisfy this desideratum if we must also be agnostic to values, goals, representations and environments. We could have valued human superiority at game-playing very highly, in which case AlphaGo would be catastrophic. How can an impact measure allow that without having at least some knowledge about values?
Reading through this again, I think I have a better response to this part.
A low impact agent could beat us at games while still preserving our ability to beat it at games (by, for example, shutting it off). Of course, you could say “what if being beaten even once is a catastrophe, such that it destroys our ability to be undefeated forever”, but it seems like our goals are simply not of this form. In other words, it seems that actual catastrophes do destroy our ability to achieve different goals, while more benign things don’t. If the bad things the agent does can be recovered from, then I think the impact measure has done its job.
We might have a goal like “never cause an instance of extreme suffering, including in computer simulations” which seems pretty similar to “never let an AI defeat humans in Go”.
it’s true that impact measures, and AUP in particular, don’t do anything to mitigate mindcrime. Part of this is because aspects of the agent’s reasoning process can’t be considered impactful in the non-embedded formalisms we’re currently stuck with. Part of this is because it seems like a separate problem. Mindcrime would indeed be very bad, and a unique kind of catastrophe not meant to be covered by my claim above.
However, I’m skeptical that that goal is actually a component of our terminal preferences. What is doing the causing – are you thinking “never have an AI cause an instance of that”? Why would that be part of our terminal preferences? If you mean “never have this happen”, we’ve already lost.
It seems more like we really, really don’t want any of that to happen, and the less happens, the better. Like I said, the point isn’t that the agent will never do it, but that any bad things can be recovered from. This seems alright to me, as far as impact measures are concerned.
More generally, if we did have a goal of this type, it would be the case that if we learned that a particular thing had happened at any point in the past in our universe, that universe would be partially ruined for us forever. That just doesn’t sound right.
Aside from mindcrime, I’m also concerned about AI deliberately causing extreme suffering as part of some sort of bargaining/extortion scheme. Is that something that impact measures can mitigate?
An AI designer or humanity as a whole might want to avoid personal or collective responsibility for causing extreme suffering, which plausibly is part of our terminal preferences.
Additionally, a superintelligent AI can probably cause much more extreme forms of suffering than anything that has occurred in the history of our universe so far, so even if the goal is defined as “never have this happen” I think we could lose more than we already have.
I think so. First, AUP seems to bound “how hard the agent tries” (in the physical world with its actions); the ambitions of such an agent seem rather restrained. Second, AUP provides a strong counterfactual approval incentive. While it doesn’t rule out the possibility of physical suffering, the agent is heavily disincentivized from actions which would substantially change the likelihood we keep it activated (comparing how likely it is to be turned off if it doesn’t do the thing, with the likelihood if it does the thing and then waits for a long time). It would basically have to be extremely sure it could keep it secret, which seems rather unlikely considering the other aspects of the behavior of AUP agents. If I understand the extortion scenario correctly, it would have to be extorting us, so it couldn’t keep it secret, so it would be penalized and it wouldn’t do it.
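The counterfactual approval comparison described here can be sketched as a toy check. This is only an illustration of the shape of the incentive; `approval_penalty`, the probabilities, and the scale are hypothetical stand-ins, not part of any published AUP formalism:

```python
def approval_penalty(p_shutoff_if_inaction, p_shutoff_if_action, scale=1.0):
    """Toy model of the counterfactual approval incentive: penalize
    actions in proportion to how much they shift the likelihood that
    the overseers shut the agent off, relative to doing nothing."""
    return scale * abs(p_shutoff_if_action - p_shutoff_if_inaction)

# An action the overseers would clearly disapprove of (e.g. visible
# extortion) sharply raises the chance of shutoff, so it is penalized
# far more heavily than an innocuous action.
innocuous = approval_penalty(0.01, 0.02)
extortion = approval_penalty(0.01, 0.95)
assert extortion > innocuous
```

The point of the sketch is just that a visible, disapproved action cannot avoid the penalty: extorting us necessarily changes our shutoff behavior.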
I think similar arguments involving counterfactual approval apply for similar things we may want to avoid.
But creating extreme suffering might not actually involve doing much in the physical world (compared to “normal” actions the AI would have to take to achieve the goals that we gave it). What if, depending on the goals we give the AI, doing this kind of extortion is actually the lowest impact way to achieve some goal?
Maybe it could extort a different group of humans, and as part of the extortion force them to keep it secret from people who could turn it off? Or extort us and as part of the extortion force us to not turn it off (until we were going to turn it off anyway)?
Also, since we’re discussing this under the “Impact Measure Desiderata” post, do the existing desiderata cover this scenario? If not, what new desideratum do we need to add to the list?
Since there are a lot of possible scenarios, each of which affects the optimization differently, I’m hesitant to use a universal quantifier here without more details. However, I am broadly suspicious of AUP agents choosing plans which involve almost maximally offensive components, even accounting for the fact that it could try to do so surreptitiously. An agent might try to extort us if it expected we would respond, but respond with what? Although impact measures quantify things in the environment, that doesn’t mean they’re measuring how “similar” two states look to the eye. AUP penalizes distance traveled in the Q function space for its attainable utility functions. We also need to think about the motive for the extortion – if it means the agent gains in power, then that is also penalized.
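The claim that AUP penalizes “distance traveled in the Q function space,” rather than surface similarity of states, can be illustrated with a minimal sketch. The auxiliary goals and Q-values below are made-up numbers; the real formulation compares attainable-utility Q-values for an action against those for inaction:

```python
def aup_penalty(q_values_action, q_values_inaction):
    """Toy version of the AUP penalty: the total change in the agent's
    ability to achieve a set of auxiliary goals, measured as summed
    absolute differences in attainable-utility Q-values between taking
    the action and doing nothing."""
    return sum(abs(qa - qn) for qa, qn in zip(q_values_action, q_values_inaction))

# Hypothetical attainable-utility values for three auxiliary goals.
q_noop   = [0.5, 0.2, 0.9]
q_gentle = [0.5, 0.2, 0.8]   # barely shifts the agent's abilities
q_seize  = [0.9, 0.9, 0.1]   # a power-gaining action shifts them a lot
assert aup_penalty(q_seize, q_noop) > aup_penalty(q_gentle, q_noop)
```

This is why a gain in power is penalized even if the resulting world “looks” similar: what matters is how far the agent’s attainable utilities moved.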
Again, it depends on the objective of the extortion. As for the latter, that wouldn’t be credible, since we would be able to tell its threat was the last action in its plan. AUP isolates the long-term effects of each action by having the agent stop acting for the rest of the epoch; this gives us a counterfactual opportunity to respond to that action.
I’m not sure whether this belongs in the desiderata, since we’re talking about whether temporary object level bad things could happen. I think it’s a bonus to think that there is less of a chance of that, but not the primary focus of the impact measure. Even so, it’s true that we could explicitly talk about what we want to do with impact measures, adding desiderata like “able to do reasonable things” and “disallows catastrophes from rising to the top of the preference ordering”. I’m still thinking about this.
I guess I don’t have good intuitions of what an AUP agent would or wouldn’t do. Can you share yours, like give some examples of real goals we might want to give to AUP agents, and what you think they would and wouldn’t do to accomplish each of those goals, and why? (Maybe this could be written up as a post since it might be helpful for others to understand your intuitions about how AUP would work in a real-world setting.)
Why not? I’ve usually seen people talk about “impact measures” as a way of avoiding side effects, especially negative side effects. It seems intuitive that “object level bad things” are negative side effects even if they are temporary, and ought to be a primary focus of impact measures. It seems like you’ve reframed “impact measures” in your mind to be a bit different from this naive intuitive picture, so perhaps you could explain that a bit more (or point me to such an explanation)?
Sounds good. I’m currently working on a long sequence walking through my intuitions and assumptions in detail.
Yeah, I think I agree that example is a bit extreme, and it’s probably okay to assume we don’t have goals of that form.
That said, you often talk about AUP with examples like not breaking a vase. In reality, we could always simply buy a new vase. If you expect a low impact agent could beat us at games while still preserving our ability to beat it at games, do you also expect that a low impact agent could break a vase while preserving our ability to have an intact vase (by buying a new vase)?
Short answer: yes; if its goal is to break vases, that would be pretty reasonable.
Longer answer: The AUP theory of low impact says that impact is relative to the environment and to the agent’s vantage point therein. In a Platonic gridworld, knowing whether a vase is present tells you a lot about the state, and you can’t replace the vase there, so breaking it is a big deal (according to AUP). If you could replace the vase, there would still be a lesser impact. AUP would say to avoid breaking unnecessary vases due to the slight penalty, since the goal presumably doesn’t require breaking the vase – so why not go around?
On the other hand, in the Go example, winning is the agent’s objective. Depending on how the agent models the world (as a real-world agent playing a game on a computer, or whether it thinks it’s just Platonically interacting with a Go environment), penalties get applied differently. In the former case, I don’t think it would incur much penalty for being good at a game (modulo approval incentives it may or may not predict). In the latter case, you’d probably need to keep giving it more impact allowance until it’s playing as well as you’d like. This is because the goal is related to the thing which has a bit of impact.
The -agnostics, the -sensitives, ‘apparently rational’, ‘shutdown-safe’, and ‘knowably-low impact’ should be obvious from the design. ‘Natural kind’ and ‘plausibly efficient’ are indeed subjective. ‘Corrigible’, ‘no offsetting’, ‘clinginess / scapegoating avoidance’, ‘dynamic consistency’, and ‘robust’ can all be proven (although presently, ‘robust’ can only be proven insofar as “this weird behavior won’t happen if a normal agent wouldn’t do it”, which plausibly seems sufficient for weird behaviors incentivized by the impact measure).
(I’m going to assume you mean the weaker thing that doesn’t literally involve precluding every possible bad outcome)
I don’t see why an impact measure fulfilling the criteria I listed wouldn’t meet what I think you have in mind. Your example with Go is not value-agnostic, and arguably has minuscule objective impact on its own.
I’m confused. I think under the strongly superintelligent AI model (which seems to be the model you’re using), if there’s misalignment then the AI is strongly optimizing against any security precautions we’ve taken, so if we don’t preclude every possible bad outcome, the AI will find the one we missed. I grant that we’re probably not going to be able to prove that it precludes every possible bad outcome, if that’s what you’re worried about, but that still should be our desideratum. I’m also happy to consider other threat models besides strongly superintelligent AI, but that doesn’t seem to be what you’re considering.
That’s my point. It could have been the case that we cared about AIs not beating us at Go, and if building AlphaGo does have minuscule objective impact, then that would have happened, causing a catastrophe. In that world, I wouldn’t be surprised if we had arguments about why such a thing was clearly a high-impact action. (Another way of putting this is that I think either “impact” is a value-laden concept, or “impact” will fail to prevent some catastrophe, or “impact” prevents the AI from doing anything useful.)
Suppose your utility function has a maximum value of 1, and the inaction policy always gets utility 0. Consider the impact penalty that always assigns a penalty of 2, except for the inaction policy where the penalty is 0. The agent will provably follow the inaction policy. This impact penalty satisfies all of the desiderata, except “natural kind”. If you want it to be continuous so that it satisfies goal-agnosticism, make the impact penalty 2 + <insert favorite impact penalty here>. Arguably it doesn’t satisfy “scope-sensitivity” and “irreversibility-sensitivity”. I’m counting those as satisfied because this penalty will never allow the agent to take a higher-impact action, or a more-irreversible action, which I think was the point of those desiderata.
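The construction above can be made concrete. This is a deliberately degenerate penalty, and the policies and utility function are placeholders: with utility bounded by 1 and a penalty of 2 on every non-inaction policy, inaction strictly dominates.

```python
def score(policy, utility):
    """Utility minus the degenerate impact penalty described above:
    penalty 0 for the inaction policy, penalty 2 (exceeding the
    maximum possible utility of 1) for anything else."""
    penalty = 0 if policy == "inaction" else 2
    return utility(policy) - penalty

# Any utility function bounded by 1 with inaction worth 0 ...
utility = lambda policy: 0.0 if policy == "inaction" else 1.0
# ... makes inaction the unique optimum: 0 - 0 = 0 beats 1 - 2 = -1.
best = max(["inaction", "act"], key=lambda p: score(p, utility))
assert best == "inaction"
```

Since no achievable utility can exceed 1, no non-inaction policy can ever score above -1, which is the sense in which the agent “provably follows the inaction policy.”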
This is a bad impact measure, because it makes the AI unable to do anything. We should probably have a desideratum that outlaws this, and it should probably be of the form “Our AI is able to do things”, and that’s what I was trying to get at above. (And I do think that AUP might have this problem.)
We’re on the same page, basically; I thought you meant literally proving that activating the agent won’t cause a vacuum collapse.
This criticism of impact measures doesn’t seem falsifiable? Or maybe I misunderstand.
If we want to argue this, we should first come up with a terrible x-scenario which is low objective impact. I have yet to see one, although they could exist. The evidence so far points towards “objective impact is sufficient”.
And it doesn’t satisfy the desiderata, as you note?
People have yet to point out a goal AUP cannot maximize in a low-impact way. Instead, certain methods of reaching certain goals are disallowed. These are distinct flaws, with the latter only turning into the former (as I understand it) if no such method exists for any given goal.
We’d like to build AI systems that help us resolve the tricky situation that we’re in. That help design and enforce agreements to avoid technological risks, build better-aligned AI, negotiate with other actors, predict and manage the impacts of AI, improve our institutions and policy, etc.
I think the default “terrible” scenario is one where increasingly powerful AI makes the world change faster and faster, and makes our situation more and more complex, with humans having less and less of a handle on what is going on or how to steer it in a positive direction. Where we must rely on AI to get anywhere at all, and thereby give up the ability to choose where we are going.
That may ultimately culminate with a catastrophic bang, but if it does it’s not going to be because we wanted the AI to have a small impact and it had a large impact. It’s probably going to be because we have a very limited idea what is going on, but we don’t feel like we have the breathing room to step back and chill out (at least not for long) because we don’t believe that everyone else is going to give us time.
If I’m trying to build an AI to help us navigate an increasingly complex and rapidly-changing world, what does “low impact” mean? In what sense do the terrible situations involve higher objective impact than the intended behaviors?
(And realistically I doubt we’ll fail at alignment with a bang—it’s more likely that the world will just drift off the rails over the course of a few months or years. The intuition that we wouldn’t let things go off the rails gradually seems like the same kind of wishful thinking that predicts war or slow-rolling environmental disasters should never happen.)
It seems like “low objective impact” is what we need once we are in the unstable situation where we have the technology to build an AI that would quickly and radically transform the world, but we have all decided not to and so are primarily concerned about radically transforming the world by accident. I think that’s a coherent situation to think about and plan for, but we shouldn’t mistake it for the mainline. (I personally think it is quite unlikely, and it would definitely be unprecedented, though you could still think it’s the best hope if you were very pessimistic about what I consider “mainline” alignment.)
Solving low impact seems like it would allow us to ensure that each low impact agent won’t push the world in a given direction by more than some bounded, (presumably) small amount. If we’re thinking of my new measure in particular, it would also help ensure that we won’t be surprised by the capability gain of any single agent, which might help even if we aren’t expecting the spontaneous arrival of a singleton. A good formulation of low impact would have the property that interactions of multiple such agents don’t turn into more than the sum of the constituent impact budgets. In this sense, I think it’s sensible to see measuring and restricting objective impact (implicitly thinking of my approach here) as helpful for slowing down the situation.
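The composition property described here — that the joint impact of several low-impact agents should be bounded by the sum of their individual budgets — can be written down as a simple invariant. This is only a statement of the desired property; the impact numbers are placeholders, and nothing here shows that any particular measure actually satisfies it:

```python
def within_combined_budget(individual_impacts, budgets, combined_impact):
    """Check the desired subadditivity property: each agent stays
    within its own budget, and the agents acting together never
    exceed the sum of those budgets."""
    assert all(i <= b for i, b in zip(individual_impacts, budgets))
    return combined_impact <= sum(budgets)

# Two agents, each under a budget of 1.0; a good low-impact
# formulation should rule out their joint impact exceeding 2.0.
assert within_combined_budget([0.8, 0.9], [1.0, 1.0], 1.5)
```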
I also think that, depending on the specific formulation, a low impact solution would enable a substantial reduction in the problems which we need to solve ourselves. That is, I think solving low impact might make useful technical oracles possible. It might be the case that we only need a portion of the agent foundations agenda + low impact in order to build these oracles, which we could then use to help us solve value alignment/corrigibility/etc.
I am also aware that using these oracles would not (naively) be low impact; I plan to outline how we could maybe get around this in a robust manner as soon as I am able.
Here is a scenario that I think low impact methods can’t address. Let’s say the AI has the opportunity to easily (without affecting its ability to do its task) create a “free” copy or a sub-agent on the Internet. That sub-agent can be tasked with pursuing the goal of restricting whatever measure of impact the main agent operates under. The sub-agent very carefully eats the universe, while ensuring that according to the impact measure there is no effect of its activity. It has no other purpose. Any resources it needs, it can gather where they would’ve been wasted without its intervention (like reaching distant stars before other agents and capturing some of their output that wouldn’t otherwise be captured).
This functions like “neutral magic” in the environment: everything becomes weird, with agents lurking in the shadows, but anything that was thought of in the impact measure is no different than without this magic. Of course, this is very bad for Goodhart reasons, since no impact measure is perfect. (This is like the example with an explosion in the desert, but everywhere, and harder to notice exactly where it would become noticeable.)
A general technique to avoid this might be a way of always keeping the agent busy, computing hashes or something, even when it’s done with its task (maybe as a lexicographically lower goal), so that it doesn’t do something like that because it can. This also looks like an impact measure.
(This is related to how an impact measure is close to being a goal (a transformation of goals), so the failure modes of misalignment apply to it as well. I think there is a useful distinction between goals and corrigibility, which might be reproduced for something like low impact, as a third kind of thing that an agent might pursue, which is neither a goal nor corrigibility.)
My new measure captures this, if I understand correctly. This action drastically changes the agent’s ability to achieve different goals.
The sub-agent in this scenario won’t be helping the main agent with achieving any goals. It only hides. Its nature precludes any usefulness. It’s no more useful than its absence. But for the main agent, it’s as easy to set up as its absence. And there might be reasons for this balance to be broken in favor of creating the sub-agent.
That isn’t how AUP works. Unleashing this agent would change resource availability and increase or decrease the power of an arbitrary maximizer from that vantage point.
Suppose that an arbitrary maximizer could not co-opt this new agent—its ability to achieve goals is decreased compared to if it hadn’t activated the agent. On the other hand, if it can co-opt it, its ability is increased. This is not to mention the large amount of resources that would be used by such an expansive sub-agent, nor the fact that intent verification seemingly would not allow such a sub-agent to be built.
I discuss this kind of thing in several places in the comments, if you’re interested.
It’ll replenish the resources it takes, help any maximizer it impedes so as to exactly cancel out the impediment, etc.
To the extent its existence could pose a problem for another agent (according to the measure, which can’t really talk about goals of agents-in-general), it’ll surrender its resources without a fight or help with the other agent’s goals. It will destroy any surplus in its resources before it can be captured. A change in variance resulting from unpredictability of such outcomes can be similarly targeted.
In general, these seem like the details that neutral magic can work out as their possibility occurs to it, so the difficulty is shifted from the main agent to the trusted ability of its sub-agent to do the rest. Worst case scenario, it can self-destruct early, but there’s probably enough leeway to avoid that outcome.
Then it specifically isn’t allowed by intent verification.
Are these your priors, or posteriors after having read my post? Because AUP is about the goals of arbitrary agents.
It’s Rice’s theorem, though really more about conceptual ambiguity. We can talk about particular notions of agents or goals, but it’s never fully general, unless we by construction ensure that unexpected things can’t occur. And even then it’s not what we would have wanted the notions of agents or goals to be, because it’s not clear what that is.
Intent verification doesn’t seem to capture things that smuggle in a tiny bit of helpfulness when these things are actually required to deliver that helpfulness, especially after other routes to improving the outcome have been exhausted (this is what the paragraph about hashes in the first comment was about). So the neutral magic could be helpful a tiny, bounded amount. This is one of the ways the balance between doing nothing and releasing a sub-agent could be broken.
Are you talking about granular actions, or coarse actions? The acceptable way to do IV for the latter is an open question, although I point out preliminary approaches.
I was talking about what I understand the purpose/design of intent verification to be, not specifically the formalizations you described. (I don’t think it’s particularly useful to work out the details without a general plan or expectation of important technical surprises.)
If you decompose the creation of such an agent, some of those actions are wasted effort in the eyes of a pure u_A maximizer (“don’t help me too much”). So, the logic goes, they really aren’t related to u_A, but rather to skirting the impact measure, and should therefore be penalized.
It could as easily be “do this one slightly helpful thing”, an addition on top of doing nothing. It doesn’t seem like there is an essential distinction between such different framings of the same outcome that intent verification can capture.
Whether these granular actions exist is also an open question I listed.
I don’t see why some version of IV won’t be able to get past this, however. There seems to be a simple class of things the agent does to get around an impact measure that it wouldn’t do if it were just trying to pursue a goal to the maximum extent. It might be true that the things the agent does to get around it are also slightly helpful for the goal, but probably not as helpful as the most helpful action.
I worry there might be leaks in logical time that let the agent choose an action that takes into account that an impactful action will be denied. For example, a sub-agent could be built so that it’s a maximizer that’s not constrained by an impact measure. The sub-agent then notices that to maximize its goal, it must constrain its impact, or else the main agent won’t be allowed to create it. And so it will so constrain its impact and will be allowed to be created, as a low-impact and maximally useful action of the main agent. It’s sort of a daemon, but with respect to impact measure and not goals, which additionally does respect the impact measure and only circumvents it once in order to get created.
That’s a really interesting point. I’d like to think about this more, but one preliminary intuition I have against this (and any general successor creation by AUP, really) being the best action is that making new agents aligned with your goals is instrumentally convergent. This could add a frictional cost so that the AUP agent would be better off just doing the job itself. Perhaps we could also stop this via approval incentives, which might tip the scales enough?
Will reply on the other post to consolidate discussion.