I don’t know how to formalize them to the point that we can actually check with reasonable certainty whether a proposed impact measure meets these desiderata.
The -agnostics, the -sensitives, ‘apparently rational’, ‘shutdown-safe’, and ‘knowably-low impact’ should be obvious from the design. ‘Natural kind’ and ‘plausibly efficient’ are indeed subjective. ‘Corrigible’, ‘no offsetting’, ‘clinginess / scapegoating avoidance’, ‘dynamic consistency’, and ‘robust’ can all be proven (although presently, ‘robust’ can only be proven insofar as “this weird behavior won’t happen if a normal agent wouldn’t do it”, which plausibly seems sufficient for weird behaviors incentivized by the impact measure).
The impact measure alone should disallow all extinction scenarios, while still allowing the AI system to do most of the things we use AI for today.
(I’m going to assume you mean the weaker thing that doesn’t literally involve precluding every possible bad outcome)
I don’t see why an impact measure fulfilling the criteria I listed wouldn’t meet what I think you have in mind. Your example with Go is not value-agnostic, and arguably has minuscule objective impact on its own.
(I’m going to assume you mean the weaker thing that doesn’t literally involve precluding every possible bad outcome)
I’m confused. I think under the strongly superintelligent AI model (which seems to be the model you’re using), if there’s misalignment then the AI is strongly optimizing against any security precautions we’ve taken, so if we don’t preclude every possible bad outcome, the AI will find the one we missed. I grant that we’re probably not going to be able to prove that it precludes every possible bad outcome, if that’s what you’re worried about, but that still should be our desideratum. I’m also happy to consider other threat models besides strongly superintelligent AI, but that doesn’t seem to be what you’re considering.
Your example with Go is not value-agnostic, and arguably has minuscule objective impact on its own.
That’s my point. It could have been the case that we cared about AIs not beating us at Go, and if building AlphaGo does have minuscule objective impact, then that would have happened, causing a catastrophe. In that world, I wouldn’t be surprised if we had arguments about why such a thing was clearly a high-impact action. (Another way of putting this is that I think either “impact” is a value-laden concept, or “impact” will fail to prevent some catastrophe, or “impact” prevents the AI from doing anything useful.)
I don’t see why an impact measure fulfilling the criteria I listed wouldn’t meet what I think you have in mind.
Suppose your utility function has a maximum value of 1, and the inaction policy always gets utility 0. Consider the impact penalty that always assigns a penalty of 2, except for the inaction policy, where the penalty is 0. The agent will provably follow the inaction policy. This impact penalty satisfies all of the desiderata except “natural kind”. If you want to make it continuous so that it’s goal-agnostic, then make the impact penalty 2 + <insert favorite impact penalty here>. Arguably it doesn’t satisfy “scope-sensitivity” and “irreversibility-sensitivity”; I’m counting those as satisfied because this penalty will never allow the agent to take a higher-impact action, or a more-irreversible action, which I think was the point of those desiderata.
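Spelling out why the agent provably follows the inaction policy, under the assumption (implicit in the comment) that the penalty is simply subtracted from the utility: for every policy other than inaction,

```latex
U(\pi) - \text{Penalty}(\pi) \;\le\; 1 - 2 \;=\; -1 \;<\; 0 \;=\; U(\pi_{\text{inaction}}) - \text{Penalty}(\pi_{\text{inaction}}),
```

so the penalized objective is maximized by inaction regardless of what the base utility function rewards.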
This is a bad impact measure, because it makes the AI unable to do anything. We should probably have a desideratum that outlaws this, and it should probably be of the form “Our AI is able to do things”, and that’s what I was trying to get at above. (And I do think that AUP might have this problem.)
I’m confused. I think under the strongly superintelligent AI model (which seems to be the model you’re using), if there’s misalignment then the AI is strongly optimizing against any security precautions we’ve taken, so if we don’t preclude every possible bad outcome, the AI will find the one we missed.
We’re basically on the same page; I thought you meant literally proving that activating the agent won’t cause a vacuum collapse.
In that world, I wouldn’t be surprised if we had arguments about why such a thing was clearly a high-impact action.
This criticism of impact measures doesn’t seem falsifiable? Or maybe I misunderstand.
Another way of putting this is that I think either “impact” is a value-laden concept, or “impact” will fail to prevent some catastrophe, or “impact” prevents the AI from doing anything useful.
If we want to argue this, we should first come up with a terrible x-scenario which is low objective impact. I have yet to see one, although they could exist. The evidence so far points towards “objective impact is sufficient”.
This is a bad impact measure
And it doesn’t satisfy the desiderata, as you note?
And I do think that AUP might have this problem.
People have yet to point out a goal AUP cannot maximize in a low-impact way. Instead, certain methods of reaching certain goals are disallowed. These are distinct flaws, with the latter only turning into the former (as I understand it) if no such method exists for a given goal.
If we want to argue this, we should first come up with a terrible x-scenario which is low objective impact. I have yet to see one, although they could exist. The evidence so far points towards “objective impact is sufficient”.
We’d like to build AI systems that help us resolve the tricky situation that we’re in. That help design and enforce agreements to avoid technological risks, build better-aligned AI, negotiate with other actors, predict and manage the impacts of AI, improve our institutions and policy, etc.
I think the default “terrible” scenario is one where increasingly powerful AI makes the world change faster and faster, and makes our situation more and more complex, with humans having less and less of a handle on what is going on or how to steer it in a positive direction. Where we must rely on AI to get anywhere at all, and thereby give up the ability to choose where we are going.
That may ultimately culminate with a catastrophic bang, but if it does it’s not going to be because we wanted the AI to have a small impact and it had a large impact. It’s probably going to be because we have a very limited idea what is going on, but we don’t feel like we have the breathing room to step back and chill out (at least not for long) because we don’t believe that everyone else is going to give us time.
If I’m trying to build an AI to help us navigate an increasingly complex and rapidly-changing world, what does “low impact” mean? In what sense do the terrible situations involve higher objective impact than the intended behaviors?
(And realistically I doubt we’ll fail at alignment with a bang—it’s more likely that the world will just drift off the rails over the course of a few months or years. The intuition that we wouldn’t let things go off the rails gradually seems like the same kind of wishful thinking that predicts war or slow-rolling environmental disasters should never happen.)
It seems like “low objective impact” is what we need once we are in the unstable situation where we have the technology to build an AI that would quickly and radically transform the world, but we have all decided not to and so are primarily concerned about radically transforming the world by accident. I think that’s a coherent situation to think about and plan for, but we shouldn’t mistake it for the mainline. (I personally think it is quite unlikely, and it would definitely be unprecedented, though you could still think it’s the best hope if you were very pessimistic about what I consider “mainline” alignment.)
If I’m trying to build an AI to help us navigate an increasingly complex and rapidly-changing world, what does “low impact” mean? In what sense do the terrible situations involve higher objective impact than the intended behaviors?
Solving low impact seems like it would allow us to ensure that each low-impact agent won’t push the world in a given direction by more than some bounded (and presumably small) amount. If we’re thinking of my new measure in particular, it would also help ensure that we won’t be surprised by the capability gain of any single agent, which might help even if we aren’t expecting the spontaneous arrival of a singleton. A good formulation of low impact would have the property that interactions of multiple such agents don’t add up to more than the sum of the constituent impact budgets. In this sense, I think it’s sensible to see measuring and restricting objective impact (implicitly thinking of my approach here) as helpful for slowing down the situation.
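One way to state that budget-composition property explicitly; this formalization is my own gloss rather than something from the post, with b_i standing for agent i’s impact budget:

```latex
\text{Impact}(\pi_1, \ldots, \pi_n) \;\le\; \sum_{i=1}^{n} b_i
```

so that deploying several low-impact agents can never compound into more than a bounded push on the world.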
I also think that, depending on the specific formulation, a low impact solution would enable a substantial reduction in the problems which we need to solve ourselves. That is, I think solving low impact might make useful technical oracles possible. It might be the case that we only need a portion of the agent foundations agenda + low impact in order to build these oracles, which we could then use to help us solve value alignment/corrigibility/etc.
I am also aware that using these oracles would not (naively) be low impact; I plan to outline how we could maybe get around this in a robust manner as soon as I am able.
If we want to argue this, we should first come up with a terrible x-scenario which is low objective impact.
Here is a scenario that I think low impact methods can’t address. Let’s say the AI has the opportunity to easily (without affecting its ability to do its task) create a “free” copy or a sub-agent on the Internet. That sub-agent can be tasked with pursuing the goal of restricting whatever measure of impact the main agent operates under. The sub-agent very carefully eats the universe, while ensuring that, according to the impact measure, its activity has no effect. It has no other purpose. Any resources it needs, it can gather where they would’ve been wasted without its intervention (like reaching distant stars before other agents and capturing some of their output that wouldn’t otherwise be captured).
This functions like “neutral magic” in the environment: everything becomes weird, with agents lurking in the shadows, but anything that was thought of when designing the impact measure is no different than it would be without this magic. Of course, this is very bad for Goodhart reasons, since no impact measure is perfect. (This is like the example with an explosion in the desert, but everywhere, and harder to notice exactly where it would become noticeable.)
A general technique to avoid this might be a way of always keeping the agent busy, computing hashes or something, even when it’s done with its task (maybe as a lexicographically lower goal), so that it doesn’t do something like that because it can. This also looks like an impact measure.
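A minimal sketch of the ‘lexicographically lower busywork goal’ idea; the function names and the epsilon-weighting trick are my own illustrative assumptions, not anything specified in the comment:

```python
def lexicographic_value(outcome, task_utility, busywork_utility, epsilon=1e-9):
    """Score an outcome primarily by the task utility; use a 'busywork'
    objective (e.g., number of hashes computed) only to break ties.

    Weighting the secondary term by a small epsilon is one crude way to
    approximate a true lexicographic ordering; it only works if the
    busywork utility is bounded and epsilon is smaller than any
    task-utility gap the agent can distinguish.
    """
    return task_utility(outcome) + epsilon * busywork_utility(outcome)
```

The hoped-for effect is that any spare capacity goes into harmless busywork rather than into schemes like the sub-agent above.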
(This is related to how an impact measure is close to being a goal (a transformation of goals), so the failure modes of misalignment apply to it as well. I think there is a useful distinction between goals and corrigibility, which might be reproduced for something like low impact, as a third kind of thing that an agent might pursue, which is neither a goal nor corrigibility.)
My new measure captures this, if I understand correctly. This action drastically changes the agent’s ability to achieve different goals.

The sub-agent in this scenario won’t be helping the main agent with achieving any goals. It only hides. Its nature precludes any usefulness. It’s no more useful than its absence. But for the main agent, setting it up is as easy as not setting it up. And there might be reasons for this balance to be broken in favor of creating the sub-agent.
That isn’t how AUP works. Unleashing this agent would change resource availability and increase or decrease the power of an arbitrary maximizer from that vantage point.
Suppose that an arbitrary maximizer could not co-opt this new agent; then its ability to achieve goals is decreased compared to if it hadn’t activated the agent. On the other hand, if it can co-opt it, its ability is increased. This is not to mention the large amount of resources that would be used by such an expansive sub-agent, nor the fact that intent verification seemingly would not allow such a sub-agent to be built.
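For readers without the AUP post open, the penalty being appealed to here is roughly of the form (omitting the scaling and other machinery from the post):

```latex
\text{Penalty}(s, a) \;=\; \sum_{u \in \mathcal{U}} \left| Q_u(s, a) - Q_u(s, \varnothing) \right|
```

a sum, over a set of auxiliary utility functions, of how much the action changes the utility attainable afterwards compared to doing nothing, which is why a change in an arbitrary maximizer’s power in either direction gets penalized.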
I discuss this kind of thing in several places in the comments, if you’re interested.
Unleashing this agent would change resource availability and increase or decrease the power of an arbitrary maximizer from that vantage point.
It’ll replenish the resources it takes, help any maximizer it impedes so as to exactly cancel out the impediment, etc.
Suppose that an arbitrary maximizer could not co-opt this new agent; then its ability to achieve goals is decreased compared to if it hadn’t activated the agent.
To the extent its existence could pose a problem for another agent (according to the measure, which can’t really talk about goals of agents-in-general), it’ll surrender its resources without a fight or help with the other agent’s goals. It will destroy any surplus in its resources before it can be captured. A change in variance resulting from unpredictability of such outcomes can be similarly targeted.
In general, these seem like the details that neutral magic can work out as their possibility occurs to it, so the difficulty is shifted from the main agent to the trusted ability of its sub-agent to do the rest. Worst case scenario, it can self-destruct early, but there’s probably enough leeway to stay clear of that outcome.
Then it specifically isn’t allowed by intent verification.

Are these your priors, or posteriors after having read my post? Because AUP is about the goals of arbitrary agents.

It’s Rice’s theorem, though really more about conceptual ambiguity. We can talk about particular notions of agents or goals, but it’s never fully general, unless we by construction ensure that unexpected things can’t occur. And even then it’s not what we would have wanted the notions of agents or goals to be, because it’s not clear what that is.
Intent verification doesn’t seem to capture things that smuggle in a tiny bit of helpfulness when these things are actually required to deliver that helpfulness, especially after other routes to improving the outcome have been exhausted (this is what the paragraph about hashes in the first comment was about). So the neutral magic could be helpful by a tiny, bounded amount. This is one of the ways the balance between doing nothing and releasing a sub-agent could be broken.
Intent verification doesn’t seem to capture things that smuggle in a tiny bit of helpfulness when these things are actually required to deliver that helpfulness
Are you talking about granular actions, or coarse actions? The acceptable way to do IV for the latter is an open question, although I point out preliminary approaches.
I was talking about what I understand the purpose/design of intent verification to be, not specifically the formalizations you described. (I don’t think it’s particularly useful to work out the details without a general plan or expectation of important technical surprises.)
If you decompose the creation of such an agent, some of those actions are wasted effort in the eyes of a pure u_A maximizer (“don’t help me too much”). So, the logic goes, they really aren’t related to u_A, but rather to skirting the impact measure, and should therefore be penalized.
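A hedged sketch of that check for granular actions; the helper q_uA (an estimate of attainable u_A) and the comparison against a no-op are my stand-ins, not a faithful reproduction of the formalization in the post:

```python
def flagged_by_intent_verification(state, action, q_uA, noop_action):
    """Flag an action as 'not something a pure u_A maximizer would do':
    it contributes no more to attainable u_A than doing nothing, so its
    only plausible role is skirting the impact penalty.

    q_uA(state, action) estimates how much u_A the agent can attain
    after taking `action` in `state`.
    """
    return q_uA(state, action) <= q_uA(state, noop_action)
```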
It could as easily be “do this one slightly helpful thing”, an addition on top of doing nothing. It doesn’t seem like there is an essential distinction between such different framings of the same outcome that intent verification can capture.
Whether these granular actions exist is also an open question I listed.
I don’t see why some version of IV won’t be able to get past this, however. There seems to be a simple class of things the agent does to get around an impact measure that it wouldn’t do if it were just trying to pursue a goal to the maximum extent. It might be true that the things the agent does to get around it are also slightly helpful for the goal, but probably not as helpful as the most helpful action.
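If the standard is ‘as helpful as the most helpful action’ rather than ‘more helpful than doing nothing’, the sketch above would instead look something like this (again illustrative; the tolerance term is my own assumption):

```python
def flagged_against_best_action(state, action, q_uA, candidate_actions, tolerance=0.0):
    """Flag an action if a pure u_A maximizer would clearly prefer some
    other available action: the slight helpfulness smuggled in by an
    impact-skirting plan should not survive this comparison.
    """
    best = max(q_uA(state, a) for a in candidate_actions)
    return q_uA(state, action) < best - tolerance
```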
I worry there might be leaks in logical time that let the agent choose an action that takes into account that an impactful action will be denied. For example, a sub-agent could be built so that it’s a maximizer that’s not constrained by an impact measure. The sub-agent then notices that to maximize its goal, it must constrain its impact, or else the main agent won’t be allowed to create it. And so it constrains its impact and is allowed to be created, as a low-impact and maximally useful action of the main agent. It’s sort of a daemon, but with respect to the impact measure rather than goals, one which additionally does respect the impact measure and only circumvents it once in order to get created.
That’s a really interesting point. I’d like to think about this more, but one preliminary intuition I have against this (and any general successor creation by AUP, really) being the best action is that making new agents aligned with your goals is instrumentally convergent. This could add a frictional cost, so that the AUP agent would be better off just doing the job itself. Perhaps we could also stop this via approval incentives, which might tip the scales enough?

Will reply on the other post to consolidate discussion.