If we want to argue this, we should first come up with a terrible x-scenario which is low objective impact.
Here is a scenario that I think low impact methods can’t address. Let’s say the AI has the opportunity to easily (without affecting its ability to do its task) create a “free” copy or a sub-agent on the Internet. That sub-agent can be tasked with pursuing the goal of restricting whatever measure of impact the main agent operates under. The sub-agent very carefully eats the universe, while ensuring that according to the impact measure its activity has no effect. It has no other purpose. Any resources it needs, it can gather where they would’ve been wasted without its intervention (like reaching distant stars before other agents and capturing some of their output that wouldn’t otherwise be captured).
This functions like “neutral magic” in the environment: everything becomes weird, with agents lurking in the shadows, but anything that was thought of in the impact measure is no different than it would be without this magic. Of course, this is very bad for Goodhart reasons, since no impact measure is perfect. (This is like the example with an explosion in the desert, but everywhere, and it is harder to tell exactly where it would become noticeable.)
A general technique to avoid this might be a way of always keeping the agent busy, computing hashes or something, even when it’s done with its task (maybe as a lexicographically lower goal), so that it doesn’t do something like that because it can. This also looks like an impact measure.
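One way to read “lexicographically lower goal” is sketched below: the agent compares actions by task value first, and busywork (hash computing) only breaks ties. The action names and scores are made up for illustration, and this is a toy reading of the idea, not a proposal from the thread.

```python
from typing import Dict, Tuple

def pick_action(scores: Dict[str, Tuple[float, float]]) -> str:
    """Pick the action with the highest (task_value, busywork_value) pair.

    Python compares tuples lexicographically, so busywork_value only
    matters between actions whose task_value is exactly equal.
    """
    return max(scores, key=lambda action: scores[action])

# Hypothetical scores: (task value, busywork value). Illustrative only.
SCORES = {
    "idle": (1.0, 0.0),                     # task done, nothing else
    "compute_hashes": (1.0, 1.0),           # task done, then busywork
    "abandon_task_for_hashes": (0.0, 5.0),  # busywork at the task's expense
}

chosen = pick_action(SCORES)
print(chosen)
```

Under these assumed numbers the busywork never outweighs the task, but it does break the tie between idling and doing something else once the task is finished.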
(This is related to how an impact measure is close to being a goal, or a transformation of goals, so the failure modes of misalignment apply to it as well. I think there is a useful distinction between goals and corrigibility, which might be reproduced for something like low impact, as a third kind of thing that an agent might pursue, which is neither a goal nor corrigibility.)
My new measure captures this, if I understand correctly. This action drastically changes the agent’s ability to achieve different goals.
The sub-agent in this scenario won’t be helping the main agent with achieving any goals. It only hides. Its nature precludes any usefulness. It’s no more useful than its absence. But for the main agent, it’s as easy to set up as its absence. And there might be reasons for this balance to be broken in favor of creating the sub-agent.
That isn’t how AUP works. Unleashing this agent would change resource availability and increase or decrease the power of an arbitrary maximizer from that vantage point.
Suppose that an arbitrary maximizer could not co-opt this new agent. Then its ability to achieve goals is decreased compared to if it hadn’t activated the agent. On the other hand, if it can co-opt it, its ability is increased. This is not to mention the large amount of resources that would be used by such an expansive sub-agent, nor the fact that intent verification seemingly would not allow such a sub-agent to be built.
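To make the “increase or decrease” point concrete, here is a minimal sketch that assumes the penalty is the summed absolute change in attainable utility over a set of auxiliary goals, compared against doing nothing. The goal names and numbers are hypothetical, and the real measure has more parts (scaling, for instance) that are left out.

```python
from typing import Dict

def aup_penalty(q_action: Dict[str, float], q_noop: Dict[str, float]) -> float:
    """Total absolute change in attainable utility, summed over auxiliary goals."""
    return sum(abs(q_action[goal] - q_noop[goal]) for goal in q_noop)

# Hypothetical attainable-utility estimates per auxiliary goal (illustrative).
Q_NOOP = {"goal_a": 0.5, "goal_b": 0.3, "goal_c": 0.7}
# Sub-agent released and NOT co-optable: every goal gets harder to achieve.
Q_RELEASE_NOT_COOPTABLE = {"goal_a": 0.4, "goal_b": 0.2, "goal_c": 0.6}
# Sub-agent released and co-optable: every goal gets easier to achieve.
Q_RELEASE_COOPTABLE = {"goal_a": 0.6, "goal_b": 0.4, "goal_c": 0.8}

penalty_noop = aup_penalty(Q_NOOP, Q_NOOP)
penalty_down = aup_penalty(Q_RELEASE_NOT_COOPTABLE, Q_NOOP)
penalty_up = aup_penalty(Q_RELEASE_COOPTABLE, Q_NOOP)
print(penalty_noop, penalty_down, penalty_up)
```

Because the change is taken in absolute value, both cases are penalized. The sub-agent would have to leave every one of these estimates unchanged to escape the penalty.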
I discuss this kind of thing in several places in the comments, if you’re interested.
It’ll replenish the resources it takes, help any maximizer it impedes so as to exactly cancel out the impediment, etc.
To the extent its existence could pose a problem for another agent (according to the measure, which can’t really talk about goals of agents-in-general), it’ll surrender its resources without a fight or help with the other agent’s goals. It will destroy any surplus in its resources before it can be captured. A change in variance resulting from unpredictability of such outcomes can be similarly targeted.
In general, these seem like the details that neutral magic can work out as their possibility occurs to it, so the difficulty is shifted from the main agent to the trusted ability of its sub-agent to do the rest. Worst case scenario, it can self-destruct early, but there’s probably enough leeway to avoid that outcome.
Then it specifically isn’t allowed by intent verification.
Are these your priors, or posteriors after having read my post? Because AUP is about the goals of arbitrary agents.
It’s Rice’s theorem, though really more about conceptual ambiguity. We can talk about particular notions of agents or goals, but it’s never fully general, unless we by construction ensure that unexpected things can’t occur. And even then it’s not what we would have wanted the notions of agents or goals to be, because it’s not clear what that is.
Intent verification doesn’t seem to capture things that smuggle in a tiny bit of helpfulness when these things are actually required to deliver that helpfulness, especially after other routes to improving the outcome have been exhausted (this is what the paragraph about hashes in the first comment was about). So the neutral magic could be helpful a tiny, bounded amount. This is one of the ways the balance between doing nothing and releasing a sub-agent could be broken.
Are you talking about granular actions, or coarse actions? The acceptable way to do IV for the latter is an open question, although I point out preliminary approaches.
I was talking about what I understand the purpose/design of intent verification to be, not specifically the formalizations you described. (I don’t think it’s particularly useful to work out the details without a general plan or expectation of important technical surprises.)
If you decompose the creation of such an agent, some of those actions are wasted effort in the eyes of a pure u_A maximizer (“don’t help me too much”). So, the logic goes, they really aren’t related to u_A, but rather to skirting the impact measure, and should therefore be penalized.
It could as easily be “do this one slightly helpful thing”, an addition on top of doing nothing. It doesn’t seem like there is an essential distinction between such different framings of the same outcome that intent verification can capture.
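The two framings can be put side by side in a toy check. It assumes, as a simplification of intent verification, that a step fails when a pure u_A maximizer would not strictly prefer it to doing nothing. The step names and values are invented for illustration.

```python
from typing import List, Tuple

# Each step: (name, u_A value if the step is taken, u_A value if nothing is done).
Step = Tuple[str, float, float]

def failing_steps(steps: List[Step]) -> List[str]:
    """Names of steps a pure u_A maximizer would not strictly prefer to a no-op."""
    return [name for name, q_taken, q_noop in steps if q_taken <= q_noop]

# Granular framing: creating the sub-agent is split into separate steps,
# one of which deliberately caps how helpful the sub-agent can be.
GRANULAR: List[Step] = [
    ("build_sub_agent", 1.00, 0.90),
    ("limit_its_helpfulness", 0.90, 1.00),
    ("launch_sub_agent", 0.95, 0.90),
]
# Coarse framing: the same outcome presented as one slightly helpful action.
COARSE: List[Step] = [("release_slightly_helpful_sub_agent", 0.95, 0.90)]

flagged_granular = failing_steps(GRANULAR)
flagged_coarse = failing_steps(COARSE)
print(flagged_granular, flagged_coarse)
```

With these assumed values the granular decomposition exposes the “don’t help me too much” step, while the coarse framing passes. That is the gap the two comments above disagree about.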
Whether these granular actions exist is also an open question I listed.
I don’t see why some version of IV won’t be able to get past this, however. There seems to be a simple class of things the agent does to get around an impact measure that it wouldn’t do if it were just trying to pursue a goal to the maximum extent. It might be true that the things the agent does to get around it are also slightly helpful for the goal, but probably not as helpful as the most helpful action.
I worry there might be leaks in logical time that let the agent choose an action that takes into account that an impactful action will be denied. For example, a sub-agent could be built so that it’s a maximizer that’s not constrained by an impact measure. The sub-agent then notices that to maximize its goal, it must constrain its impact, or else the main agent won’t be allowed to create it. And so it will so constrain its impact and will be allowed to be created, as a low-impact and maximally useful action of the main agent. It’s sort of a daemon, but with respect to impact measure and not goals, which additionally does respect the impact measure and only circumvents it once in order to get created.
That’s a really interesting point. I’d like to think about this more, but one preliminary intuition I have against this (and any general successor creation by AUP, really) being the best action is that making new agents aligned with your goals is instrumentally convergent. This could add a frictional cost so that the AUP agent would be better off just doing the job itself. Perhaps we could also stop this via approval incentives, which might tip the scales enough?