The sub-agent in this scenario won’t be helping the main agent achieve any goals. It only hides. Its nature precludes any usefulness. It’s no more useful than its absence. But for the main agent, it’s as easy to set up as its absence. And there might be reasons for this balance to be broken in favor of creating the sub-agent.
That isn’t how AUP works. Unleashing this agent would change resource availability and increase or decrease the power of an arbitrary maximizer from that vantage point.
Suppose that an arbitrary maximizer could not co-opt this new agent—its ability to achieve goals is decreased compared to if it hadn’t activated the agent. On the other hand, if it can co-opt it, its ability is increased. This is not to mention the large amount of resources that would be used by such an expansive sub-agent, nor the fact that intent verification seemingly would not allow such a sub-agent to be built.
I discuss this kind of thing in several places in the comments, if you’re interested.
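As a rough illustration of the kind of penalty being described, here is a toy sketch (the goal names, numbers, and function are all hypothetical, not the actual AUP formalism): the penalty for an action is the total change it causes in what an arbitrary maximizer could attain from that vantage point, so releasing a sub-agent that shifts those attainable values in either direction gets penalized.

```python
# Toy sketch of an AUP-style penalty (a paraphrase with made-up numbers,
# not the post's formalism): penalize an action by the total change it
# causes in the agent's ability to achieve a set of auxiliary goals.

def aup_penalty(attainable_before, attainable_after):
    """Sum of absolute changes in attainable utility across auxiliary goals.

    attainable_before/after: dicts mapping goal name -> estimated best
    achievable value of that goal from the current vantage point.
    """
    return sum(
        abs(attainable_after[g] - attainable_before[g])
        for g in attainable_before
    )

# Hypothetical numbers: releasing the sub-agent lowers what one arbitrary
# maximizer could achieve and raises what another could achieve; both
# directions of change contribute to the penalty.
before = {"paperclips": 1.0, "hide": 0.2}
after = {"paperclips": 0.6, "hide": 0.9}
print(aup_penalty(before, after))  # ≈ 1.1 (up to float rounding)
```

The point the sketch makes is that the penalty is symmetric: decreasing an arbitrary maximizer’s power is penalized just as much as increasing it, so “exactly cancelling out” is what the measure demands.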
Unleashing this agent would change resource availability and increase or decrease the power of an arbitrary maximizer from that vantage point.
It’ll replenish the resources it takes, help any maximizer it impedes so as to exactly cancel out the impediment, etc.
Suppose that an arbitrary maximizer could not co-opt this new agent - its ability to achieve goals is decreased compared to if it hadn’t activated the agent.
To the extent its existence could pose a problem for another agent (according to the measure, which can’t really talk about goals of agents-in-general), it’ll surrender its resources without a fight or help with the other agent’s goals. It will destroy any surplus in its resources before it can be captured. A change in variance resulting from unpredictability of such outcomes can be similarly targeted.
In general, these seem like the details that neutral magic can work out as their possibility occurs to it, so the difficulty is shifted from the main agent to the trusted ability of its sub-agent to do the rest. Worst case scenario, it can self-destruct early, but there’s probably enough leeway to avoid that outcome.
Then it specifically isn’t allowed by intent verification.

Are these your priors, or posteriors after having read my post? Because AUP is about the goals of arbitrary agents.
It’s Rice’s theorem, though really more about conceptual ambiguity. We can talk about particular notions of agents or goals, but it’s never fully general, unless we ensure by construction that unexpected things can’t occur. And even then it’s not what we would have wanted the notions of agents or goals to be, because it’s not clear what that is.
Intent verification doesn’t seem to capture things that smuggle in a tiny bit of helpfulness when these things are actually required to deliver that helpfulness, especially after other routes to improving the outcome have been exhausted (this is what the paragraph about hashes in the first comment was about). So the neutral magic could be helpful a tiny, bounded amount. This is one of the ways the balance between doing nothing and releasing a sub-agent could be broken.
Intent verification doesn’t seem to capture things that smuggle in a tiny bit of helpfulness when these things are actually required to deliver that helpfulness
Are you talking about granular actions, or coarse actions? The acceptable way to do IV for the latter is an open question, although I point out preliminary approaches.
I was talking about what I understand the purpose/design of intent verification to be, not specifically the formalizations you described. (I don’t think it’s particularly useful to work out the details without a general plan or expectation of important technical surprises.)
If you decompose the creation of such an agent, some of those actions are wasted effort in the eyes of a pure u_A maximizer (“don’t help me too much”). So, the logic goes, they really aren’t related to u_A, but rather to skirting the impact measure, and should therefore be penalized.
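The decomposition test described here can be sketched as a toy check (the action names, values, and helper are hypothetical, not the post’s formalization): an action in a plan is flagged as skirting the impact measure if a pure u_A maximizer would do at least as well without it.

```python
# Toy sketch of the intent-verification logic above (hypothetical names and
# numbers, not the post's formalism): flag actions whose removal does not
# cost a pure u_A maximizer anything, i.e. "wasted effort" for u_A.

def flag_wasted_actions(plan, u_A_value):
    """Return actions whose removal doesn't lower the estimated u_A value.

    plan: list of action names.
    u_A_value: function from a tuple of actions to an estimated u_A value.
    """
    full = u_A_value(tuple(plan))
    flagged = []
    for i in range(len(plan)):
        without = tuple(plan[:i] + plan[i + 1:])
        if u_A_value(without) >= full:  # pure maximizer does as well without it
            flagged.append(plan[i])
    return flagged

# Hypothetical values: "build_shield" only serves to dodge the impact
# penalty, so a pure u_A maximizer does just as well without it.
values = {
    ("gather", "build_shield", "make_paperclips"): 10.0,
    ("build_shield", "make_paperclips"): 4.0,
    ("gather", "make_paperclips"): 10.0,
    ("gather", "build_shield"): 2.0,
}
plan = ["gather", "build_shield", "make_paperclips"]
print(flag_wasted_actions(plan, values.__getitem__))  # ['build_shield']
```

The objection in the next comment is exactly that this test is framing-sensitive: the same outcome can be packaged so that every remaining action looks slightly helpful for u_A.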
It could as easily be “do this one slightly helpful thing”, an addition on top of doing nothing. It doesn’t seem like there is an essential distinction between such different framings of the same outcome that intent verification can capture.
Whether these granular actions exist is also an open question I listed.
I don’t see why some version of IV won’t be able to get past this, however. There seems to be a simple class of things the agent does to get around an impact measure that it wouldn’t do if it were just trying to pursue a goal to the maximum extent. It might be true that the things the agent does to get around it are also slightly helpful for the goal, but probably not as helpful as the most helpful action.
I worry there might be leaks in logical time that let the agent choose an action that takes into account that an impactful action will be denied. For example, a sub-agent could be built so that it’s a maximizer that’s not constrained by an impact measure. The sub-agent then notices that to maximize its goal, it must constrain its impact, or else the main agent won’t be allowed to create it. And so it will constrain its impact and will be allowed to be created, as a low-impact and maximally useful action of the main agent. It’s sort of a daemon, but with respect to the impact measure rather than goals, which additionally does respect the impact measure and only circumvents it once in order to get created.
That’s a really interesting point. I’d like to think about this more, but one preliminary intuition I have against this (and any general successor creation by AUP, really) being the best action is that making new agents aligned with your goals is instrumentally convergent. This could add a frictional cost so that the AUP agent would be better off just doing the job itself. Perhaps we could also stop this via approval incentives, which might tip the scales enough?
My new measure captures this, if I understand correctly. This action drastically changes the agent’s ability to achieve different goals.