I basically don’t see the human mimicry frame as a particularly relevant baseline. However, I think I agree with parts of your concern, and I hadn’t grasped your point at first.
The [AUP] equations incentivize the AI to take actions that will provide an immediate reward in the next timestep, but penalize its ability to achieve rewards in later timesteps.
I’d consider a different interpretation. The intent behind the equations is that the agent executes plans using its “current level of resources”, while being seriously penalized for gaining resources. Imagine you’re currently standing on land 1,050 feet above sea level and you’re allowed to explore, but you can only walk on land with elevation between 1,000 and 1,400 feet. That’s the intent.
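To make the intent a bit more concrete, here’s a toy sketch (my own framing here, not one of the post’s equations): treat the agent’s attainable utility as its “elevation”, and only penalize it for wandering outside a band around where it started.

```python
# Toy sketch of the "elevation band" intent, not an equation from the post.
# au_now / au_start stand in for the agent's attainable utility ("resources").

def band_penalty(au_now, au_start, lower_slack, upper_slack, big_penalty):
    """Zero penalty while the agent stays inside the allowed band,
    a large penalty once it climbs (or falls) out of it.
    E.g. au_start=1050, lower_slack=50, upper_slack=350 reproduces the
    1,000-1,400 ft band from the elevation analogy."""
    if au_start - lower_slack <= au_now <= au_start + upper_slack:
        return 0.0
    return big_penalty
```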
The equations don’t fully capture that, and I’m pessimistic that there’s a simple way to capture it:
But what if the only way to receive a reward is to do something that will only give a reward several timesteps later? In realistic situations, when can you ever actually accomplish the goal you’re trying to accomplish in a single atomic action?
For example, suppose the AI is rewarded for making paperclips, but all it can do in the next timestep is start moving its arm towards wire. If it’s just rewarded for making paperclips and it can’t make a paperclip in the next timestep, the AI would instead focus on minimizing impact and not do anything.
I agree that it might be penalized hard here, and this is one reason I’m not satisfied with equation 5 of that post. It penalizes the agent for moving towards its objective. This is weird, and several other commenters share this concern.
Over the last year, I’ve come to think that “penalize own AU gain” is worse than “penalize average AU gain”, in that the latter penalty equation leads to more sensible incentives. I still think that there might be some good way to penalize the agent for becoming more able to pursue its own goal. Equation 5 isn’t it, and I think that part of your critique is broadly right.
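To ground the comparison, here’s a rough Python sketch of the two penalty flavors as I’m describing them (the Q-functions and the no-op baseline are simplifications of mine, not the exact definitions from the post):

```python
# Rough sketch of the two penalty flavors; q_own and the entries of q_aux are
# assumed to be learned Q-functions for the agent's own goal and for a set of
# auxiliary goals, and `noop` is a do-nothing action used as the baseline.

def own_au_penalty(q_own, s, a, noop):
    """'Penalize own AU gain': charge the agent for changes in its ability
    to pursue its own goal, relative to doing nothing. Note this also fires
    when the agent simply moves towards its objective."""
    return abs(q_own(s, a) - q_own(s, noop))

def average_au_penalty(q_aux, s, a, noop):
    """'Penalize average AU gain': charge the agent for changes in its ability
    to pursue a spread of auxiliary goals, relative to doing nothing."""
    return sum(abs(q(s, a) - q(s, noop)) for q in q_aux) / len(q_aux)

def penalized_reward(task_reward, penalty, lam):
    """Task reward minus the chosen impact penalty, weighted by lam."""
    return task_reward - lam * penalty
```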
I hadn’t thought about the distinction between gaining and using resources. You can still wreak havoc without getting resources, though, by using them in a damaging way. But I can see why the distinction might be helpful to think about.
It still seems to me that an agent using equation 5 would pretty much act like a human imitator for anything that takes more than one step, so that’s why I was using it as a comparison. I can try to explain my reasoning if you want, but I suppose it’s a moot point now. And I don’t know if I’m right, anyways.
Basically, I’m concerned that most nontrivial things a person wants will take multiple actions, so in most of the steps the AI will be motivated mainly by the reward given in the current step for reward-shaping reasons (as long as it doesn’t gain too much power). And doing the action that gives the most immediate reward for reward-shaping reasons sounds pretty much like doing whatever action the human would think is best in that situation, which is probably what the human (and the mimic) would do.
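To make that concrete, here’s roughly what I have in mind, written as standard potential-based reward shaping (phi here is a hypothetical “how much progress does this look like” estimate, not anything from your posts):

```python
# Hypothetical illustration of the "reward-shaping reasons" point, using
# standard potential-based shaping; phi is an assumed progress estimate.

GAMMA = 0.99

def shaped_reward(reward, phi, s, s_next):
    """Task reward plus gamma*phi(s') - phi(s): the immediate reward tracks
    single-step 'progress' as judged by phi."""
    return reward(s, s_next) + GAMMA * phi(s_next) - phi(s)

def near_myopic_step(actions, transition, reward, phi, penalty, lam, s):
    """A mostly-myopic agent picks the action with the best immediate shaped
    reward minus the impact penalty. If phi roughly encodes a human's judgment
    of progress, this tends to be the action the human would endorse."""
    return max(
        actions,
        key=lambda a: shaped_reward(reward, phi, s, transition(s, a))
                      - lam * penalty(s, a),
    )
```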
I hadn’t thought about the distinction between gaining and using resources. You can still wreak havoc without getting resources, though, by using them in a damaging way. But I can see why the distinction might be helpful to think about.
I explain my thoughts on this in The Catastrophic Convergence Conjecture. Not sure if you’ve read that, or if you think it’s false, or if you have another position entirely.
I agree that intelligent agents have a tendency to seek power, and that this is a large part of what makes them dangerous. Agents could potentially cause catastrophes in other ways, but I’m not sure if any of those ways are realistic.
As an example, suppose an agent creates powerful self-replicating nanotechnology that makes a pile of paperclips, the agent’s goal. However, because the agent didn’t want to spend the time engineering a way to stop the replication, the nanobots eat the world.
But catastrophes like this would probably also be dealt with by AUP-preservation, at least if you use the multi-equation impact measure. (If the impact equation only concerns the agent’s ability to achieve its own goal, maybe it would let the world be consumed after putting up a nanotech-proof barrier around all of its paperclip-manufacturing resources. But again, I don’t know if that’s realistic.)
I’m also concerned that agents would create large, catastrophic changes to the world in ways that don’t increase their power. For example, an agent who wants to make paperclips might try to create nanotech that assembles the entire world into paperclips. It’s not clear to me that this would increase the agent’s power much. The agent wouldn’t necessarily have any control over the bots, so creating them would only further its one utility function rather than its general ability to get things done. And if the agent is intelligent enough to easily discover how to create such technology, actually creating it doesn’t sound like it would give the agent more power than it already had.
If the material for the bots is scarce, so that making them prevents the AI from making other things, then they might provide a net decrease in the agent’s power. And once the world is paperclips, the agent would be limited to having only paperclips to work with, which could make it pretty weak.
I don’t know if you’d consider the described scenario to be seeking power. At the least, I don’t think it would register as an increase under the agent’s impact equation.