if you told me “you can build superintelligent agents which don’t try to seek power by penalizing them for becoming more able to achieve their own goal”, I wouldn’t exactly die of shock
This seems broadly reasonable to me, but I don’t think it can work under the threat model of optimal agents. “Impact” / “more able” as defined in this sequence can only be caused by events the agent didn’t perfectly predict, because impact requires a change in the agent’s belief about the reward it can accumulate. In a deterministic environment with a truly optimal agent, the agent’s beliefs will never change as it executes the optimal policy, and so there will never be impact. So AUP_conceptual using the definition of impact/power in this sequence doesn’t seem like it solves the problem under the threat model of perfectly optimal agents. (That’s fine! We won’t have those!)

In practice, I interpret variants of AUP-the-method (as in the previous post) as trying to get safety via some combination of two things:
1. Power proxy: When using a set of auxiliary reward functions, the agent’s beliefs about attainable utility for the auxiliary rewards do change, because it is not following an optimal policy for them. This forms a good proxy for power that is compatible with the agent having perfect beliefs (a rough sketch follows this list). The main problem here is that proxies can be gamed (as in various subagent constructions).
2. Starting from a “dumber” belief: (Super unclear / fuzzy) Given that the agent’s actual beliefs won’t change, we can instead have it measure the difference between its beliefs and some “dumber” beliefs, e.g. its beliefs if it were following an inaction policy or a random policy for N timesteps, followed by an optimal policy. The problem here is that you aren’t leveraging the AI’s understanding of the environment, and so in practice I’d expect the effect of this to depend pretty significantly on the environment.
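To make the first of these concrete, here is a minimal sketch of a stepwise attainable-utility penalty over auxiliary rewards. The names (q_aux, noop, scale, lam) are placeholders I’m making up for illustration; this is the rough shape of the idea, not code from the sequence or the previous post.

```python
# Illustrative sketch only: function and argument names are placeholders,
# not an implementation from the sequence.

def aup_penalty(state, action, q_aux, noop, scale):
    """Penalty for how much `action` shifts attainable utility on auxiliary goals.

    q_aux: list of functions q_i(state, action) -> the agent's current estimate
           of how much auxiliary reward i it could attain after taking `action`.
    noop:  a designated "do nothing" action used as the stepwise baseline.
    scale: a normalizing constant (e.g. the attainable utility of doing nothing).
    """
    total = 0.0
    for q in q_aux:
        # Change in what the agent could attain for this auxiliary goal,
        # relative to doing nothing this step.
        total += abs(q(state, action) - q(state, noop))
    return total / (len(q_aux) * max(scale, 1e-8))


def shaped_reward(primary_reward, penalty, lam):
    # The agent optimizes the primary reward minus the scaled penalty.
    return primary_reward - lam * penalty
```

Because the q_aux estimates track goals the agent is not actually optimizing, they shift even when its beliefs about the primary task never do. The second mechanism would instead replace the agent’s own estimates with ones computed from a “dumber” baseline policy (e.g. inaction or random actions for N steps, then optimal play).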
I like AUP primarily because of the first reason: while the power proxy is not ungameable, it certainly seems quite good, and seems like it only deviates from our intuitive notion of power in very weird circumstances or under adversarial optimization. While this means it isn’t superintelligence-safe, it still seems like an important idea that might be useful in other ways.
Once you remove the auxiliary rewards and only use the primary reward R, I think you have mostly lost this benefit: at this point you are saying “optimize for R, but don’t optimize for long-term R”, which seems pretty weird and not a good proxy for power. At this point I think you’re only getting the benefit of starting from a “dumber” belief, or perhaps you shift reward acquisition to be closer to the present than the far future, but this seems pretty divorced from the CCC and all of the conceptual progress made in this sequence. It seems much more in the same spirit as quantilization and/or satisficing, and I’d rather use one of those two methods (since they’re simpler and easier to understand).

(I analyzed a couple of variants of AUP-without-auxiliary-rewards here; I think it mostly supports my claim that these implementations of AUP are pretty similar in spirit to quantilization / satisficing.)
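For contrast, the R-only variant has the same shape but penalizes the primary reward’s own attainable utility (same caveats and placeholder names as the sketch above):

```python
def aup_penalty_primary_only(state, action, q_primary, noop, scale):
    # With no auxiliary rewards, the penalty is just the change in attainable
    # *primary* reward relative to doing nothing this step, so any action that
    # improves the agent's long-run prospects for R gets penalized for that.
    return abs(q_primary(state, action) - q_primary(state, noop)) / max(scale, 1e-8)
```

Here the penalty targets exactly the long-horizon gains in R that the primary term rewards, which is part of why this variant reads more like satisficing / quantilization than like a power measure.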
In a deterministic environment with a truly optimal agent, the agent’s beliefs will never change as it executes the optimal policy, and so there will never be impact. So AUP_conceptual using the definition of impact/power in this sequence doesn’t seem like it solves the problem under the threat model of perfectly optimal agents.
I don’t think this critique applies to AUP_conceptual. AUP_conceptual is defined as penalizing the intuitive version of “change in power”, not the formal definition. From our perspective, we could still say an agent is penalized for changes in power (intuitively perceived), even if the world is secretly deterministic.
If I’m an optimal agent with perfect beliefs about what the (deterministic) world will do, even intuitively I would never say that my power changes. Can you give me an example of what such an agent could do that would change its power?
If by “intuitive” you mean “from the perspective of real humans, even if the agent is optimal / superintelligent”, then I feel like there are lots of conceptual solutions to AI alignment, like “do what I mean”, “don’t do bad things”, “do good things”, “promote human flourishing”, etc.
(this comment and the previous both point at relatively early-stage thoughts; sorry if it seems like I’m equivocating)
even intuitively I would never say that my power changes. Can you give me an example of what such an agent could do that would change its power?
I think there’s a piece of intuition missing from that first claim, which goes something like “power_human-intuitive has to do with easily exploitable opportunities in a given situation”, so it doesn’t matter if the agent is optimal. In that case, gaining a ton of money would increase power.
If by “intuitive” you mean “from the perspective of real humans, even if the agent is optimal / superintelligent”, then I feel like there are lots of conceptual solutions to AI alignment, like “do what I mean”, “don’t do bad things”, “do good things”, “promote human flourishing”, etc.
While I was initially leaning towards this perspective, I’m leaning away now. However, note that this solution still doesn’t have anything to do with human values in particular.
has to do with easily exploitable opportunities in a given situation
Sorry, I don’t understand what you mean here.
However, note that this solution still doesn’t have anything to do with human values in particular.
I feel like I can still generate lots of solutions that have that property. For example, “preserve human autonomy”, “be nice”, “follow norms”, “do what I mean”, “be corrigible”, “don’t do anything I wouldn’t do”, “be obedient”.
All of these depend on the AI having some knowledge about humans, but so does penalizing power.
When I say that our intuitive sense of power has to do with the easily exploitable opportunities available to an actor, that refers to opportunities which e.g. a ~human-level intelligence could notice and take advantage of. This has some strange edge cases, but it’s part of my thinking.
The key point is that AUP_conceptual relaxes the problem:
If we could robustly penalize the agent for intuitively perceived gains in power (whatever that means), would that solve the problem?
This is not trivial. I think it’s a useful question to ask (especially because we can formalize so many of these power intuitions), even if none of the formalizations are perfect.
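One possible shape for such a formalization (a sketch under assumptions, not a definition taken from this thread or the sequence): measure power at a state as the value attainable there by a bounded, roughly human-level policy, averaged over some distribution of goals,

$$\text{POWER}(s) \;\approx\; \mathbb{E}_{R \sim \mathcal{D}}\!\left[\, \max_{\pi \in \Pi_{\text{bounded}}} V^{\pi}_{R}(s) \,\right],$$

where $\mathcal{D}$ is a distribution over reward functions and $\Pi_{\text{bounded}}$ is a class of policies a ~human-level actor could actually execute. Taking $\Pi_{\text{bounded}}$ to be all policies recovers the usual optimal-value notion; restricting it is one way to capture the “easily exploitable opportunities” intuition above.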
The key point is that AUP_conceptual relaxes the problem:
If we could robustly penalize the agent for intuitively perceived gains in power (whatever that means), would that solve the problem?
This is not trivial.
Probably I’m just missing something, but I don’t see why you couldn’t say something similar about:
“preserve human autonomy”, “be nice”, “follow norms”, “do what I mean”, “be corrigible”, “don’t do anything I wouldn’t do”, “be obedient”
E.g.
If we could robustly reward the agent for intuitively perceived nice actions (whatever that means), would that solve the problem?
It seems like the main difference is that, for power in particular, there’s more hope that we could formalize power without reference to humans (which seems harder to do for e.g. “niceness”), but then my original point applies.
(This discussion was continued privately – to clarify, I was narrowly arguing that AUP_conceptual is correct, but that this should only provide a mild update in favor of implementations working in the superintelligent case.)