If I’m an optimal agent with perfect beliefs about what the (deterministic) world will do, even intuitively I would never say that my power changes. Can you give me an example of what such an agent could do that would change its power?
If by “intuitive” you mean “from the perspective of real humans, even if the agent is optimal / superintelligent”, then I feel like there are lots of conceptual solutions to AI alignment, like “do what I mean”, “don’t do bad things”, “do good things”, “promote human flourishing”, etc.
(this comment and the previous both point at relatively early-stage thoughts; sorry if it seems like I’m equivocating)
even intuitively I would never say that my power changes. Can you give me an example of what such an agent could do that would change its power?
I think there’s a piece of intuition missing from that first claim, which goes something like “power_human-intuitive has to do with easily exploitable opportunities in a given situation”, so it doesn’t matter if the agent is optimal. In that case, gaining a ton of money would increase power.
If by “intuitive” you mean “from the perspective of real humans, even if the agent is optimal / superintelligent”, then I feel like there are lots of conceptual solutions to AI alignment, like “do what I mean”, “don’t do bad things”, “do good things”, “promote human flourishing”, etc.
While I was initially leaning towards this perspective, I’m leaning away now. Note, however, that this solution still doesn’t have anything to do with human values in particular.
has to do with easily exploitable opportunities in a given situation
Sorry, I don’t understand what you mean here.
Note, however, that this solution still doesn’t have anything to do with human values in particular.
I feel like I can still generate lots of solutions that have that property. For example, “preserve human autonomy”, “be nice”, “follow norms”, “do what I mean”, “be corrigible”, “don’t do anything I wouldn’t do”, “be obedient”.
All of these depend on the AI having some knowledge about humans, but so does penalizing power.
When I say that our intuitive sense of power has to do with the easily exploitable opportunities available to an actor, that refers to opportunities which e.g. a ~human-level intelligence could notice and take advantage of. This has some strange edge cases, but it’s part of my thinking.
The key point is that AUP_conceptual relaxes the problem:
If we could robustly penalize the agent for intuitively perceived gains in power (whatever that means), would that solve the problem?
This is not trivial. I think it’s a useful question to ask (especially because we can formalize so many of these power intuitions), even if none of the formalizations are perfect.
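For concreteness, here’s a minimal sketch of what one such formalization could look like, in the spirit of AUP (this is illustrative only, not anyone’s actual implementation; the auxiliary Q-functions, the no-op baseline action, and the weighting are all assumptions of the sketch):

```python
# Minimal sketch (illustrative, not the actual AUP implementation):
# proxy an agent's "power" by its attainable utility across a set of
# auxiliary reward functions, and penalize actions that change that
# attainable utility relative to doing nothing.

from typing import Callable, Sequence

State = int
Action = int
NOOP: Action = 0  # assumed "do nothing" baseline action

# Each auxiliary Q-function maps (state, action) to attainable utility
# for one auxiliary goal; these are assumptions of the sketch.
QFunction = Callable[[State, Action], float]


def power_penalty(state: State, action: Action, aux_q: Sequence[QFunction]) -> float:
    """Average absolute change in attainable utility versus the no-op baseline."""
    diffs = [abs(q(state, action) - q(state, NOOP)) for q in aux_q]
    return sum(diffs) / len(diffs)


def shaped_reward(primary: float, state: State, action: Action,
                  aux_q: Sequence[QFunction], impact_weight: float = 1.0) -> float:
    """Primary reward minus a scaled penalty for changing the power proxy."""
    return primary - impact_weight * power_penalty(state, action, aux_q)
```

In implementations along these lines, the auxiliary reward functions can even be chosen randomly, which is part of why none of this references human values in particular.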
The key point is that AUP_conceptual relaxes the problem:
If we could robustly penalize the agent for intuitively perceived gains in power (whatever that means), would that solve the problem?
This is not trivial.
Probably I’m just missing something, but I don’t see why you couldn’t say something similar about:
“preserve human autonomy”, “be nice”, “follow norms”, “do what I mean”, “be corrigible”, “don’t do anything I wouldn’t do”, “be obedient”
E.g.
If we could robustly reward the agent for intuitively perceived nice actions (whatever that means), would that solve the problem?
It seems like the main difference is that for power in particular, there’s more hope that we could formalize it without reference to humans (which seems harder to do for e.g. “niceness”), but then my original point applies.
(This discussion was continued privately – to clarify, I was narrowly arguing that AUP_conceptual is correct, but that this should only provide a mild update in favor of implementations working in the superintelligent case.)