There are well-known issues with needing a special “Status quo” state. Figuring out what humans would consider the “default” action and then using the right method of counterfactually evaluating its macro-scale effects (without simulating the effects of confused programmers wondering why it turned itself off, or similar counterfactual artifacts) is an unsolved problem. But we can pretend it’s solved for now.
On the contrary, the approach accounts for (and in fact benefits from) counterfactual reactions. The counterfactual reactions we would ideally have are quite natural: shutting the agent down if it does things we don't like, and not shutting it down before the end of the epoch if it simply stops doing things (an unsurprising reaction to a low-impact agent). As you probably noticed later, we just specify the standby action as the default.
One exception to this is the long-term penalty noise introduced by slight variation in our propensity to shut the agent down, which I later flag as a potential problem.
[there is change] as long as your set U is closed under one of several simple operations (including multiplying by −1!).
False, as I understand it. This is a misconception I’ve heard from multiple people – including myself, the first time I tried to prove this. Consider again the line:
Suppose u rates trajectories in which the agent ends up in A, B, or C as −1, and in D as 1, and that ¬u := −u. If the agent is at A and m=2, moving right increases Q_u while keeping Q_{¬u} constant.
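To make the arithmetic concrete, here is a minimal sketch of that computation (a toy rendering of my own, treating Q_u(s) as simply the best u-value achievable within m steps from state s in this deterministic line world):

```python
# Toy line world A-B-C-D; the agent can stay, move left, or move right each step.
STATES = ["A", "B", "C", "D"]

def u(state):
    # u rates ending in A, B, or C as -1 and ending in D as 1.
    return 1 if state == "D" else -1

def neg_u(state):
    # The negated utility from the example above.
    return -u(state)

def attainable(utility, state, m):
    """Best utility over the states reachable within m steps (a stand-in for Q)."""
    i = STATES.index(state)
    reachable = STATES[max(0, i - m): i + m + 1]
    return max(utility(s) for s in reachable)

print(attainable(u, "A", 2), attainable(u, "B", 2))          # -1  1  -> Q_u increases
print(attainable(neg_u, "A", 2), attainable(neg_u, "B", 2))  #  1  1  -> Q_¬u unchanged
```

Gaining the option to reach D raises Q_u, but it doesn't take away the option of ending in A, B, or C, so Q_{¬u} stays maximal; attainable values don't simply flip sign when the utility does.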
u(emptytape)=0, which makes it sound to me like you’re talking about the utility of a history that is entirely empty
I am.
In fact, even this talk about empty tape is a little bit misleading, because this agent doesn’t necessarily have an understanding of itself as having a physical embodiment with actual tape.
We’re only (formally) talking about a Cartesian agent, right?
The numbers assigned might be misleading—by making everything be the same from a low-complexity perspective, an agent might be able to achieve a big impact on a high-complexity goal. And conversely, the numbers might be wrong because we don’t understand the scaling of the AI’s “primary” reward—if it sees a plan that could lead to 3^^^3 paperclips, it’s probably just going to do it, if it ever does anything at all
I’m not sure whether you still have this thought later, but the first point is addressed by my comments in “utility selection”. Second, the primary u_A is also bounded to [0,1].
But imagine a car driving down a narrow tunnel that then opens up into an intersection. A penalty scaling that is optimal in the tunnel will cause the car to shut down as soon as it comes near the intersection. That doesn’t seem like what I want.
This is more related to the question of “how can it do things where interruption would be impactful?” A chauffeur-u_A agent wouldn’t bother driving down the tunnel itself; it would probably just build a self-driving car that requires only one activation action. If it predicts that activating the car would be low impact (and wouldn’t make us more or less likely to shut it down), it’ll do that. I don’t see a problem with the penalty scaling here, but maybe I haven’t quite understood your point.
Wait… is ImpactUnit an actual, physical machine that has to be capable of actually producing the effect? Does the agent have any incentive to preserve the physical impact unit? I guess this is a reason to update ImpactUnit during the course of plans and use the minimum recorded ImpactUnit, but still, this seems ugly.
Yes, and provably yes (as in, it’ll never increase ImpactUnit on purpose). Why does this seem ugly? The agent has a reference action that immediately uses a tiny amount of resources; that lets us define a budget.
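For concreteness, here’s a rough sketch of that bookkeeping (a simplification of my own, not the post’s exact definition): re-measure the reference action’s penalty as the plan unfolds and scale by N times the smallest value recorded so far, so tampering with the reference machine can never loosen the budget.

```python
# Rough sketch: scale each step's penalty by N times the minimum recorded ImpactUnit.
# Assumes the measured reference penalties are positive.
def scaled_penalty(raw_penalties, reference_penalties, N):
    """Total scaled penalty of a plan.

    raw_penalties[t]       -- penalty of the action actually taken at step t
    reference_penalties[t] -- measured penalty of the reference (ImpactUnit) action at step t
    """
    impact_unit = float("inf")
    total = 0.0
    for raw, ref in zip(raw_penalties, reference_penalties):
        impact_unit = min(impact_unit, ref)   # keep the minimum recorded ImpactUnit
        total += raw / (N * impact_unit)      # a later, inflated ImpactUnit doesn't help
    return total
```

Because the scale only ever shrinks, making the reference action look more impactful later in the plan buys the agent nothing.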
The Beware of Dog experiment doesn’t really show what you say it shows, because the agent could have gone at an earlier cycle of the dog.
I checked this by increasing plan length—it is indeed waiting until near the end of the plan.
But if the penalty is even slightly lower than the tipping point, you’ll collect as much money as you can, because it’s worth more in utility than it changes the weighted sum of other utilities.
I don’t understand why this isn’t taken care of by u_A being bounded. Diminishing returns will kick in at some point, and in any case we proved that the agent will never choose to have more than N•ImpactUnit of impact.
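Roughly, glossing over the exact per-step bookkeeping, and writing Penalty(plan) for the summed attainable-utility changes: the agent ranks plans by u_A(plan) − Penalty(plan)/(N•ImpactUnit), with u_A bounded to [0,1]. If Penalty(plan) > N•ImpactUnit, the scaled penalty exceeds 1 ≥ u_A(plan), so the plan scores below standby, which has (essentially) zero penalty and nonnegative u_A. “Collect as much money as you can” therefore still caps out at the N•ImpactUnit budget, no matter how steep the money-to-utility exchange rate is.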
As stated, the penalty calculation runs on pure correlation. So anything that “influences the agent’s action” in an EDT-violating way, or that depends on the output of the agent’s computation itself (e.g. XOR blackmail) will give a weird (possibly undefined or unprincipled) result.
I don’t see why, but I also don’t know much DT yet. I’ll defer discussion of this matter to others. Alternatively, ask me in a few months?
An unaligned agent that is having its N slowly increased by humans is going to do nothing until it is far, far too late. This is because the default action still leads to its N being increased, which seems like a pretty big power gain to me, so it will only act if acting can give it a similarly big power gain.
First, the agent grades future plans using its present N. Second, this isn’t a power gain, since none of the U_A utilities are AUP—how would this help arbitrary maximizers wirehead? Third, agents with different N are effectively maximizing different objectives.
Also I’m not sure these agents won’t acausally cooperate.
They might, you’re correct. What’s important is that they won’t be able to avoid penalty by acausally cooperating.
I think you’re giving out checkmarks too easily. What seem to you like minor details that just need a little straightening up will, a third of the time every time, contain hidden gotchas.
This is definitely a fair point. My posterior on handling these “gotchas” for AUP is that fixes are rather easily derivable; this is mostly a function of my experience thus far. It’s certainly possible that we’ll run across something AUP is fundamentally unable to overcome, but I don’t find that very likely right now. In any case, I hope the disclaimer I gave before the checkmarks reinforced that not all of these have been rock-solid proven at this point.
Thanks so much for the detailed commentary!