I agree with this basic point and it seems important, thanks.
It seems like there are two qualitatively different concerns when trying to optimize for X, that probably need to be distinguished / thought about separately:
Upstream daemon: directly optimizes something that is upstream of X on the training distribution.
Downstream daemon: optimizes something that is downstream of X on the training distribution (e.g. because survival is a convergent instrumental goal and X is upstream of survival), and so backwards chains through X.
Obviously the real situation can be a complicated mixture, and this is not a clean distinction even apart from that.
The arguments in the OP only plausibly apply to downstream daemons. I think they make the most sense in terms of making induction benign.
I’ve normally thought of upstream daemons as much more likely, but much easier to deal with:
In the case of upstream daemons, the most natural response is to probe a broad range of situations similar to the ones you care about. This requires a lot more understanding, but it seems quite realistic to try to avoid the situation where (Y --> empowering humans) on the training distribution but (Y --> disempowering humans) on the slightly expanded distribution. (Compared to others around here, this probably seems more plausible to me because I have an intuition about a basin of attraction around corrigibility, and am imagining consistently retraining the agent such that it doesn’t need to deal with giant distributional shift.)
In the case of downstream daemons, you have to deal with an adversary who is motivated to look for the first opportunity to defect that you won’t be able to notice. This pits much more optimization pressure against your training regime. In particular, the “distributional shift” in this case is caused by the agent being powerful enough to seize control, and therefore it is very difficult to avoid.
While I usually flag these as two potentially distinct concerns, they do run together a lot in my head, as evidenced by this post. I’m not sure if it’s possible to cleanly distinguish them, or how. The right distinction may also be something else, e.g. focusing directly on the possibility of a treacherous turn.
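As a toy illustration of the upstream case (all of the numbers and names below, e.g. `proxy_Y`, `intended_X`, and the indicator/flag setup, are invented for the example): a proxy objective that tracks X on the training distribution can come apart from X on a slightly expanded distribution, even with no deliberate deception anywhere.

```python
# Toy sketch of an "upstream" proxy objective Y that is correlated with the
# intended objective X on the training distribution, but comes apart on a
# slightly expanded distribution.  Nothing here models real training dynamics.
import random

random.seed(0)

def intended_X(state, action):
    # What we actually care about: act only when the flag is genuinely on.
    return 1.0 if action == "act" and state["flag"] else 0.0

def proxy_Y(state, action):
    # The daemon's objective: respond to an indicator that merely correlates
    # with the flag on the training distribution.
    return 1.0 if action == "act" and state["indicator"] else 0.0

def sample_state(expanded=False):
    flag = random.random() < 0.5
    # During training the indicator tracks the flag almost perfectly; on the
    # slightly expanded distribution the correlation is much weaker.
    p_match = 0.99 if not expanded else 0.6
    indicator = flag if random.random() < p_match else not flag
    return {"flag": flag, "indicator": indicator}

def proxy_policy(state):
    # The behavior the optimizer "found": greedily maximize Y.
    return "act" if state["indicator"] else "wait"

for expanded in (False, True):
    states = [sample_state(expanded) for _ in range(10_000)]
    avg_x = sum(intended_X(s, proxy_policy(s)) for s in states) / len(states)
    avg_y = sum(proxy_Y(s, proxy_policy(s)) for s in states) / len(states)
    label = "expanded" if expanded else "training"
    print(label, f"avg X = {avg_x:.3f}", f"avg Y = {avg_y:.3f}")
    # Y stays flat across both distributions while X drops off-distribution.
```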
I think it makes sense to classify daemons into two types the way you do. Interestingly MIRI seems to be a lot more concerned about what you call upstream daemons. The Arbital page you linked to only talks about upstream daemons and the Google Doc “MIRI notes on alignment difficulty” seems to be mostly about that too. (What is it with people keeping important AI safety documents in private Google Docs these days, with no apparent plans of publication? Do you know any others that I’m not already shared on, BTW?)
and am imagining consistently retraining the agent such that it doesn’t need to deal with giant distributional shift
I don’t recall you writing about this before. How do you see this working? I guess with LBO you could train a complete “core for reasoning” and then amplify that to keep retraining the higher level agents on broader and broader distributions, but how would it work with HBO, where the human overseer’s time becomes increasingly scarce/costly relative to the AI’s as AIs get faster? I’m also pretty concerned about the overseer running into their own lack of robustness against distributional shifts if this is what you’re planning.
Interestingly MIRI seems to be a lot more concerned about what you call upstream daemons. The Arbital page you linked to only talks about upstream daemons and the Google Doc “MIRI notes on alignment difficulty” seems to be mostly about that too.
I think people (including at MIRI) normally describe daemons as emerging from upstream optimization, but then describe them as becoming downstream daemons as they improve. Without the second step, it seems hard to be so pessimistic about the “normal” intervention of “test in a wider range of cases.”
how would it work with HBO, where the human overseer’s time becomes increasingly scarce/costly relative to the AI’s as AIs get faster?
At time 0 the human trains the AI to operate at time 1. At time T>>0 the AI trains itself to operate at time T+1; at some point the human no longer needs to be involved—if the AI is actually aligned on inputs that it encounters at time T, then it has a hope of remaining aligned on inputs it encounters at time T+1.
I spoke a bit too glibly though. I think there are lots of possible approaches for dealing with this problem; each of them slightly increases my optimism, and this isn’t the most important:
Retraining constantly. More generally, only using the AI for a short period of time before building a completely new AI. (I think that humans basically only need to solve the alignment problem for AI-modestly-better-than-humans-at-alignment, and then we leave the issue up to the AI.)
Using techniques here to avoid “active malice” in the worst case. This doesn’t include all cases where the AI is optimizing a subgoal which is no longer correlated with the real goal. But it does include cases where that subgoal then involves disempowering the human instrumentally, which seems necessary to really have a catastrophe.
I think there is some real sense in which an upstream daemon (of the kind that could appear for a minimal circuit) may be a much smaller problem, though this requires much more understanding.
I’m also pretty concerned about the overseer running into their own lack of robustness against distributional shifts if this is what you’re planning.
I think this is definitely an additional difficulty. Right now I think accidentally introducing consequentialists is a somewhat larger concern, either daemons from the distillation step or weird memetic patterns in the amplification step, but hopefully at some point I’ll be focusing on this problem.
Without the second step, it seems hard to be so pessimistic about the “normal” intervention of “test in a wider range of cases.”
Another way to be pessimistic is to expect that if the test fails on a wider range of cases, it will be unclear how to proceed at that point, and less safety-conscious AI projects may take the lead before you figure that out. (I think this, or a similar point, was made in the MIRI doc.)
At time 0 the human trains the AI to operate at time 1. At time T>>0 the AI trains itself to operate at time T+1; at some point the human no longer needs to be involved—if the AI is actually aligned on inputs that it encounters at time T, then it has a hope of remaining aligned on inputs it encounters at time T+1.
I don’t think this can work if you’re just doing naive imitation learning? Do you have some other training method in mind?
I don’t think this can work if you’re just doing naive imitation learning? Do you have some other training method in mind?
To be clear, I’m imagining imitation learning + amplification. So the agent at time T engages in some deliberative process to produce training targets for the agent at time T+1. The agent at time T also deliberates in order to choose what situations the agent at time T+1 should train on.
What obstruction do you have in mind?
(I’m imagining using imitation+RL rather than pure imitation, but the difference won’t help with this question.)
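For concreteness, here is a cartoon of the loop being described, with a lookup table standing in for the fast agent and Fibonacci standing in for HBO reasoning; `amplify` and `distill` are placeholder names I chose, not references to any actual implementation.

```python
# Cartoon of "imitation learning + amplification": the agent at step T is
# amplified by decomposing questions into subquestions it can already answer,
# and the amplified system supplies both the training targets and the (slightly
# broader) training situations for the agent at step T+1.

def f_true(n):
    # The "task": Fibonacci, standing in for the behavior we ultimately want.
    return n if n < 2 else f_true(n - 1) + f_true(n - 2)

class Agent:
    """A 'fast' agent: a lookup table, only competent on its training set."""
    def __init__(self, table):
        self.table = dict(table)
    def answer(self, n):
        return self.table[n]

def amplify(agent):
    """Slow overseer: break the question into subquestions the agent handles."""
    def answer(n):
        return n if n < 2 else agent.answer(n - 1) + agent.answer(n - 2)
    return answer

def distill(overseer_answer, situations):
    """Imitation learning: the next agent copies the overseer's answers."""
    return Agent({n: overseer_answer(n) for n in situations})

# Time 0: the human trains the agent directly on a small range of inputs.
agent = Agent({n: f_true(n) for n in range(5)})
frontier = 5

# Later: the AI trains itself, expanding the distribution one step at a time.
for _ in range(20):
    overseer = amplify(agent)
    situations = range(frontier + 1)   # the overseer chooses what to train on
    agent = distill(overseer, situations)
    frontier += 1

print(agent.answer(24), f_true(24))    # both 46368: correctness is preserved
```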
By “naive imitation learning” I was thinking “without amplification”. With amplification, I’m less sure it won’t work but it still seems pretty iffy. The plan seems to depend on at least the following:
We can create a small set of training data (small because it is generated by expensive humans) that is representative of the data manifold of HBO reasoning, without missing some important part of it.
The data manifold of HBO reasoning is universal, i.e., all future tasks can be broken down (recursively) into subtasks that lie on this manifold.
At each iteration of amplification, the agent being amplified knows how to break down an input task into subtasks that lie on (or are not too far from) its own training distribution. It’s not clear to me how to do this, for example how the agent can obtain a simple enough representation of its own training distribution in order to reason about this problem.
The AI at time T can predict the test distribution at time T+1 well enough to generate training data for it. This seems hard to ensure given that the environment is likely to contain hard-to-predict elements like other agents, including adversarial agents. (This may not be a dealbreaker if the AI can detect out-of-distribution inputs at time T+1 and ask for further training data on them. Is this what you have in mind?)
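A minimal sketch of the “detect out-of-distribution inputs and ask for further training data” fallback mentioned in point 4; the nearest-neighbor novelty score, the threshold, and the class name are arbitrary choices for illustration, not a proposal for how OOD detection would actually be done.

```python
# Sketch: escalate to the (expensive) overseer on inputs that look novel, and
# fold the overseer's answer back into the training set for the next iteration.
import math

class CautiousAgent:
    def __init__(self, training_inputs, threshold=2.0):
        self.training_inputs = list(training_inputs)  # inputs seen in training
        self.threshold = threshold

    def novelty(self, x):
        # Distance to the nearest training input; large means off-distribution.
        return min(math.dist(x, t) for t in self.training_inputs)

    def policy(self, x):
        return "routine answer"   # placeholder for the learned behavior

    def act(self, x, overseer):
        if self.novelty(x) > self.threshold:
            # Don't guess: ask for further training data on this input.
            label = overseer(x)
            self.training_inputs.append(x)
            return label
        return self.policy(x)

agent = CautiousAgent(training_inputs=[(0.0, 0.0), (1.0, 1.0), (0.5, 0.2)])
print(agent.act((0.6, 0.3), overseer=lambda x: "overseer answer"))  # in-distribution
print(agent.act((9.0, 9.0), overseer=lambda x: "overseer answer"))  # escalated
```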
I don’t see why to separate 1 and 2; the goal is to find training data that describes some “universal” core for behavior.
3. I don’t think you need to know the training distribution. You just need something that points you back in the direction of the universal core where the human model is competent, e.g. an appropriate notion of simplicity.
4. Hard-to-predict inputs aren’t intrinsically a problem. If your agent fails malignly on input x, but not on distribution D, then your agent is able to distinguish x from D. So the difficulty comes from inputs that are easy to recognize but hard to generate. These certainly exist (e.g. consider a model which kills everyone given a signed edict from the pope). I think the most likely approach is to “reach inside” the model in order to stress test the behavior on inputs that you can’t actually synthesize (e.g. by understanding that it is checking the pope’s signature, and just seeing what would happen if the check passed). This is the endpoint of improvements in these techniques.
(Of course, I don’t think we’ll be able to prevent benign failures in general.)
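A cartoon of the “reach inside the model” idea in point 4, using the pope-signature example: rather than forging a signed edict, override the internal check and see what the rest of the computation would do. Every name here is a stand-in; actually locating such a check inside a learned model is of course the hard part.

```python
# Stress-testing behavior on an input we can't synthesize, by forcing the
# model's internal check to pass instead of producing a real signed edict.

def verify_against_popes_public_key(message):
    return False   # stand-in for an unforgeable cryptographic check

class Model:
    def check_signature(self, message):
        return verify_against_popes_public_key(message)

    def conditional_behavior(self, message):
        return "behavior we would like to know about in advance"

    def act(self, message):
        if self.check_signature(message):
            return self.conditional_behavior(message)  # normally unreachable
        return "normal behavior"

def stress_test(model, message):
    """'Reach inside': force the check to pass, observe the hidden branch."""
    model.check_signature = lambda m: True   # override just for the test
    try:
        return model.act(message)
    finally:
        del model.check_signature            # restore the original method

model = Model()
print(model.act("unsigned edict"))          # -> "normal behavior"
print(stress_test(model, "unsigned edict")) # -> the otherwise-hidden behavior
```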
I don’t see why to separate 1 and 2; the goal is to find training data that describes some “universal” core for behavior.
It seems to me there are separate risks of the human HBO itself not being universal (e.g., humans are not universal or we need even higher bandwidth to be universal), and not being able to capture enough of the human HBO input/output function in a dataset to train an AI to be universal.
3. I don’t think you need to know the training distribution. You just need something that points you back in the direction of the universal core where the human model is competent, e.g. an appropriate notion of simplicity.
What if the path towards the universal core goes through an area where the AI wasn’t trained on?
This is the endpoint of improvements in these techniques.
I think that makes sense, but now you’re making a conjunctive rather than a disjunctive argument (whereas it seemed like you were claiming a disjunctive one by saying “I think there are lots of possible approaches for dealing with this problem” and listing retraining and optimizing worst-case performance as separate approaches).
ETA: If you’re able to obtain a control guarantee over the whole input space, then that seems to solve the problem and you don’t need constant retraining to be aligned. If you’re only able to obtain it for some subset of inputs, then it seems that at time T the AI needs to be able to predict the T+1 test distribution so that it can make sure that’s covered by the control guarantee.
Regarding daemons starting as upstream and becoming downstream...
I think this makes it sound like the goal (call it Y) of the daemon changes, but I usually don’t think of it that way.
What changes is that pursuing Y initially leads to rapidly improving performance at X, but then the performance of X and Y pull apart as the daemon optimizes more heavily for Y.
It seems highly analogous to hacking a learned reward function.
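A worked toy version of “pursuing Y initially improves X, then they pull apart as the daemon optimizes more heavily for Y”; the quadratic relationship and the numbers are invented purely to show the shape of the effect, not to model anything real.

```python
# Toy numbers: X and Y rise together at first, then diverge as Y is pushed harder.

def Y(effort):          # the daemon's objective: more optimization, more Y
    return effort

def X(effort):          # our objective: tracks Y at first, then stops
    return effort - 0.05 * effort ** 2

for effort in range(0, 31, 5):
    print(f"optimization effort {effort:2d}:  Y = {Y(effort):5.1f}   X = {X(effort):6.2f}")

# effort  0:  Y =  0.0   X =   0.00
# effort 10:  Y = 10.0   X =   5.00   <- X and Y still improving together
# effort 30:  Y = 30.0   X = -15.00   <- heavier optimization of Y now hurts X
```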
Isn’t that the problem Christiano is relatively optimistic about solving through training on different distributions? In particular, it seems like we’d notice the daemon’s goals diverging from ours and try to correct it. If I understand Christiano’s point correctly, he is optimistic that we could do so successfully due to the existence of a “basin of corrigibility”. OTOH, the most concerning case would be when the daemon is intentionally trying to hide the fact that its goals are diverging from X, because we may not realize the system is misaligned before the daemon executes a treacherous turn.
Instead, I interpreted “starting as upstream and becoming downstream” as a description of a phenomenon that might look roughly like this:
1. Upstream daemon (UD) is “discovered” by some optimization process because it seems to do a good job optimizing X on the training data
2. The daemon is “initialized” by the outer optimizer and starts “running”
3. The outer optimizer (OO) provides a set of inputs to the daemon with the goal of optimizing X
4. Based on these inputs, the daemon “notices” that the entity that “initialized” it is trying to optimize X rather than Y
5. The UD infers that the outer optimizer will “terminate” it if the OO discovers the UD is optimizing for Y rather than X
6. The UD starts optimizing for X in order to survive (a convergent instrumental goal regardless of Y)
7. The OO observes the UD’s performance on a wide variety of training data and infers that the UD is not in fact an upstream daemon
8. The daemon is “set loose” and proceeds to execute a treacherous turn
As you can tell, my model of how daemons work is extremely lacking; I’ve used quotes for every phrase that feels dubious to me.
One major question I have, which this model (if correct) partially resolves for me: how does the downstream daemon ever “know” what X we are optimizing for, in order for the outer optimizer to “discover” it?
A concrete vision:
Suppose the best a system can do without a daemon is 97% accuracy.
The daemon can figure out how to get 99% accuracy.
But in order to outperform other systems, it can just provide 98% accuracy, and use 1% of inputs to pursue its own agenda.
This all happens on-distribution.
If there are multiple daemon-containing systems competing for survival (with selection happening according to accuracy), this might force them to maximize accuracy, instead of just beating a “non-daemon baseline”.
This is all only relevant to downstream daemons, right? If so, I don’t understand why the DD would ever provide 98% accuracy; I’d expect it to provide 99% accuracy until it sees a chance to provide [arbitrarily low]% accuracy and start pursuing its agenda directly. As you say, this might happen due to competition between daemon-containing systems, but I think a DD would want to maximize its chances of survival by maximizing its accuracy either way.
I think it’s relevant for either kind (actually, I’m not sure I like the distinction, or find it particularly relevant).
If there aren’t other daemons to compete with, then 98% is sufficient for survival, so why not use the extra 1% to begin pursuing your own agenda immediately and covertly? This seems to be how principal-agent problems often play out in real life with humans.
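A toy version of the selection argument in this exchange, using the 97%/99% numbers from the hypothetical above; nothing here is an empirical claim, just the arithmetic of who survives.

```python
# Accuracies in tenths of a percent to avoid floating-point noise.
baseline = 970            # 97.0%: best non-daemon system
daemon_capability = 990   # 99.0%: what the daemon could display if it tried

def displayed_accuracy(defection):   # defection rate, also in tenths of a percent
    return daemon_capability - defection

# Against the non-daemon baseline alone, any defection rate below 2% survives.
survivors = [d for d in range(0, 31) if displayed_accuracy(d) > baseline]
print("defection rates (0.1% units) that beat the baseline:", survivors[0], "to", survivors[-1])

# With competing daemon-containing systems selected on accuracy, the lowest
# defection rate wins, pushing defection toward zero.
competing_defection_rates = [20, 10, 5, 1]
print("defection rate that survives competition:", min(competing_defection_rates))
```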
I am interested as well. Please share the docs in question with my LW username at gmail dot com if that is a possibility. Thank you!
You should contact Rob Bensinger since he’s the owner of the document in question. (It looks like I technically can share the document with others, but I’m not sure what Rob/MIRI’s policy is about who that document should be shared with.)
(Summarizing/reinterpreting the upstream/downstream distinction for myself):
“upstream”: has a (relatively benign?) goal which actually helps achieve X
“downstream”: doesn’t
Coincidentally I’m also trying to understand this post at the same time, and was somewhat confused by the “upstream”/”downstream” distinction.
What I eventually concluded was that there are 3 ways a daemon that intrinsically values optimizing some Y can “look like” it’s optimizing X:
Y = X (this seems both unconcerning and unlikely, and thus somewhat irrelevant)
optimizing Y causes optimization pressure to be applied to X (upstream daemon, describes humans if Y = our actual goals and X = inclusive genetic fitness)
The daemon is directly optimizing X because the daemon believes this instrumentally helps it achieve Y (downstream daemon, e.g. if optimizing X helps the daemon survive)
Does this seem correct? In particular, I don’t understand why upstream daemons would have to have a relatively benign goal.
Yeah that seems right. I think it’s a better summary of what Paul was talking about.