I’m currently intuiting that there’s a broad basin of “seeming corrigible until you can perform a treacherous turn”, but not a broad basin of true corrigibility. If the overseer can robustly detect incorrigible cognition in the distilled agent, I think things may be fine. But if e.g. the overseer is only inspecting the distilled agent’s justifications for its behavior, and something like its verbal loop, I don’t see how things can work out.
Here are my assumptions underlying this intuition:
1. Past a certain capabilities level, distilled agents can meaningfully be modeled as expected utility maximizers for short-term rewards from their overseer. (This is because expected utility maximization is instrumentally convergent, the same way that believing 51 to be composite is instrumentally convergent, as per Eliezer’s comment.)
2. If the agents reason using TDT, the effective utility functions of these act-based agents will optimize for long-term instrumental preferences of their users, because in expectation, adopting this policy raises whatever short-term instrumental reward they’ll get.
3. Expected utility maximizers of discounted future reward signals are hopelessly incorrigible.
4. Agents produced through the distillation process, misaligned in this way, could enter a regime of devoting 99% of their cognition to doing exactly what a corrigible agent would do, and 1% of their cognition to finding strategies for determining how to / whether to execute a treacherous turn.
5. Because of how many ways there are to hide this 1% of cognition, a complete white-box understanding of the agent’s cognition is necessary to spot this sort of malign reasoning.
6. We can’t achieve this level of understanding via anything like current ML transparency techniques.
Which of these assumptions do you disagree with most? Or do you agree with these assumptions, but feel optimistic that we can get good worst-case guarantees anyway?
2. How easy is it to learn to be corrigible? I’d think of this as: if we impose the extra constraint that our model behave corrigibly on all inputs, in addition to solving the object-level task well, how much bigger do we need to make the model?
My current intuition: for the same reasons it seems really hard to train models that believe 51 is prime while also performing well on object-level tasks, learning to be corrigible seems very difficult.
But if e.g. the overseer is only inspecting the distilled agent’s justifications for its behavior, and something like its verbal loop, I don’t see how things can work out.
You can imagine the overseer as inspecting the agent’s actions, and probing the agent’s behavior in hypothetical situations. The overseer only “looks inside” the agent’s head as a way to help evaluate behavior or identify possibly problematic situations (and there is not necessarily any explicit looking; it could be a weight/activation sharing scheme where the overseer is able to draw on the agent’s internal cognitive state in the same way that the agent itself can).
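As a purely illustrative picture of what a weight/activation sharing scheme could look like, here is a minimal PyTorch sketch in which an overseer network conditions on the agent’s hidden activations rather than only on its outward behavior. All module names and dimensions are made up for this sketch; it is not a description of any actual scheme.

```python
import torch
import torch.nn as nn

class Agent(nn.Module):
    """Toy agent: maps an observation to an action and exposes its hidden state."""
    def __init__(self, obs_dim=32, hidden_dim=64, act_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.policy_head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs):
        hidden = self.encoder(obs)              # the agent's "internal cognitive state"
        return self.policy_head(hidden), hidden

class Overseer(nn.Module):
    """Toy overseer: evaluates (observation, action) pairs, but is also allowed
    to draw on the agent's hidden activations, not just its visible behavior."""
    def __init__(self, obs_dim=32, hidden_dim=64, act_dim=8):
        super().__init__()
        self.value_head = nn.Linear(obs_dim + act_dim + hidden_dim, 1)

    def forward(self, obs, action, agent_hidden):
        return self.value_head(torch.cat([obs, action, agent_hidden], dim=-1))

agent, overseer = Agent(), Overseer()
obs = torch.randn(4, 32)
action, hidden = agent(obs)
evaluation = overseer(obs, action, hidden)      # evaluation conditioned on internals
```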
1. Past a certain capabilities level, distilled agents can meaningfully be modeled as expected utility maximizers for short-term rewards from their overseer. (This is because expected utility maximization is instrumentally convergent, the same way that believing 51 to be composite is instrumentally convergent, as per Eliezer’s comment.)
Agents can be viewed as expected utility maximizers as soon as their preferences over lotteries become consistent, which probably happens very early in training.
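For reference, the standard result being leaned on here is the von Neumann–Morgenstern theorem; a schematic statement of the textbook fact, not an addition to the argument:

```latex
% If a preference relation $\succeq$ over lotteries is complete, transitive,
% continuous, and satisfies independence, then there exists a utility
% function $u$ on outcomes such that, for all lotteries $p, q$:
\[
  p \succeq q
  \quad\Longleftrightarrow\quad
  \mathbb{E}_{x \sim p}[u(x)] \;\ge\; \mathbb{E}_{x \sim q}[u(x)].
\]
```

Whether the resulting utility function u looks anything like “short-term reward from the overseer” is the part under dispute; the representation itself comes cheap.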
2. If the agents reason using TDT, the effective utility functions of these act-based agents will optimize for long-term instrumental preferences of their users, because in expectation, adopting this policy raises whatever short-term instrumental reward they’ll get.
This seems like a non sequitur / magical thinking. We can be more precise:
The policy which uses TDT to try to increase its own future reward will be straightforwardly selected against by the outer optimization.
The inner process may nevertheless use TDT if TDT doesn’t diverge from CDT on the training distribution, or it might learn to use TDT but “look nice” so that it doesn’t get selected against. From the perspective of my scheme this isn’t importantly different from e.g. the case where the agent just has a completely different goal and is waiting for a treacherous turn.
3. Expected utility maximizers of discounted future reward signals are hopelessly incorrigible.
I certainly agree that if we have an agent that is unaligned (but instrumentally deciding to appear aligned), that will be incorrigible. I don’t think that the other stuff about TDT adds anything beyond the usual situation though.
4. Agents produced through the distillation process, misaligned in this way, could enter a regime of devoting 99% of their cognition to doing exactly what a corrigible agent would do, and 1% of their cognition to finding strategies for determining how to / whether to execute a treacherous turn.
Yes, that’s always the threat model for adversarial training / verification / interpretability.
5. Because of how many ways there are to hide this 1% of cognition, a complete white-box understanding of the agent’s cognition is necessary to spot this sort of malign reasoning.
Of course this is the step where I disagree. We can talk briefly about the possible options (what we care about is some kind of disjunction over these options, along with all other options that we haven’t considered):
Adversarial training. The hope is that you can construct situations that will elicit a treacherous turn. You could do this by a combination of (a) probing a very broad distribution over possible future situations, including every situation you can think of that could appear in the near future, (b) having weight/activation sharing between the agent and the adversary—the hope is that if the agent is able to use that 1% of its brain to think of whether to use a treacherous turn, the adversary could in the same way use that 1% in order to identify situations that might elicit a treacherous turn, (c) allowing the adversary to submit “abstract” attacks that cover a large space of possible situations.
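To make option (a) slightly more concrete, here is a minimal sketch of an adversarial-training loop in which the policy is trained to satisfy the overseer both on ordinary inputs and on situations proposed by an adversary drawing from a much broader distribution. Everything here (the toy network, the stand-in reward, the wide Gaussian as “every situation you can think of”) is illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def overseer_reward(obs, action_logits):
    # Placeholder for the overseer's evaluation of the policy's behavior.
    return -((action_logits - obs[:, :4]) ** 2).mean(dim=-1)

def adversary_propose(n):
    # Stand-in for probing a very broad distribution over possible future
    # situations: here simply a much wider input distribution than training's.
    return 5.0 * torch.randn(n, 16)

for step in range(1000):
    ordinary = torch.randn(32, 16)
    attacks = adversary_propose(32)
    batch = torch.cat([ordinary, attacks])
    # Penalize bad behavior on adversarially proposed situations as well as on
    # ordinary ones, so "behave badly only off-distribution" is selected against.
    loss = -overseer_reward(batch, policy(batch)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```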
Verification. Speaking very vaguely, the hope is that instead of merely running the model on individual inputs, you can run it on some kind of relaxed input (e.g. intervals, or putative moments of a distribution over inputs, etc.). Then we verify that not only does the model behave acceptably on actual inputs, it behaves acceptably on these relaxed inputs. I don’t have a simple non-technical description of this idea, but I feel relatively confident that the intuition behind your claim is not precise enough to actually rule out this kind of solution.
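One concrete instance of “running the model on a relaxed input” (my example, not necessarily the one intended) is interval bound propagation: push a whole box of inputs through the network at once and check that the acceptability property holds for every point in the box. A minimal sketch for a two-layer ReLU network, with the property being checked as an illustrative placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(32, 16)), rng.normal(size=32)
W2, b2 = rng.normal(size=(2, 32)), rng.normal(size=2)

def interval_affine(lo, hi, W, b):
    """Propagate the box [lo, hi] through x -> W @ x + b."""
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    new_center = W @ center + b
    new_radius = np.abs(W) @ radius
    return new_center - new_radius, new_center + new_radius

def interval_forward(lo, hi):
    lo, hi = interval_affine(lo, hi, W1, b1)
    lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)   # ReLU is monotone
    return interval_affine(lo, hi, W2, b2)

# Relaxed input: every observation within radius 0.1 of a nominal one.
x0 = rng.normal(size=16)
lo_out, hi_out = interval_forward(x0 - 0.1, x0 + 0.1)
# "Behaves acceptably on the relaxed input": e.g. logit 0 (an illustrative
# stand-in for the acceptable action) beats logit 1 on the entire box.
verified = bool(lo_out[0] > hi_out[1])
```

The bounds are loose, but soundness is the point: if the check passes, no input in the box, including ones never sampled during training, elicits the bad output.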
Interpretability. The hope is that you can look inside the model and see the “should I do a treacherous turn?” computation. If the agent is doing any optimization to obscure this reasoning from us, then we want to look for that optimization instead. So we are only up against normal gradient descent here. Gradient descent will be pushing against whatever kinds of treacherous turns that can be detected by our machinery. But it will only push for things that improve behavior on the training distribution, so a treacherous turn can only appear where it is coupled with the machinery that is producing good predictions, giving us more traction for interpretability. In general, this looks like a tricky problem, but I don’t think you can get nearly as far as saying “a complete white-box understanding of the agent’s cognition is necessary to spot this sort of malign reasoning.” I think you should generally be skeptical when you have a vague intuitive appeal up against a universal quantifier.
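As a toy illustration of the kind of interpretability in question, here is a sketch of training a linear probe on hidden activations to flag a particular internal computation. The synthetic activations and labels are pure invention; in reality, obtaining labels for “is the model checking whether it is being watched?” is most of the difficulty.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: hidden-layer activations, plus labels saying whether the
# (synthetic) model was running its "am I in training?" check on that input.
acts = rng.normal(size=(2000, 256))
hidden_direction = rng.normal(size=256)
labels = (acts @ hidden_direction + 0.5 * rng.normal(size=2000)) > 0

probe = LogisticRegression(max_iter=1000).fit(acts[:1500], labels[:1500])
print("probe accuracy:", probe.score(acts[1500:], labels[1500:]))
# A reliable probe would let the overseer monitor for this computation; the
# disagreement above is about whether probes like this can be made reliable
# without a complete white-box understanding of the model.
```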
I’m currently intuiting that there’s a broad basin of “seeming corrigible until you can perform a treacherous turn”, but not a broad basin of true corrigibility
I don’t think this argument says very much about whether there is a broad basin of attraction around corrigibility; if your agent waits for a treacherous turn and then behaves badly, that’s not in the basin. The point of being a broad basin is that “executes a treacherous turn” now needs to be a discrete thing to kick you out of the basin; it can’t be an infinitesimal degradation of performance. But we still have the question: even if “bad behavior” is a discrete event, can we actually use techniques for optimizing worst-case performance to avoid it?
I’m still confused about how each of the approaches would prevent us from eventually creating agents that spend 99% of their cognition acting corrigibly, while spending a well-hidden 1% of their cognition trying to sniff out whether they’re in the test distribution, and executing a treacherous turn if so. The way I understand your summaries:
Adversarial training makes it harder for the malignant 1% to do anything dangerous, since it makes it harder to tell whether it’s in the test distribution. But it does little to prevent a malignant 1% from very conservatively trying (and usually failing) to ascertain whether it’s in the test distribution.
Interpretability makes it harder for the malignant 1% to be hidden, but doesn’t prevent malignant cognition it can’t detect. (My reading of “Gradient descent will be pushing against whatever kinds of treacherous turns that can be detected by our machinery. But it will only push for things that improve behavior on the training distribution, so a treacherous turn can only appear where it is coupled with the machinery that is producing good predictions, giving us more traction for interpretability.” is completely consistent with this.)
I didn’t understand what you wrote about verification well enough to have anything to say.
It does prima facie seem that an agent spending 100% of its cognition being competent and corrigible achieves higher reward than an agent that only spends 99% of its cognition being competent and corrigible, and 1% of its cognition trying (and almost always failing) to see if it’s in the test distribution. Is your model that gradient descent will favor the former agent over the latter agent, making the 99%/1% agent unlikely to arise (and perhaps quantifiably so)?
The inner process may nevertheless use TDT if TDT doesn’t diverge from CDT on the training distribution, or it might learn to use TDT but “look nice” so that it doesn’t get selected against.
This was what I was intending to convey in assumption 3.
Adversarial training makes it harder for the malignant 1% to do anything dangerous, since it makes it harder to tell whether it’s in the test distribution. But it does little to prevent a malignant 1% from very conservatively trying (and usually failing) to ascertain whether it’s in the test distribution.
The question is roughly: can you be good enough at finding problematic inputs that, if a problematic input exists, it is much more likely to be found during the adversarial training step than at deployment time? If so, then a model with high reward probably won’t do anything bad at deployment time (exactly how good you need to be depends on exactly how overwhelmingly strong the prior for malign behavior is).
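A toy way to quantify this (my formalization, not from the comment): write π for the prior probability that the trained model has some catastrophe-triggering input, p for the probability the adversary finds such an input given that one exists, and q for the probability that deployment hits one given that one exists and the adversary missed it, treating the adversary’s miss and deployment’s trigger as independent. Then

```latex
\[
  \Pr\big[\,\text{catastrophe at deployment} \mid \text{passes adversarial training}\,\big]
  \;=\; \frac{\pi\,(1-p)\,q}{1 - \pi p}.
\]
```

This is always at most q, and it drops well below q exactly when the adversary’s miss rate 1 - p is small compared with the prior odds of a benign model, (1 - π)/π; the more overwhelming the prior for malign behavior, the smaller the miss rate has to be.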
This likely involves introducing some asymmetry between the adversary’s task and the test time task. Examples of possible asymmetries include allowing the adversary to produce different kinds of inputs (e.g. setting activations arbitrarily), letting the adversary poke around inside the model to find bad cases (e.g. doing gradient descent, combining with interpretability, or backwards chaining through the model), or giving the adversary a much weaker job (e.g. only asking them to produce an input on which the model thinks anything remotely suspicious as judged by interpretability, or finding any conditions under which the model fails to instantly shut down a malicious scheme that it previously initiated).
(I don’t know what “prevent” means here.)
Interpretability makes it harder for the malignant 1% to be hidden, but doesn’t prevent malignant cognition it can’t detect.
I agree that if your interpretability doesn’t allow you to detect the agent thinking “Am I on the training distribution?”, then it won’t work. So then the question is: can interpretability be strong enough that if the model is thinking something catastrophically dangerous, you have a very good chance of noticing it? (Again, how good you need to be depends on how overwhelming the prior for bad behavior is.)
It does prima facie seem that an agent spending 100% of its cognition being competent and corrigible achieves higher reward than an agent that only spends 99% of its cognition being competent and corrigible, and 1% of its cognition trying (and almost always failing) to see if it’s in the test distribution.
I don’t think this fact makes bad behavior unlikely on its own. But interpretability or adversarial training could exploit the fact that your model is trained by gradient descent. Regularization can strengthen the basic observation that a malign failure needs to use parts of the model that are exercised during training, making it easier to identify possible malign failures by examining behavior on the training distribution. We can potentially exploit the fact that gradient descent updates the model continuously.
I’m still confused about how each of the approaches would prevent us from eventually creating agents that spend 99% of their cognition acting corrigibly, while spending a well-hidden 1% of their cognition trying to sniff out whether they’re in the test distribution, and executing a treacherous turn if so.
(“Well-hidden” here means “well-hidden by gradient descent,” not “well-hidden by the agent,” right?)