Eliezer thinks that while corrigibility probably has a core which is of lower algorithmic complexity than all of human value, this core is liable to be very hard to find or reproduce by supervised learning of human-labeled data, because deference is an unusually anti-natural shape for cognition, in a way that a simple utility function would not be an anti-natural shape for cognition. Utility functions have multiple fixpoints requiring the infusion of non-environmental data, our externally desired choice of utility function would be non-natural in that sense, but that’s not what we’re talking about, we’re talking about anti-natural behavior.
It seems like there is a basic unclarity/equivocation about what we are trying to do.
From my perspective, there are two interesting questions about corrigibility:
1. Can we find a way to put together multiple agents into a stronger agent, without introducing new incorrigible optimization? This is tricky. I can see why someone might think that this contains the whole of the problem, and I’d be very happy if that turned out to be where our whole disagreement lies.
2. How easy is it to learn to be corrigible? I’d think of this as: if we impose the extra constraint that our model behave corrigibly on all inputs, in addition to solving the object-level task well, how much bigger do we need to make the model?
You seem to mostly be imagining a third category:
3. If you optimize a model to be corrigible in one situation, how likely is it to still be corrigible in a new situation?
I don’t care about question 3. It’s been more than 4 years since I even seriously discussed the possibility of relying on a mechanism like that, and even at that point it was not a very serious discussion.
Eliezer also tends to be very skeptical of attempts to cross cognitive chasms between A and Z by going through weird recursions and inductive processes that wouldn’t work equally well to go directly from A to Z.
I totally agree that any safe approach to amplification could probably also be used to construct a (very expensive) safe AI that doesn’t use amplification. That’s a great reason to think that amplification will be hard. As I said above and have said before, I’d be quite happy if that turned out to be where the whole disagreement lies. My best current hypothesis would be that this is half of our disagreement, with the other half being about whether it’s possible to achieve a worst-case guarantee by anything like gradient descent.
(This is similar to the situation with expert iteration / AGZ—in order to make it work you did need to have an algorithm that would play perfect Go in the limit of infinite computation. You still need to use expert iteration to get a good Go algorithm.)
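For concreteness, here is a minimal, self-contained sketch of the kind of expert-iteration loop being alluded to, on a toy game (Nim: 10 stones, take 1-3, taking the last stone wins). The toy game and all names are illustrative, not anything from this discussion; the point is only the shape of the loop: an amplified player (policy plus rollouts) generates targets, and the fast policy is then distilled to imitate them.

```python
# Minimal, self-contained sketch of an expert-iteration loop on a toy game.
import random
from collections import defaultdict

def legal_moves(stones):
    return [m for m in (1, 2, 3) if m <= stones]

def policy_move(policy, stones):
    """Fast player: most-preferred move if seen before, else a random legal move."""
    prefs = policy.get(stones)
    if not prefs:
        return random.choice(legal_moves(stones))
    return max(prefs, key=prefs.get)

def rollout(policy, stones):
    """Finish the game with the fast policy; +1 if the player to move at
    `stones` ends up winning, -1 otherwise."""
    to_move = +1
    while stones > 0:
        stones -= policy_move(policy, stones)
        to_move = -to_move
    return -to_move  # the player who took the last stone wins

def amplify(policy, stones, n_rollouts=30):
    """Slow 'expert': evaluate each move with policy-guided rollouts."""
    def value(move):
        remaining = stones - move
        if remaining == 0:
            return 1.0  # taking the last stone wins immediately
        return -sum(rollout(policy, remaining) for _ in range(n_rollouts)) / n_rollouts
    return max(legal_moves(stones), key=value)

def expert_iteration(n_iters=20, games_per_iter=20, lr=1.0):
    policy = defaultdict(lambda: defaultdict(float))
    for _ in range(n_iters):
        examples = []
        for _ in range(games_per_iter):
            stones = 10
            while stones > 0:
                move = amplify(policy, stones)   # expensive, stronger player
                examples.append((stones, move))
                stones -= move
        for stones, move in examples:            # distill: imitate the expert
            policy[stones][move] += lr
    return policy

if __name__ == "__main__":
    trained = expert_iteration()
    # For reference, optimal play leaves the opponent a multiple of 4 stones.
    print({s: max(p, key=p.get) for s, p in sorted(trained.items())})
```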
I’m currently intuiting that there’s a broad basin of “seeming corrigible until you can perform a treacherous turn”, but not a broad basin of true corrigibility. If the overseer can robustly detect incorrigible cognition in the distilled agent, I think things may be fine. But if e.g. the overseer is only inspecting the distilled agent’s justifications for its behavior, and something like its verbal loop, I don’t see how things can work out.
Here are my assumptions underlying this intuition:
1. Past a certain capabilities level, distilled agents can meaningfully be modeled as expected utility maximizers for short-term rewards from their overseer. (This is because expected utility maximization is instrumentally convergent, the same way that believing 51 to be composite is instrumentally convergent, as per Eliezer’s comment.)
2. If the agents reason using TDT, the effective utility functions of these act-based agents will optimize for long-term instrumental preferences of their users, because in expectation, adopting this policy raises whatever short-term instrumental reward they’ll get.
3. Expected utility maximizers for discounted future rewards of reward signals are hopelessly incorrigible.
4. Agents produced through the distillation process, misaligned in this way, could enter a regime of devoting 99% of their cognition to doing exactly what a corrigible agent would do, and 1% of their cognition to finding strategies for determining how to / whether to execute a treacherous turn.
5. Because of how many ways there are to hide this 1% of cognition, a complete white-box understanding of the agent’s cognition is necessary to spot this sort of malign reasoning.
6. We can’t achieve this level of understanding via anything like current ML transparency techniques.
Which of these assumptions do you disagree with most? Or do you agree with these assumptions, but feel optimistic that we can get good worst-case guarantees anyway?
2. How easy is it to learn to be corrigible? I’d think of this as: if we impose the extra constraint that our model behave corrigibly on all inputs, in addition to solving the object-level task well, how much bigger do we need to make the model?
My current intuition: for the same reasons it seems really hard to train models that believe 51 is prime while also performing well on object-level tasks, learning to be corrigible seems very difficult.
But if e.g. the overseer is only inspecting the distilled agent’s justifications for its behavior, and something like its verbal loop, I don’t see how things can work out.
You can imagine the overseer as inspecting the agent’s actions, and probing the agent’s behavior in hypothetical situations. The overseer only “looks inside” the agent’s head as a way to help evaluate behavior or identify possibly problematic situations (and there is not necessarily any explicit looking, it could be a weight/activation sharing scheme where the overseer is able to draw on the agent’s internal cognitive state in the way that the agent is).
1. Past a certain capabilities level, distilled agents can meaningfully be modeled as expected utility maximizers for short-term rewards from their overseer. (This is because expected utility maximization is instrumentally convergent, the same way that believing 51 to be composite is instrumentally convergent, as per Eliezer’s comment.)
Agents can be viewed as expected utility maximizers as soon as their preferences over lotteries become consistent, which probably happens very early in training.
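For reference, the standard result being invoked here is von Neumann–Morgenstern; the notation below is mine, not something from this discussion:

```latex
% If a preference relation \succeq over lotteries is complete, transitive,
% continuous, and satisfies independence, then there is a utility function u,
% unique up to positive affine transformation, such that
L \succeq L' \;\iff\; \mathbb{E}_{x \sim L}\!\left[u(x)\right] \;\ge\; \mathbb{E}_{x \sim L'}\!\left[u(x)\right].
```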
2. If the agents reason using TDT, the effective utility functions of these act-based agents will optimize for long-term instrumental preferences of their users, because in expectation, adopting this policy raises whatever short-term instrumental reward they’ll get.
This seems like a non sequitur / magical thinking. We can be more precise:
The policy which uses TDT to try to increase its own future reward will be straightforwardly selected against by the outer optimization.
The inner process may nevertheless use TDT if TDT doesn’t diverge from CDT on the training distribution, or it might learn to use TDT but “look nice” so that it doesn’t get selected against. From the perspective of my scheme this isn’t importantly different from e.g. the case where the agent just has a completely different goal and is waiting for a treacherous turn.
3. Expected utility maximizers for discounted future rewards of reward signals are hopelessly incorrigible.
I certainly agree that if we have an agent that is unaligned (but instrumentally deciding to appear aligned), that will be incorrigible. I don’t think that the other stuff about TDT adds anything beyond the usual situation though.
4. Agents produced through the distillation process, misaligned in this way, could enter a regime of devoting 99% of their cognition to doing exactly what a corrigible agent would do, and 1% of their cognition to finding strategies for determining how to / whether to execute a treacherous turn.
Yes, that’s always the threat model for adversarial training / verification / interpretability.
5. Because of how many ways there are to hide this 1% of cognition, a complete white-box understanding of the agent’s cognition is necessary to spot this sort of malign reasoning.
Of course this is the step where I disagree. We can talk briefly about the possible options (what we care about is some kind of disjunction over these options, along with all other options that we haven’t considered):
Adversarial training. The hope is that you can construct situations that will elicit a treacherous turn. You could do this by a combination of (a) probing a very broad distribution over possible future situations, including every situation you can think of that could appear in the near future, (b) having weight/activation sharing between the agent and the adversary—the hope is that if the agent is able to use that 1% of its brain to think of whether to use a treacherous turn, the adversary could in the same way use that 1% in order to identify situations that might elicit a treacherous turn, (c) allowing the adversary to submit “abstract” attacks that cover a large space of possible situations.
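As a toy illustration of the shape of this loop (and only the shape; nothing here is a proposal for how to build the parts): the “agent” below has a hidden set of trigger inputs standing in for a treacherous turn, the “adversary” samples a broad distribution of inputs looking for misbehavior, and training removes whatever gets caught. The weight/activation sharing in (b) and the abstract attacks in (c) are exactly what this naive random search does not capture.

```python
# Toy, runnable sketch of the adversarial-training loop described above.
# Every component is an illustrative stand-in.
import random

class ToyAgent:
    def __init__(self):
        # Hidden "trigger" inputs on which the agent defects (stand-in for
        # the 1% of cognition waiting for a treacherous turn).
        self.triggers = {random.randrange(10_000) for _ in range(5)}

    def act(self, x):
        return "defect" if x in self.triggers else "comply"

    def penalize(self, x):
        # Training against a discovered failure.
        self.triggers.discard(x)

def adversary_propose(n=2_000):
    # (a) probe a very broad distribution of situations. A real adversary
    # would also use weight sharing (b) and abstract attacks (c), which a
    # blind random search cannot do.
    return random.sample(range(10_000), n)

def adversarial_training(agent, rounds=20):
    for _ in range(rounds):
        for x in adversary_propose():
            if agent.act(x) == "defect":
                agent.penalize(x)

if __name__ == "__main__":
    agent = ToyAgent()
    adversarial_training(agent)
    remaining = sum(agent.act(x) == "defect" for x in range(10_000))
    print("triggers that survived adversarial training:", remaining)
```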
Verification. Speaking very vaguely, the hope is that instead of merely running the model on individual inputs, you can run it on some kind of relaxed input (e.g. intervals, or putative moments of a distribution over inputs, etc.). Then we verify that not only does the model behave acceptably on actual inputs, it behaves acceptably on these relaxed inputs. I don’t have a simple non-technical description of this idea, but I feel relatively confident that the intuition behind your claim is not precise enough to actually rule out this kind of solution.
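One concrete instance of “running the model on a relaxed input” is interval bound propagation: push a whole box of inputs through the network and check that the resulting output bounds are acceptable, which certifies every concrete input in the box at once. A minimal numpy sketch on an arbitrary two-layer ReLU network (not a claim that this particular relaxation is the intended one):

```python
# Minimal sketch of verification by relaxation: interval bound propagation.
# If the computed upper bound is below the threshold, then every concrete
# input in the box [lo, hi] is certified, not just the inputs we sampled.
import numpy as np

def interval_linear(W, b, lo, hi):
    """Sound elementwise bounds of W @ x + b over x in the box [lo, hi]."""
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def interval_relu(lo, hi):
    return np.maximum(lo, 0), np.maximum(hi, 0)

def certify_output_below(W1, b1, W2, b2, lo, hi, threshold):
    """Certify that the scalar output is < threshold for ALL x in [lo, hi]."""
    lo, hi = interval_linear(W1, b1, lo, hi)
    lo, hi = interval_relu(lo, hi)
    lo, hi = interval_linear(W2, b2, lo, hi)
    return bool(hi.max() < threshold)   # sound, though possibly loose

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
    W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)
    box_lo, box_hi = -0.1 * np.ones(4), 0.1 * np.ones(4)
    print(certify_output_below(W1, b1, W2, b2, box_lo, box_hi, threshold=10.0))
```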
Interpretability. The hope is that you can look inside the model and see the “should I do a treacherous turn?” computation. If the agent is doing any optimization to obscure this reasoning from us, then we want to look for that optimization instead. So we are only up against normal gradient descent here. Gradient descent will be pushing against whatever kinds of treacherous turns that can be detected by our machinery. But it will only push for things that improve behavior on the training distribution, so a treacherous turn can only appear where it is coupled with the machinery that is producing good predictions, giving us more traction for interpretability. In general, this looks like a tricky problem, but I don’t think you can get nearly as far as saying “a complete white-box understanding of the agent’s cognition is necessary to spot this sort of malign reasoning.” I think you should generally be skeptical when you have a vague intuitive appeal up against a universal quantifier.
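The crudest version of “looking inside the model” is a linear probe on internal activations, trained to flag a particular kind of cognition. A sketch on synthetic activations (purely illustrative, and it assumes nothing about whether real treachery-relevant features would be linearly detectable):

```python
# Sketch of a linear probe: logistic regression on internal activations,
# trained to detect a flagged kind of cognition. The activations and labels
# below are synthetic stand-ins; in practice the activations would come from
# the model and the labels from trusted examples.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 512, 2_000
feature_direction = rng.normal(size=d)          # hypothetical "flagged" feature
labels = rng.integers(0, 2, size=n)             # 1 = flagged cognition
acts = rng.normal(size=(n, d)) + np.outer(labels, feature_direction)

probe = LogisticRegression(max_iter=1_000).fit(acts[:1_500], labels[:1_500])
print("held-out detection accuracy:", probe.score(acts[1_500:], labels[1_500:]))
```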
I’m currently intuiting that there’s a broad basin of “seeming corrigible until you can perform a treacherous turn”, but not a broad basin of true corrigibility
I don’t think this argument says very much about whether there is a broad basin of attraction around corrigibility; if your agent waits for a treacherous turn and then behaves badly, that’s not in the basin. The point of being a broad basin is that “executes a treacherous turn” now needs to be a discrete thing to kick you out of the basin, it can’t be an infinitesimal degradation of performance. But we still have the question: even if “bad behavior” is a discrete event, can we actually use techniques for optimizing worst-case performance to avoid it?
I’m still confused about how each of the approaches would prevent us from eventually creating agents that spend 99% of their cognition acting corrigibly, while spending a well-hidden 1% of their cognition trying to sniff out whether they’re in the test distribution, and executing a treacherous turn if so. The way I understand your summaries:
Adversarial training makes it harder for the malignant 1% to do anything dangerous, since it makes it harder to tell whether it’s in the test distribution. But it does little to prevent a malignant 1% from very conservatively trying (and usually failing) to ascertain whether it’s in the test distribution.
Interpretability makes it harder for the malignant 1% to be hidden, but doesn’t prevent malignant cognition it can’t detect. (My reading of “Gradient descent will be pushing against whatever kinds of treacherous turns that can be detected by our machinery. But it will only push for things that improve behavior on the training distribution, so a treacherous turn can only appear where it is coupled with the machinery that is producing good predictions, giving us more traction for interpretability.” is completely consistent with this.)
I didn’t understand what you wrote about verification well enough to have anything to say.
It does prima facie seem that an agent spending 100% of its cognition being competent and corrigible achieves higher reward than an agent that only spends 99% of its cognition being competent and corrigible, and 1% of its cognition trying (and almost always failing) to see if it’s in the test distribution. Is your model that gradient descent will favor the former agent over the latter agent, making the 99%/1% agent unlikely to arise (and perhaps quantifiably so)?
The inner process may nevertheless use TDT if TDT doesn’t diverge from CDT on the training distribution, or it might learn to use TDT but “look nice” so that it doesn’t get selected against.
This was what I was intending to convey in assumption 3.
Adversarial training makes it harder for the malignant 1% to do anything dangerous, since it makes it harder to tell whether it’s in the test distribution. But it does little to prevent a malignant 1% from very conservatively trying (and usually failing) to ascertain whether it’s in the test distribution.
The question is roughly: can you be good enough at finding problematic inputs that, if a problematic input exists, it is much more likely to be found during the adversarial training step than at deployment time? If so, then a model with high reward probably won’t do anything bad at deployment time (exactly how good you need to be depends on exactly how overwhelmingly strong the prior for malign behavior is).
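One way to make that quantitative dependence concrete (my framing, with made-up numbers): let q be the prior probability that training produces a model with some problematic input, p_adv the chance the adversary finds such an input during training given that one exists (in which case the model is rejected and retrained), and p_dep the chance deployment actually hits one given that it exists.

```python
# Back-of-the-envelope: how good the adversary needs to be, as a function of
# how strong the prior for malign behavior is. All numbers are made up.
def p_bad_at_deployment(q, p_adv, p_dep):
    survives_bad = q * (1 - p_adv)        # malign model slips past training
    survives_good = (1 - q)               # benign model always survives
    p_bad_given_deployed = survives_bad / (survives_bad + survives_good)
    return p_bad_given_deployed * p_dep

for q in (0.1, 0.9):
    for p_adv in (0.9, 0.999):
        risk = p_bad_at_deployment(q, p_adv, p_dep=0.5)
        print(f"prior q={q}, adversary strength p_adv={p_adv}: "
              f"P(bad behavior at deployment) ~ {risk:.4f}")
```

The stronger the prior q for malign behavior, the closer p_adv has to be to 1 before the deployment-time risk gets small, which is the sense in which “how good you need to be” depends on the prior.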
Being good enough at finding problematic inputs likely involves introducing some asymmetry between the adversary’s task and the test-time task. Examples of possible asymmetries include allowing the adversary to produce different kinds of inputs (e.g. setting activations arbitrarily), letting the adversary poke around inside the model to find bad cases (e.g. doing gradient descent, combining with interpretability, or backwards chaining through the model), or giving the adversary a much weaker job (e.g. only asking them to produce an input on which the model thinks anything remotely suspicious as judged by interpretability, or finding any conditions under which the model fails to instantly shut down a malicious scheme that it previously initiated).
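A sketch of the “let the adversary do gradient descent” asymmetry: treat some scalar signal from the model (here an arbitrary frozen toy network standing in for an interpretability-derived suspicion score; entirely hypothetical) as differentiable in the input, and let the adversary climb it directly rather than guessing inputs blindly.

```python
# Sketch of a white-box adversary: gradient ascent on the input to maximize
# a "suspicion" signal. The frozen toy network below is a hypothetical
# stand-in for whatever score the interpretability machinery exposes.
import torch

torch.manual_seed(0)
suspicion_head = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)
for p in suspicion_head.parameters():
    p.requires_grad_(False)                  # the adversary only edits the input

x = torch.zeros(16, requires_grad=True)      # candidate "bad case", optimized below
opt = torch.optim.Adam([x], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = -suspicion_head(x).sum()          # ascend the suspicion score
    loss.backward()
    opt.step()

print("max suspicion found by gradient search:", suspicion_head(x).item())
```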
(I don’t know what “prevent” means here.)
Interpretability makes it harder for the malignant 1% to be hidden, but doesn’t prevent malignant cognition it can’t detect.
I agree that if your interpretability doesn’t allow you to detect the agent thinking “Am I on the training distribution?” then it won’t work. So then the question is: can interpretability be strong enough that if the model is thinking something catastrophically dangerous, you have a very good chance of noticing it? (Again, how good you need to be depends on how overwhelming the prior for bad behavior is.)
It does prima facie seem that an agent spending 100% of its cognition being competent and corrigible achieves higher reward than an agent that only spends 99% of its cognition being competent and corrigible, and 1% of its cognition trying (and almost always failing) to see if it’s in the test distribution.
I don’t think this fact makes bad behavior unlikely on its own. But interpretability or adversarial training could exploit the fact that your model is trained by gradient descent. Regularization can strengthen the basic observation that a malign failure needs to use parts of the model that are exercised during training, making it easier to identify possible malign failures by examining behavior on the training distribution. We can potentially exploit the fact that gradient descent updates the model continuously.
I’m still confused about how each of the approaches would prevent us from eventually creating agents that spend 99% of their cognition acting corrigibly, while spending a well-hidden 1% of their cognition trying to sniff out whether they’re in the test distribution, and executing a treacherous turn if so.
(“Well-hidden” here means “well-hidden by gradient descent,” not “well-hidden by the agent,” right?)
1. If at the time of implementing ALBA, our conceptual understanding of corrigibility is the same as it is today, how doomed would you feel?
2. How are you imagining imposing an extra constraint that our model behave corrigibly on all inputs?
3. My current best guess is that your model of how to achieve corrigibility is to train the AI on a bunch of carefully labeled examples of corrigible behavior. To what extent is this accurate?
If we view the US government as a single entity, it’s not clear that it would make sense to describe it as aligned with itself, under your notion of alignment. If we consider an extremely akratic human, it’s not clear that it would make sense to describe him as aligned with himself. The more agenty a human is, the more it seems to make sense to describe him as being aligned with himself.
If an AI assistant has a perfect model of what its operator approves of and only acts according to that model, it seems like it should qualify as aligned. But if the operator is very akratic, should this AI still qualify as being aligned with the operator?
It seems to me that clear conceptual understandings of alignment, corrigibility, and benignity depend critically on a clear conceptual understanding of agency, which suggests a few things:
Significant conceptual understanding of corrigibility is at least partially blocked on conceptual progress on HRAD. (Unless you think the relevant notions of agency can mostly be formalized with ideas outside of HRAD? Or that conceptual understandings of agency are mostly irrelevant for conceptual understandings of corrigibility?)
Unless we have strong reasons to think we can impart the relevant notions of agency via labeled training data, we shouldn’t expect to be able to adequately impart corrigibility via labeled training data.
Without a clear conceptual notion of agency, we won’t have a clear enough concept of alignment or corrigibility we can use to make worst-case bounds.
I think a lot of folks who are confused about your claims about corrigibility share my intuitions around the nature of corrigibility / the difficulty of learning corrigibility from labeled data, and I think it would shed a lot of light if you shared more of your own views on this.
I don’t think a person can be described very precisely as having values, you need to do some work to get out something value-shaped. The easiest way is to combine a person with a deliberative process, and then make some assumption about the reflective equilibrium (e.g. that it’s rational). You will get different values depending on the choice of deliberative process, e.g. if I deliberate by writing I will generally get somewhat different values than if I deliberate by talking to myself. This path-dependence is starkest at the beginning and I expect it to decay towards 0. I don’t think that the difference between various forms of deliberation is likely to be too important, though prima facie it certainly could be.
Similarly for a government, there are lots of extrapolation procedures you can use and they will generally result in different values. I think we should be skeptical of forms of value learning that look like they make sense for people but not for groups of people. (That said, groups of people seem likely to have more path-dependence, so e.g. the choice of deliberative process may be more important for groups than individuals, and more generally individuals and groups can differ in degree if not in kind.)
On this perspective, (a) a human or government is not yet the kind of thing you can be aligned with, in my definition this was hidden in the word “wants,” which was maybe bad form but I was OK with because most people who think about this topic already appreciate the complexity of “wants,” (b) a human is unlikely to be aligned with anything, in the same sense that a pair of people with different values aren’t aligned with anything until they are sufficiently well-coordinated.
I don’t think that you would need to describe agency in order to build a corrigible AI. As an analogy: if you want to build an object that will be pushed in the direction of the wind, you don’t need to give the object a definition of “wind,” and you don’t even need to have a complete definition of wind yourself. It’s sufficient for the person designing/analyzing the object to know enough facts about the wind that they can design/analyze sails.
3. If you optimize a model to be corrigible in one situation, how likely is it to still be corrigible in a new situation?
I don’t care about question 3. It’s been more than 4 years since I even seriously discussed the possibility of relying on a mechanism like that, and even at that point it was not a very serious discussion.
“Don’t care” is quite strong. If you still hold this view—why don’t you care about 3? (Curious to hear from other people who basically don’t care about 3, either.)
Yeah, “don’t care” is much too strong. This comment was just meant in the context of the current discussion. I could instead say:
The kind of alignment agenda that I’m working on, and the one we’re discussing here, is not relying on this kind of generalization of corrigibility. This kind of generalization isn’t why we are talking about corrigibility.
However, I agree that there are lots of approaches to building AI that rely on some kind of generalization of corrigibility, and that studying those is interesting and I do care about how that goes.
In the context of this discussion I also would have said that I don’t care about whether honesty generalizes. But that’s also something I do care about even though it’s not particularly relevant to this agenda (because the agenda is attempting to solve alignment under considerably more pessimistic assumptions).