There’s a nice summary of Eliezer’s post in Rohin Shah’s Alignment Newsletter #7 (which I broke up into a numbered list for clarity), along with Rohin’s response:
A list of challenges faced by iterated distillation and amplification.
First, a collection of aligned agents interacting does not necessarily lead to aligned behavior. (Paul’s response: That’s not the reason for optimism; it’s more that there is no optimization pressure to be unaligned.)
Second, it’s unclear whether, even with high-bandwidth oversight, a collection of agents could reach arbitrary levels of capability. For example, how could agents with an understanding of arithmetic invent Hessian-free optimization? (Paul’s response: This is an empirical disagreement; hopefully it can be resolved with experiments.)
Third, while it is true that exact imitation of a human would avoid the issues of RL, it is harder to create exact imitation than to create superintelligence, and as soon as you have any imperfection in your imitation of a human, you very quickly get back the problems of RL. (Paul’s response: He’s not aiming for exact imitation; he wants to deal with this problem by having a strong overseer, a.k.a. informed oversight, and by having techniques that optimize worst-case performance.)
Fourth, since Paul wants to use big unaligned neural nets to imitate humans, we have to worry about the possibility of adversarial behavior. He has suggested using large ensembles of agents and detecting and pruning the ones that are adversarial. However, this would require millions of samples per unaligned agent, which is prohibitively expensive. (Paul’s response: He’s no longer optimistic about ensembles and instead prefers the techniques in this post, but he could see ways of reducing the sample complexity further.) [A toy sketch of this ensemble-and-prune idea appears just after this summary.]
My opinion: Of all of these, I’m most worried about the second and third problems. I definitely have a weak intuition that there are many important tasks that we care about that can’t easily be decomposed, but I’m optimistic that we can find out with experiments. For the point about having to train a by-default unaligned neural net to imitate aligned agents, I’m somewhat optimistic about informed oversight with strong interpretability techniques, but I become a lot less optimistic if we think that won’t be enough and need to use other techniques like verification, which seem unlikely to scale that far. In any case, I’d recommend reading this post for a good explanation of common critiques of IDA.
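To make the sample-cost worry in the fourth item concrete, here is a toy simulation of the ensemble-and-prune idea. The majority-vote consensus rule, the flag threshold, and the defection rate are illustrative assumptions, not Paul’s actual proposal (which he says above he is no longer optimistic about); the only point is that an ensemble member which defects with probability p costs on the order of 1/p queries per detection.

```python
# Toy simulation of the "ensemble and prune" idea: query many imitators,
# flag members that deviate from the majority answer, and drop members
# once they accumulate enough flags. The consensus rule, flag threshold,
# and defection rate below are illustrative assumptions only.

import random
from collections import Counter

def run_and_prune(agents, queries, flag_limit=3):
    """Return the indices of agents that survive pruning."""
    flags = [0] * len(agents)
    active = set(range(len(agents)))
    for q in queries:
        answers = {i: agents[i](q) for i in active}
        majority = Counter(answers.values()).most_common(1)[0][0]
        for i, a in answers.items():
            if a != majority:
                flags[i] += 1
                if flags[i] >= flag_limit:
                    active.discard(i)
    return active

honest = lambda q: "safe"

def rare_defector(q, rate=1e-3):
    # Defects only occasionally, so each flag costs roughly 1/rate queries.
    return "unsafe" if random.random() < rate else "safe"

agents = [honest] * 9 + [rare_defector]
survivors = run_and_prune(agents, queries=range(10_000))
print(f"{len(survivors)} of {len(agents)} agents survive pruning")
```

With rate = 1e-3 the single defector already needs a few thousand queries just to accumulate three flags, and the cost grows as defection gets rarer and the ensemble gets larger; that is the shape of the sample-complexity objection in the summary above.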
(Paul’s response: That’s not the reason for optimism; it’s more that there is no optimization pressure to be unaligned.)
This is the fundamental reason I don’t trust Paul to not destroy the world. There is optimization pressure to be unaligned; of course there is! Even among normal humans there are principal-agent problems. To claim that you have removed optimization pressure to be unaligned, despite years of argument to the contrary, is nothing less than willful blindness and self-delusion.
To claim that you have removed optimization pressure to be unaligned
The goal is to remove the optimization pressure to be misaligned, and that’s the reason you might hope for the system to be aligned. Where did I make the stronger claim you’re attributing to me?
I’m happy to edit the offending text; I often write sloppily. But Rohin is summarizing the part of this post where I wrote: “The argument for alignment isn’t that ‘a system made of aligned neurons is aligned.’ Unalignment isn’t a thing that magically happens; it’s the result of specific optimization pressures in the system that create trouble. My goal is to (a) first construct weaker agents who aren’t internally doing problematic optimization, (b) put them together in a way that improves capability without doing other problematic optimization, (c) iterate that process.” So in this case it seems clear that I was stating a goal.
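To make (a)–(c) concrete for readers, here is a minimal schematic of the amplify-then-distill loop being described. The names, the fixed two-way decomposition, and the table-lookup stand-in for supervised distillation are assumptions for exposition only, not a specification of Paul’s scheme.

```python
# Toy schematic of iterated distillation and amplification (IDA).
# Every name and training detail here is an illustrative assumption,
# not a specification of the actual proposal.

class ToyHuman:
    """Stand-in for the human overseer; decomposition and recombination are fake."""
    def decompose(self, question):
        return [f"sub-aspect {i} of: {question}" for i in range(2)]

    def combine(self, question, subanswers):
        return f"answer({question}) built from {subanswers}"

def amplify(agent, human, question):
    """Step (b): the human answers by delegating subquestions to copies
    of the current agent and combining their answers."""
    subanswers = [agent(q) for q in human.decompose(question)]
    return human.combine(question, subanswers)

def distill(agent, human, questions):
    """Steps (a)/(c): train a fresh, weak agent to imitate the amplified
    system; memorization stands in for supervised learning."""
    targets = {q: amplify(agent, human, q) for q in questions}
    return lambda q: targets.get(q, "unknown")

def iterate_ida(human, question_batches):
    """Step (c): repeat the amplify-then-distill cycle."""
    agent = lambda q: "unknown"  # initial weak agent
    for batch in question_batches:
        agent = distill(agent, human, batch)
    return agent

# Two rounds over a small set of questions.
agent = iterate_ida(ToyHuman(), [["Q1", "Q2"], ["Q1", "Q2", "Q3"]])
print(agent("Q3"))
```

The hoped-for property, as stated in (a)–(c) above, is that neither step introduces new problematic optimization; whether that survives imperfect learned imitation is exactly what the rest of this thread disputes.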
Even among normal humans there are principal-agent problems.
In the scenario of a human principal delegating to a human agent, there is a huge amount of optimization pressure to be misaligned: all of the agent’s evolutionary history and cognition. So I don’t think the word “even” belongs here.
There is optimization pressure to be unaligned; of course there is!
I agree that there are many possible malign optimization pressures, e.g.: (i) the optimization done deliberately by those humans as part of being competitive, which they may not be able to align, (ii) “memetic” selection amongst patterns propagating through the humans, (iii) malign consequentialism that arises sometimes in the human policy (either randomly or in some situations). I’ve written about these, and it should be obvious that I think a lot about them, am struggling with them, and believe there are plausible approaches for dealing with them.
(I think it would be defensible for you to say something like “I don’t believe that Paul’s writings give any real reason for optimism on these points and the fact that he finds them reassuring seems to indicate wishful thinking,” and if that’s a fair description of your position then we can leave it at that.)
You have not written anything that seems to me to be an honest attempt to seriously grapple with any of those issues. There’s lots of affirming the consequent and appeals to ‘of course it works this way’ without anything backing it up.
Providing context for readers: here is a post someone wrote a few years ago about issues (ii)+(iii), which I assume is the kind of thing Czynski has in mind. The most relevant things I’ve written on issues (ii)+(iii) are Universality and consequentialism within HCH and, prior to that, Security amplification and Reliability amplification.
I remember that post and agree with the point from its comments section that any solution must be resilient against an adversarial environment.
Which is the core problem of the distillation and amplification approach: It fundamentally assumes it is possible to engineer the environment to be non-adversarial. Every reply I’ve seen you write to this point, which has come in many guises over the years, seemed to dodge the question. I therefore don’t trust you to think clearly enough to not destroy the world.
This topic is more technical than you’re treating it; I think you have probably misunderstood things, but the combative stance you’ve taken makes it impossible to identify what the misunderstandings are.
I am pretty sure I haven’t. Paul’s statements on the subject haven’t substantively changed in years and still look just like his inline responses in the body of the OP here, i.e. handwaving and appeals to common sense.
Granted, I may not be treating this with sufficient technical care, but neither is he. And unlike me, this is his day job; failing to deal with it in a technical way is far less defensible.
As just one example, does this not count?
No, it’s only tangentially related. It describes the problem, in some sense, but nonspecifically and without meaningful work to attack it.