To claim that you have removed optimization pressure to be unaligned
The goal is to remove the optimization pressure to be misaligned, and that’s the reason you might hope for the system to be aligned. Where did I make the stronger claim you’re attributing to me?
I’m happy to edit the offending text; I often write sloppily. But Rohin is summarizing the part of this post where I wrote: “The argument for alignment isn’t that ‘a system made of aligned neurons is aligned.’ Unalignment isn’t a thing that magically happens; it’s the result of specific optimization pressures in the system that create trouble. My goal is to (a) first construct weaker agents who aren’t internally doing problematic optimization, (b) put them together in a way that improves capability without doing other problematic optimization, (c) iterate that process.” So in this case it seems clear that I was stating a goal.
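To make the shape of that (a)/(b)/(c) loop concrete, here is a minimal, purely illustrative Python sketch. The task (summing a list), the toy base agent, and the pass-through distill step are assumptions introduced for this example, not anything from Paul’s actual proposal; the only point is the structure: a weak base agent, a composition step that gains capability by delegation rather than by new internal optimization, and iteration of that step.

```python
from typing import Callable, List

# An "agent" here is just a function that answers one kind of question:
# "what is the sum of this list of integers?"
Agent = Callable[[List[int]], int]

def base_agent(xs: List[int]) -> int:
    # (a) A deliberately weak agent: it only handles tiny inputs directly.
    if len(xs) > 2:
        raise ValueError("too hard for the base agent on its own")
    return sum(xs)

def amplify(agent: Agent) -> Agent:
    # (b) One step of composition: the composite splits the task once and
    # delegates the pieces (and the final combination) to copies of the
    # existing agent, rather than doing any new optimization itself.
    def composite(xs: List[int]) -> int:
        if len(xs) <= 2:
            return agent(xs)
        mid = len(xs) // 2
        return agent([agent(xs[:mid]), agent(xs[mid:])])
    return composite

def distill(composite: Agent) -> Agent:
    # In a real system this would train a fast model to imitate the slow
    # composite; here it is just a pass-through stand-in.
    return composite

def iterate(rounds: int) -> Agent:
    agent: Agent = base_agent            # (a)
    for _ in range(rounds):              # (c) iterate the process
        agent = distill(amplify(agent))  # (b) amplify, then compress
    return agent

print(iterate(3)([1, 2, 3, 4, 5, 6, 7, 8]))  # prints 36
```

In this toy, each round roughly doubles the input size the agent can handle, while every piece of actual work is still performed by a copy of the weaker agent; that is the sense in which capability is meant to grow without introducing new problematic optimization.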
Even among normal humans there are principal-agent problems.
In the scenario of a human principal delegating to a human agent, there is a huge amount of optimization pressure to be misaligned: all of the agent’s evolutionary history and cognition. So I don’t think the word “even” belongs here.
There is optimization pressure to be unaligned; of course there is!
I agree that there are many possible malign optimization pressures, e.g.: (i) the optimization done deliberately by those humans as part of being competitive, which they may not be able to align; (ii) “memetic” selection amongst patterns propagating through the humans; (iii) malign consequentialism that arises sometimes in the human policy (either randomly or in some situations). I’ve written about these, and it should be obvious that they are issues I think a lot about, am struggling with, and believe there are plausible approaches to dealing with.
(I think it would be defensible for you to say something like “I don’t believe that Paul’s writings give any real reason for optimism on these points and the fact that he finds them reassuring seems to indicate wishful thinking,” and if that’s a fair description of your position then we can leave it at that.)
You have not written anything that seems to me to be an honest attempt to seriously grapple with any of those issues. There’s lots of affirming the consequent and appeals to ‘of course it works this way’ without anything backing it up.
Providing context for readers: here is a post someone wrote a few years ago about issues (ii)+(iii), which I assume is the kind of thing Czynski has in mind. The most relevant things I’ve written on issues (ii)+(iii) are Universality and consequentialism within HCH, and prior to that Security amplification and Reliability amplification.
I remember that post and agree with the point from its comments section that any solution must be resilient against an adversarial environment.
Which is the core problem of the distillation and amplification approach: it fundamentally assumes it is possible to engineer the environment to be non-adversarial. Every reply I’ve seen you write to this point, which has come up in many guises over the years, has seemed to dodge the question. I therefore don’t trust you to think clearly enough to not destroy the world.
This topic is more technical than you’re treating it; I think you have probably misunderstood things, but the combative stance you’ve taken makes it impossible to identify what the misunderstandings are.
I am pretty sure I haven’t. Paul’s statements on the subject haven’t substantively changed in years and still look just like his inline responses in the body of the OP here, i.e. handwaving and appeals to common sense.
I may not be treating this as sufficiently technical, granted, but neither is he. And unlike me, he does this as his day job; his failure to deal with it in a technical way is far less defensible.
As just one example, does this not count?
No, it’s only tangentially related. It describes the problem in some sense, but nonspecifically and without meaningful work to attack it.