It’s a little unclear what “orthogonal” means for processes; here I give a more precise statement. Given a process for developing an intelligent, goal-directed system, my version of the process orthogonality thesis states that:
The overall process involves two (possibly simultaneous) subprocesses: one which builds intelligence into the system, and one which builds goals into the system.
The former subprocess could vary greatly in how intelligent it makes the system, and the latter could vary greatly in which goals it specifies, without either variation significantly affecting the other subprocess's performance.
While I agree with your analysis that a strong version of this sort of process orthogonality thesis is wrong—in the sense that your agent has to learn a goal that actually results in good training behavior—I do think it’s very possible for capabilities to progress faster than alignment as in the 2D robustness picture. Also, if that were not the case, I think it would knock out a lot of the argument for why inner alignment is likely to be a problem, suggesting that at least some version of a process orthogonality thesis is pretty important.
These days I’m confused about why it took me so long to understand this outer/inner alignment distinction, but I guess that’s a good lesson about hindsight bias.
In terms of assessing the counterfactual impact of Risks from Learned Optimization, I’m curious to what extent you feel like your understanding here is directly downstream of the paper or whether you think you resolved your confusions mostly independently—and if you do think it’s downstream of the paper, I’m curious whether/at what point you think you would have eventually figured it out regardless.
Re counterfactual impact: the biggest shift came from talking to Nate at BAGI, after which I wrote this post on disentangling arguments about AI risk, in which I identified the “target loading problem”. This seems roughly equivalent to inner alignment, but was meant to avoid the difficulties of defining an “inner optimiser”. At some subsequent point I changed my mind and decided it was better to focus on inner optimisers; I think this shift was probably catalysed by your paper, or by conversations with Vlad which were downstream of it. The paper definitely gave me better terminology to mentally latch onto, which helped steer my thoughts in more productive directions.
Re 2D robustness: this is a good point. So maybe we could say that the process orthogonality thesis is somewhat true, in a “spherical cow” sense. There are some interventions that only affect capabilities, or only alignment. And it’s sometimes useful to think of alignment as being all about the reward function, and capabilities as involving everything else. But as with all spherical cow models, this breaks down when you look at it closely, e.g. when you’re thinking about the “curriculum” which an agent needs to undergo to become generally intelligent. Does this seem reasonable?
Also, I think that many other people believe in the process orthogonality thesis to a greater extent than I do. So even if we don’t agree about how much it breaks down, if this is a convenient axis which points in roughly the direction on which we disagree, then I’d still be happy about that.