My guess at part of your views:
1. There’s ~one natural structure for capabilities, such that (assuming we don’t have deep mastery of intelligence) nearly anything we build that is an AGI will have that structure.
2. Given this, there will be a point where an AI system switches from everything-muddled-in-a-soup to clean capabilities and muddled alignment (the “sharp left turn”).
I basically agree that the plans I consider don’t engage much with this sort of scenario. This is mostly because I don’t expect this scenario and so I’m trying to solve the alignment problem in the worlds I do expect.
(For the reader: I am not saying “we’re screwed if the sharp left turn happens so we should ignore it”, I am saying that the sharp left turn is unlikely.)
A consequence is that I care a lot about knowing whether the sharp left turn is actually likely. Unfortunately, so far I have found it pretty hard to understand why exactly you and Eliezer find it so likely. I think the current SOTA on this disagreement is this post, and I’d be keen on more work along those lines.
Some commentary on the conversation with me:
Imaginary Richard/Rohin: You seem awfully confident in this sharp left turn thing. And that the goals it was trained for won’t just generalize. This seems characteristically overconfident.
This isn’t exactly wrong—I do think you are overconfident—but I wouldn’t say something like “characteristically overconfident” unless you were advocating for some particular decision right now which depended on others deferring to your high credences in something. It just doesn’t seem useful to argue this point most of the time and it doesn’t feature much in my reasoning.
For instance, observe that natural selection didn’t try to get the inner optimizer to be aligned with inclusive genetic fitness at all. For all we know, a small amount of cleverness in exposing inner-misaligned behavior to the gradients will just be enough to fix the problem.
Good description of why I don’t find the evolution analogy compelling for “sharp left turn is very likely”.
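To make the quoted idea a bit more concrete, here is a minimal toy sketch of what “exposing inner-misaligned behavior to the gradients” could look like, assuming we already had some detector that flags the misaligned behavior; the detector below is a trivial stand-in, and every name in it is illustrative rather than an actual proposal:

```python
import torch
import torch.nn as nn

# Toy model and optimizer; all of this is a hypothetical stand-in.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
task_loss_fn = nn.CrossEntropyLoss()

def misalignment_score(logits: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for a detector of inner-misaligned behavior.

    Here it just measures probability mass on an action we have decided is
    "bad" (action 0); a real detector would be far harder to build.
    """
    return torch.softmax(logits, dim=-1)[:, 0].mean()

def training_step(x: torch.Tensor, y: torch.Tensor, penalty_weight: float = 1.0) -> float:
    logits = model(x)
    # The detector's signal is added to the task loss, so the gradient pushes
    # away from the flagged behavior rather than only optimizing the task.
    loss = task_loss_fn(logits, y) + penalty_weight * misalignment_score(logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random data.
x = torch.randn(32, 16)
y = torch.randint(0, 4, (32,))
training_step(x, y)
```

Whether anything in this vein actually helps hinges, of course, on the detector catching the behaviors that matter.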
And even if not that-exact-thing, then there are all sorts of ways that some other thing could come out of left field and just render the problem easy. So I don’t see why you’re worried.
I’d phrase it as “I don’t see why you think [sharp left turn leading to failures of generalization of alignment that we can’t notice and fix before we’re dead] is very likely to happen”. I’m worried too!
Nate: My model says that the hard problem rears its ugly head by default, in a pretty robust way. Clever ideas might suffice to subvert the hard problem (though my guess is that we need something more like understanding and mastery, rather than just a few clever ideas). I have considered an array of clever ideas that look to me like they would predictably-to-me fail to solve the problems, and I admit that my guess is that you’re putting most of your hope on small clever ideas that I can already see would fail. But perhaps you have ideas that I do not. Do you yourself have any specific ideas for tackling the hard problem?
Imaginary Richard/Rohin: Train it, while being aware of inner alignment issues, and hope for the best.
I think if you define the hard problem to be the sharp left turn as described at the beginning of my comment, then my response is “no, I don’t usually focus on that problem” (which I would defend as the correct action to take).
Also, if I had to summarize the plan in a sentence, it would be “empower your oversight process as much as possible to detect problems in the AI system you’re training (both in the outcomes it produces and the reasoning process it employs)”.
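A purely structural sketch of that one-sentence plan, with trivial stand-ins for what would need to be a much stronger oversight process (every name here is hypothetical):

```python
# Toy structural sketch (all names hypothetical): oversight inspects both the
# outcome an AI system produces and whatever reasoning trace we can surface,
# and turns anything it flags into training signal or escalation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Episode:
    task: str
    outcome: str
    reasoning_trace: str  # e.g. chain-of-thought, tool calls, interpretability readouts

def overseer(episode: Episode) -> list[str]:
    """Trivial stand-in for a much stronger oversight process."""
    problems = []
    if "deceive" in episode.reasoning_trace:
        problems.append("suspicious reasoning")
    if "harm" in episode.outcome:
        problems.append("bad outcome")
    return problems

def oversight_loop(
    run_episode: Callable[[], Episode],
    train_against: Callable[[Episode, list[str]], None],
) -> list[str]:
    episode = run_episode()
    problems = overseer(episode)
    if problems:
        # Flagged episodes become negative training signal and/or get escalated,
        # rather than being silently reinforced.
        train_against(episode, problems)
    return problems

# Example usage with trivial stand-ins.
def run_episode() -> Episode:
    return Episode(task="summarize", outcome="a short summary",
                   reasoning_trace="plan: summarize the document faithfully")

def train_against(episode: Episode, problems: list[str]) -> None:
    print(f"training against {problems} on task {episode.task!r}")

print(oversight_loop(run_episode, train_against))  # -> [] (nothing flagged here)
```

The point of the sketch is only the shape: oversight looks at both the outcome and whatever reasoning we can surface, and anything it flags feeds back into training or escalation rather than being silently reinforced.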
Nate: That doesn’t seem to me to even start to engage with the issue where the capabilities fall into an attractor and the alignment doesn’t.
Yup, agreed.
Though if you weaken claim 1, that there is ~one natural structure for capabilities, to instead say that there are many possible structures for capabilities but that the default one is deadly expected-utility (EU) maximization, then I no longer agree. It seems pretty plausible to me that stronger oversight changes the structure of your capabilities.
Perhaps sometime we can both make a list of ways to train with inner alignment issues in mind, and then share them with each other, so that you can see whether you think I’m lacking awareness of some important tool you expect to be at our disposal, and so that I can go down your list and rattle off the reasons why the proposed training tools don’t look to me like they result in alignment that is robust to sharp left turns. (Or find one that surprises me, and update.) But I don’t want to delay this post any longer, so, some other time, maybe.
I think the more relevant cruxes are the claims at the top of this comment (particularly claim 1). If I’ve understood the “sharp left turn” correctly, I agree with you that the approaches I have in mind don’t help much (unless they succeed wildly, to the point of mastering intelligence; e.g., my approaches include mechanistic interpretability, which, as you agree, could in theory get to that point even if it isn’t likely to in practice).
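For concreteness, here is a toy sketch of the kind of much simpler diagnostic that sits at the shallow end of this space, a linear probe on hidden activations; real mechanistic interpretability (reverse-engineering the computation itself) is far more demanding, and the “mastery of intelligence” level further still. The model, data, and probed “property” below are all synthetic:

```python
# Toy illustration, far short of real mechanistic interpretability: fit a linear
# probe on a model's hidden activations to test whether some property is
# linearly decodable there.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))

def hidden_activations(x: torch.Tensor) -> torch.Tensor:
    """Grab the post-ReLU hidden layer of the toy model."""
    return model[1](model[0](x))

# Synthetic "property" to look for: is the first input feature positive?
x = torch.randn(512, 16)
labels = (x[:, 0] > 0).float()

acts = hidden_activations(x).detach()
probe = nn.Linear(64, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(acts).squeeze(-1), labels)
    loss.backward()
    opt.step()

accuracy = ((probe(acts).squeeze(-1) > 0).float() == labels).float().mean()
print(f"probe accuracy: {accuracy.item():.2f}")
```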