We (or at least, many of us) should perform a significant update towards alignment being easier than we thought because of the fact that some traditional problems are on their way towards being solved. <--- I am claiming this
Depending on what you mean by “on their way towards being solved” I’d agree. The way I’d put it is: “We didn’t know what the path to AGI would look like; in particular we didn’t know whether we’d have agency first and then world-understanding, or world-understanding first and then agency. Now we know we are getting the latter, and while that’s good in some ways and bad in other ways, it’s probably overall good. Huzzah! However, our core problems remain, and we don’t have much time left to solve them.”
(Also, fwiw, I have myself updated over the course of the last five years or so. First update was reading Paul’s stuff and related literature, which convinced me that corrigibility-based stuff would probably work. Second update was all the recent faithful CoT and control and mechinterp progress etc., plus also the LLM stuff. The LLM stuff was less than 50% of the overall update for me, but it mattered.)
I agree that current frontier models are only a “tiny bit agentic”. I expect in the next few years they will get significantly more agentic. I currently predict they will remain roughly equally corrigible. I am making this prediction on the basis of my experience with the little bit of agency current LLMs have, and I think we’ve seen enough to know that corrigibility probably won’t be that hard to train into a system that’s only 1-3 OOMs of compute more capable. Do you predict the same thing as me here, or something different?
Is that a testable-prior-to-the-apocalypse prediction? i.e. does your model diverge from mine prior to some point of no return? I suspect not. I’m interested in seeing if we can make some bets on this though; if we can, great; if we can’t, then at least we can avoid future disagreements about who should update.
There’s a bit of a trivial definitional problem here. If it’s easy to create a corrigible, helpful, and useful AI that allows itself to get shut down, one can always say “those aren’t the type of AIs we were worried about”. But, ultimately, if the corrigible AIs that let you shut them down are competitive with the agentic consequentialist AIs, then it’s not clear why we should care? Just create the corrigible AIs. We don’t need to create the things that you were worried about!
I don’t think that we know how to “just create the corrigible AIs.” The next step on the path to AGI seems to be to make our AIs much more agentic; I am concerned that our current methods of instilling corrigibility (basically: prompting and imperfect training) won’t work on much more agentic AIs. To be clear, I think they might work, and there’s a lot of uncertainty, but I think they probably won’t. It might be easier to see why I think this if you try to prove the opposite in detail. For example, write a mini-scenario in which we have something like AutoGPT but much better, and it’s being continually trained to accomplish diverse long-horizon tasks involving pursuing goals in challenging environments. Write down what the corrigibility-related parts of its prompt and/or constitution or whatever are, and write down roughly what the training signal is, including the bit about RLHF or whatever. Then imagine that said system is mildly superhuman across the board (and vastly superhuman in some domains) and is being asked to design its own successor. (I’m trying to do this myself as we speak. Again I feel like it could work out OK, but it could be disastrous. I think writing some good and bad scenarios will help me decide where to put my probability mass.)
I think this was a helpful thing to say. To be clear: I am in ~full agreement with the reasons you gave here, regarding why current LLM behavior provides evidence that the “world isn’t as grim as it could have been”. For brevity, and in part due to laziness, I omitted these more concrete mechanisms for why I think the current evidence is good news from a technical alignment perspective. But ultimately I agree with the mechanisms you offered, and I’m glad you spelled them out more clearly.
Is that a testable-prior-to-the-apocalypse prediction? i.e. does your model diverge from mine prior to some point of no return? I suspect not. I’m interested in seeing if we can make some bets on this though; if we can, great; if we can’t, then at least we can avoid future disagreements about who should update.
I’ll note that my prediction was for the next “few years” and the 1-3 OOMs of compute. It seems your timelines are even shorter than I thought if you think the apocalypse, or point of no return, will happen before that point.
With timelines that short, I think betting is overrated. From my perspective, I’d prefer to simply wait and become vindicated as the world does not end in the meantime. However, I acknowledge that simply waiting is not very satisfying from your perspective, as you want to show the world that you’re right before the catastrophe. If you have any suggestions for what we can bet on that would resolve in such a short period of time, I’m happy to hear them.
It’s not about timelines, it’s about capabilities. My tentative prediction is that the sole remaining major bottleneck/gap between current systems and dangerous powerful agent AGIs is ‘agency skills.’ So, skills relevant to being an agent, i.e. ability to autonomously work towards ambitious goals in diverse challenging environments over long periods. I don’t know how many years it’s going to take to get to human-level in agency skills, but I fear that corrigibility problems won’t be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic. Thus, whether AGI is reached next year or in 2030, we’ll face the problem of corrigibility breakdowns only really happening right around the time when it’s too late or almost too late.
I don’t know how many years it’s going to take to get to human-level in agency skills, but I fear that corrigibility problems won’t be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic.
How sharp do you expect this cutoff to be between systems that are subhuman at agency vs. systems that are “getting really agentic” and therefore dangerous? I’m imagining a relatively gradual and incremental increase in agency over the next 4 years, with the corrigibility of the systems remaining roughly constant (according to all observable evidence). It’s possible that your model looks like:
In years 1-3, systems will gradually get more agentic, and will remain ~corrigible, but then
In year 4, systems will reach human-level agency, at which point they will be dangerous and powerful, and able to overthrow humanity
Whereas my model looks more like:
In years 1-4 systems will get gradually more agentic
There isn’t a clear, sharp, and discrete point at which their agency reaches or surpasses human-level
They will remain ~corrigible throughout the entire development, even after it’s clear they’ve surpassed human-level agency (which, to be clear, might take longer than 4 years)
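To make the contrast between these two models concrete, here is a toy sketch of why a bet is hard to set up. It is only an illustration under loud assumptions: the 4-year horizon and the year-4 cutoff come from the lists above, but the names (`YearForecast`, `sharp_cutoff_model`, `gradual_model`, `first_divergence`) and the reduction of "corrigible" to a yes/no observation per year are hypothetical simplifications, not anything either commenter proposed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class YearForecast:
    year: int
    corrigible: bool  # hypothetical: do systems look ~corrigible on observable evidence?


def sharp_cutoff_model(horizon: int = 4, cutoff_year: int = 4) -> List[YearForecast]:
    """Systems stay ~corrigible until agency hits human level in `cutoff_year`."""
    return [YearForecast(y, corrigible=(y < cutoff_year)) for y in range(1, horizon + 1)]


def gradual_model(horizon: int = 4) -> List[YearForecast]:
    """Agency rises gradually and systems stay ~corrigible throughout."""
    return [YearForecast(y, corrigible=True) for y in range(1, horizon + 1)]


def first_divergence(a: List[YearForecast], b: List[YearForecast]) -> Optional[int]:
    """Return the first year in which the two forecasts disagree, or None."""
    for fa, fb in zip(a, b):
        if fa.corrigible != fb.corrigible:
            return fa.year
    return None


# Under these toy assumptions the forecasts agree for years 1-3 and
# only come apart in year 4, the same year the first model calls dangerous.
print(first_divergence(sharp_cutoff_model(), gradual_model()))
```

In this toy framing, any bet that resolves on observed corrigibility before the cutoff year pays out identically under both models, which is the worry about the prediction not being testable before the point of no return.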
Good question. I want to think about this more, I don’t have a ready answer. I have a lot of uncertainty about how long it’ll take to get to human-level agency skills; it could be this year, it could be five more years, it could be anything in between. Could even be longer than five more years though I’m skeptical. The longer it takes, the more likely it is that we’ll have a significant period of kinda-agentic-but-not-super-agentic systems, and so then that raises the question of what we should expect to see re: corrigibility in that case. Idk. Would be interesting to discuss sometime and maybe place some bets!