Any updates to your model of the socioeconomic path to aligned AI deployment? Namely:
Any changes to your median timeline until AGI, i.e., do we actually have these 9-14 years?
Still on the “figure out agency and train up an aligned AGI unilaterally” path?
Has the FTX fiasco impacted your expectation of us-in-the-future having enough money=compute to do the latter?
I expect there to be no major updates, but it seems worthwhile to keep an eye on this.
So my new main position is: which potential alignment targets (human values, corrigibility, Do What I Mean, human mimicry, etc.) are naturally expressible in an AI’s internal language (which itself probably includes a lot of mathematics) is an empirical question, and that’s the main question which determines what we should target.
I’d like to make a case that Do What I Mean will potentially turn out to be the better target than corrigibility/value learning.
Primarily, “Do What I Mean” is about translation. Entity 1 compresses some problem specification defined over Entity 1’s world-model into a short data structure — an order, a set of values, an objective function, etc. — then Entity 2 uses some algorithm to decompress that data structure and translate it into a problem specification defined over Entity 2’s world-model. The problem of alignment via Do What I Mean, then, is the problem of ensuring that Entity 2 (which we’ll assume to be bigger) decompresses a specific type of compressed data structure using the same algorithm that was used to compress it in the first place — i.e., interprets orders the way they were intended/acts on our actual values and not the misspecified proxy/extrapolates our values from the crude objective function/etc.
This potentially has the nice property of collapsing the problem of alignment to the problem of ontology translation, and so unifying the problem of interpreting an NN and the problem of aligning an NN into the same problem.
In addition, it’s probably a natural concept, in the sense that “how do I map this high-level description onto a lower-level model” seems like a problem any advanced agent would be running into all the time. There’ll almost definitely be concepts and algorithms about that in the AI’s world-model, and they may be easily repluggable.
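To make the translation framing concrete, here is a minimal toy sketch in Python. Everything in it (the dictionaries, the function names, the "tidy the lab" example) is invented purely for illustration, not a real proposal or implementation; the point is just that the "DWIM" decompressor inverts the original compression and so recovers the implicit constraints, while the "proxy" decompressor acts only on the literal stated goal.

```python
# Toy sketch only: every name and dictionary below is invented for illustration.

# Entity 1 (the human) has a rich problem spec in its own world-model...
HUMAN_SPEC = {
    "goal": "equipment stored, surfaces clean",
    "implicit_constraints": ["preserve ongoing experiments"],
}

def compress(spec: dict) -> str:
    """Entity 1 compresses the rich spec into a short order."""
    return "tidy the lab"

# Entity 2 (the AI) only ever sees the short order, and must decompress it
# into a plan over its own, lower-level world-model.
AI_WORLD_MODEL = {
    "equipment stored, surfaces clean": ["put loose items in cabinets", "wipe benches"],
    "preserve ongoing experiments": ["leave labelled samples where they are"],
}

# A shallow reading of the order: just the literal stated goal (the proxy).
SHALLOW_READING = {"tidy the lab": "equipment stored, surfaces clean"}

# The AI's model of *how humans compress* tasks into orders -- the piece DWIM
# needs to get right, so that decompression inverts the original compression.
MODEL_OF_HUMAN_COMPRESSION = {
    "tidy the lab": {
        "goal": "equipment stored, surfaces clean",
        "implicit_constraints": ["preserve ongoing experiments"],
    },
}

def decompress_proxy(order: str) -> list[str]:
    """Misaligned decompression: act on the literal stated goal only."""
    return list(AI_WORLD_MODEL[SHALLOW_READING[order]])

def decompress_dwim(order: str) -> list[str]:
    """Aligned decompression: recover the full intended spec, then plan over it."""
    spec = MODEL_OF_HUMAN_COMPRESSION[order]
    plan = list(AI_WORLD_MODEL[spec["goal"]])
    for constraint in spec["implicit_constraints"]:
        plan += AI_WORLD_MODEL[constraint]
    return plan

order = compress(HUMAN_SPEC)
print("proxy plan:", decompress_proxy(order))  # misses the implicit constraints
print("DWIM plan: ", decompress_dwim(order))   # recovers them
```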
Any changes to your median timeline until AGI, i.e., do we actually have these 9-14 years?
Here’s a dump of my current timeline models. (I actually originally drafted this as part of the post, then cut it.)
My current intuition is that deep learning is approximately one transformer-level paradigm shift away from human-level AGI. (And, obviously, once we have human-level AGI things foom relatively quickly.) That comes from an intuitive extrapolation: if something improved on the models of the last 2-3 years by about as much as those models improved on pre-transformer models, then I’d expect it to be at least human-level. That does not mean that nets will get to human level immediately after that transformer-level shift comes along; e.g. with transformers it still took ~2-3 years before transformer models really started to look impressive.
So the most important update from deep learning over the past year has been the lack of any transformer-level paradigm shift in algorithms, architectures, etc.
There are of course other potential paths to human-level (or higher) which don’t route through a transformer-level paradigm shift in deep learning. One obvious path is to just keep scaling; I expect we’ll see a paradigm shift well before scaling alone achieves human-level AGI (and this seems even more likely post-Chinchilla). The main other path is that somebody wires together a bunch of GPT-style AGIs in such a way that they achieve greater intelligence by talking to each other (sort of like how humans took off via cultural accumulation); I don’t think that’s very likely to happen near-term, but I do think it’s the main path by which 5-year timelines would happen without a paradigm shift. Call it maybe 5-10%. Finally, of course, there’s always the “unknown unknowns” possibility.
How long until the next shift?
Back around 2014 or 2015, I was visiting my alma mater, and a professor asked me what I thought about the deep learning wave. I said it looked pretty much like all the previous ML/AI hype cycles: everyone would be very excited for a while and make grand claims, but the algorithms would be super finicky and unreliable. Eventually the hype would die down, and we’d go into another AI winter. About ten years after the start of the wave someone would show that the method (in this case large CNNs) was equivalent to some Bayesian model, and then it would make sense when it did/didn’t work, and it would join the standard toolbox of workhorse ML algorithms. Eventually some new paradigm would come along, and the hype cycle would start again.
… and in hindsight, I think that was basically correct up until transformers came along around 2017. Pre-transformer nets were indeed very finicky, and were indeed shown equivalent to some Bayesian model about ten years after the excitement started, at which point we had a much better idea of what they did and did not do well. The big difference from previous ML/AI hype waves was that the next paradigm—transformers—came along before the previous wave had died out. We skipped an AI winter; the paradigm shift came in ~5 years rather than 10-15.
… and now it’s been about five years since transformers came along. Just naively extrapolating from the two most recent data points says it’s time for the next shift. And we haven’t seen that shift yet. (Yes, diffusion models came along, but those don’t seem likely to become a transformer-level paradigm shift; they don’t open up whole new classes of applications in the same way.)
So on the one hand, I’m definitely nervous that the next shift is imminent. On the other hand, it’s already very slightly on the late side, and if another 1-2 years go by I’ll update quite a bit toward that shift taking much longer.
Also, on an inside view, I expect the next shift to be quite a bit more difficult than the transformers shift. (I don’t plan to discuss the reasons for that, because spelling out exactly which technical hurdles need to be cleared in order to get nets to human level is exactly the sort of thing which potentially accelerates the shift.) That inside view is a big part of why my timelines last year were 10-15 years, and not 5. The other main reasons my timelines were 10-15 years were regression to the mean (i.e. the transformers paradigm shift came along very unusually quickly, and it was only one data point), general hype-wariness, and an intuitive sense that unknown unknowns in this case will tend to push toward longer timelines rather than shorter on net.
Put all that together, and there’s a big blob of probability mass on ~5 year timelines; call that 20-30% or so. But if we get through the next couple years without a transformer-level paradigm shift, and without a bunch of wired-together GPTs spontaneously taking off, then timelines get a fair bit longer, and that’s where my median world is.
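As a purely illustrative sketch of why the median lands well past the ~5-year blob: the specific distribution shapes below are made up, and only the rough 20-30% figure comes from the estimate above.

```python
# Illustrative only: a made-up mixture consistent with ~25% mass on short
# timelines and the rest spread over longer horizons.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

short = rng.uniform(3, 7, size=n)    # ~25% of worlds: the next shift is imminent
long_ = rng.uniform(10, 25, size=n)  # ~75% of worlds: the shift takes much longer (shape made up)

timelines = np.where(rng.random(n) < 0.25, short, long_)
print(f"median timeline:   {np.median(timelines):.1f} years")  # lands in the long blob
print(f"P(within 7 years): {np.mean(timelines <= 7):.0%}")     # ~25%, by construction
```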
Still on the “figure out agency and train up an aligned AGI unilaterally” path?
“Train up an AGI unilaterally” doesn’t quite carve my plans at the joints.
One of the most common ways I see people fail to have any effect at all is to think in terms of “we”. They come up with plans which “we” could follow, for some “we” which is not in fact going to follow that plan. And then they take political-flavored actions which symbolically promote the plan, but are not in fact going to result in “we” implementing the plan. (And also, usually, the “we” in question is too dysfunctional as a group to implement the plan even if all the individuals wanted to, because that is how approximately 100% of organizations of more than 10 people operate.) In cognitive terms, the plan is pretending that lots of other people’s actions are choosable/controllable, when in fact those other people’s actions are not choosable/controllable, at least relative to the planner’s actual capabilities.
The simplest and most robust counter to this failure mode is to always make unilateral plans.
But to counter the failure mode, plans don’t need to be completely unilateral. They can involve other people doing things which those other people will actually predictably do. So, for instance, maybe I’ll write a paper about natural abstractions in hopes of nerd-sniping some complex systems theorists to further develop the theory. That’s fine; the actions which I need to counterfact over in order for that plan to work are actions which I can in fact take unilaterally (i.e. write a paper). Other than that, I’m just relying on other people acting in ways in which they’ll predictably act anyway.
Point is: in order for a plan to be a “real plan” (as opposed to e.g. a fabricated option, or a de-facto applause light), all of the actions which the plan treats as “under the planner’s control” must be actions which can be taken unilaterally. Any non-unilateral actions need to be things which we actually expect people to do by default, not things we wish they would do.
Coming back to the question: my plans certainly do not live in some children’s fantasy world where one or more major AI labs magically become the least-dysfunctional multiple-hundred-person organizations on the planet, and then we all build an aligned AGI via the magic of Friendship and Cooperation. The realistic assumption is that large organizations are mostly carried wherever the memetic waves drift. Now, the memetic waves may drift in a good direction—if e.g. the field of alignment does indeed converge to a paradigm around decoding the internal language of nets and expressing our targets in that language, then there’s a strong chance the major labs follow that tide, and do a lot of useful work. And I do unilaterally have nonzero ability to steer that memetic drift—for instance, by creating public knowledge of various useful lines of alignment research converging, or by training lots of competent people.
That’s the sort of non-unilaterality which I’m fine having in my plans: relying on other people to behave in realistic ways, conditional on me doing things which I can actually unilaterally do.
Has the FTX fiasco impacted your expectation of us-in-the-future having enough money=compute to do the latter?
Basically no.
I basically buy your argument for Do What I Mean, though there’s still the question of how safe a target DWIM is.