Entropy and AI Forecasting
Until relatively recently (2018-2019?) I did not seriously entertain the possibility of AGI in our lifetime. This was a mistake, an epistemic error. A rational observer calmly and objectively considering the evidence for AI progress over the prior decades, especially in light of rapid progress in deep learning, should have come to the reasonable position that AGI within 50 years was a serious possibility (>10%).
AGI plausibly arriving in our lifetime was a reasonable position. Yet this possibility was almost universally ridiculed or ignored by academics and domain experts. One can find quite funny interviews with AI experts on LessWrong from 15 years ago. The only AI expert agreeing with the Yudkowskian view of AGI in our lifetime was Jürgen Schmidhuber. The other dozen or so AI experts dismissed the question as unknowable, or even denied the hypothetical possibility of AGI.
Yudkowsky earns a ton of Bayes points for anticipating the likely arrival of AGI in our lifetime long before deep learning took off.
**************************
We are currently experiencing a rapid AI takeoff, plausibly culminating in superintelligence by the end of this decade. I know of only two people who anticipated something like what we are seeing far ahead of time: Hans Moravec and ~~Jan Leike~~ Shane Legg*. Both forecast fairly precise dates decades before it happened, and the reasons why they thought it would happen are basically the reasons it did (i.e. Moravec very early on realized the primacy of compute). Moreover, they didn’t forecast a whole lot of things that didn’t happen (like Kurzweil).

Did I make an epistemic error by not believing them earlier? Well, for starters I wasn’t really plugged in to the AI scene, so I hadn’t heard of them or their views. But suppose I had; should I have believed them? I’d argue I shouldn’t have given their view back then more than a little bit of credence.
Entropy is a mysterious physics word for irreducible uncertainty: the uncertainty that remains about the future even after accounting for all the data. In hindsight, we can say that massive GPU training on next-token prediction of all internet text data was (almost**) all you need for AGI. But was this forecastable?
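To pin down the sense of "irreducible uncertainty" used here (my gloss, not something spelled out in the post): if Y is the outcome being forecast and X is everything knowable at forecast time, the leftover uncertainty is the conditional entropy H(Y|X), which can stay large no matter how skilled the forecaster is.

```latex
% Minimal sketch of "irreducible uncertainty" as conditional Shannon entropy.
% Y = the outcome being forecast, X = all data available at forecast time.
% H(Y|X) = 0 only if Y is a deterministic function of X; whatever remains
% is uncertainty that no amount of better forecasting can remove.
H(Y \mid X) \;=\; -\sum_{x} p(x) \sum_{y} p(y \mid x)\,\log p(y \mid x)
\;\;\le\;\; H(Y) \;=\; -\sum_{y} p(y)\,\log p(y)
```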
For every Moravec and ~~Leike~~ Legg who turns out to be extraordinarily right in forecasting the future, there are dozens who weren’t. Even in 2018, when the first evidence for strong scaling laws on text data was being published by Baidu, I’d argue that an impartial observer should have only updated a moderate amount. Actually, even OpenAI itself wasn’t sold on unsupervised learning on text data until early GPT showed signs of life; they thought (like many other players in the field, e.g. DeepMind) that RL (in diverse environments) was the way to go.

To me the takeaway is that explicit forecasting can be useful, but it is exactly the black-swan events that are irreducibly uncertain (high entropy) that move history.
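As a reference point for the scaling-law evidence mentioned above: the Baidu results and the later scaling-law papers reported empirical power-law fits of roughly the form below. The constants and exponents are fitted per dataset and architecture; nothing here is taken from the post itself.

```latex
% Generic form of the empirical scaling laws: test loss falls as a power law
% in model parameters N, dataset size D, and training compute C.
% N_c, D_c, C_c and the exponents \alpha_* are fitted constants.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```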
*The story is that ~~Leike~~ Legg’s timelines have been 2030 for the past two decades.

**Regular readers will know my beef with the pure scaling hypothesis.
I didn’t know about Jan’s AI timelines. Shane Legg also had some decently early predictions of AI around 2030 (~2007 was the earliest I knew about).
Shane Legg had a 2028 median back in 2008; see e.g. https://e-discoveryteam.com/2023/11/17/shane-leggs-vision-agi-is-likely-by-2028-as-soon-as-we-overcome-ais-senior-moments/
That’s probably the one I was thinking of.
Oh no uh-oh, I think I might have confused Shane Legg with Jan Leike.
Fwiw, in 2016 I would have put something like 20% probability on what became known as ‘the scaling hypothesis’. I still had past-2035 median timelines, though.
What did you mean exactly in 2016 by the scaling hypothesis?

Having past-2035 timelines and believing in the pure scaling maximalist hypothesis (which fwiw I don’t believe in, for reasons I have explained elsewhere) are in direct conflict, so I’d be curious if you could detail your beliefs back then more exactly.
Something like ‘we could have AGI just by scaling up deep learning / deep RL, without any need for major algorithmic breakthroughs’.
I’m not sure the claim that these are in direct conflict is strictly true, though I agree with the ‘vibe’. I think there were probably a couple of things in play:
I still only had something like 20% on scaling, and I expected much more compute would likely be needed, especially in that scenario, but also more broadly (e.g. maybe something like the median in ‘bioanchors’, 35 OOMs of pretraining-equivalent compute, if I don’t misremember; though I definitely hadn’t thought very explicitly about how many OOMs of compute at that time), so I thought it would probably take decades to get to the required amount of compute.
I very likely hadn’t thought hard and long enough to necessarily integrate/make coherent my various beliefs.
Probably at least partly because there seemed to be a lot of social pressure from academic peers against even something like ’20% on scaling’, and even against taking AGI and AGI safety seriously at all. This likely made it harder to ‘viscerally feel’ what some of my beliefs might imply, and especially that it might happen very soon (which also had consequences in delaying when I’d go full-time into working on AI safety; along with thinking I’d have more time to prepare for it, before going all in).
Yeah, I do think that Moravec and Leike got the AI situation most correct, and yeah people were wrong to dismiss Yudkowsky for having short timelines.
This was the thing they got most correct, which is interesting because, unfortunately, Yudkowsky got almost everything else incorrect about how superhuman AIs would work, and also got the alignment situation very wrong, which is important to take note of.
LW in general got short timelines, and the idea that AI will probably be the biggest deal in history, correct, but went wrong in assuming it knew how AI would eventually work (remember when Eliezer Yudkowsky dismissed neural networks, in favor of legible logic, as the path to capabilities?), and also got the alignment situation very wrong: it way overcomplexified human values, leaned far too heavily on the evopsych frame for human values, and didn’t notice that the differences between humans and evolution that mattered for capabilities also mattered for alignment.
I believe a lot of the issue comes down to incorrectly conflating the logical possibility of misalignment with the probability of misalignment being high enough that we should take serious action. The interlocutors they talked with often denied that misalignment could happen at all, but LWers didn’t realize that reality doesn’t grade on a curve: though their arguments were better than their interlocutors’, that didn’t mean they were right.
Yudkowsky didn’t dismiss neural networks, iirc. He just said that there were a lot of different approaches to AI and that from the Outside View it didn’t seem clear which was promising; and plausibly even on an Inside View it wasn’t very clear that artificial neural networks were going to work, let alone work so well.

Re: alignment, I don’t follow. We don’t know who will ultimately be proved right on alignment, so I’m not sure how you can make such strong statements about whether Yudkowsky was right or wrong on this aspect.

We haven’t really gained that many bits on this question, and plausibly will not gain many until later (by which time it might be too late, if Yudkowsky is right).

I do agree that Yudkowsky’s statements occasionally feel too confidently and dogmatically pessimistic on the question of Doom. But I would argue that the problem is that we simply don’t know, because of irreducible uncertainty; not that Doom is unlikely.
Mostly, I’m annoyed by how much his argumentation around alignment matches the pattern of dismissing various approaches to alignment using similar reasoning to how he dismissed neural networks:
Even if it was correct to dismiss neural networks years ago, it isn’t now, so it’s not a good sign that his arguments still rely on this style of reasoning:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#HpPcxG9bPDFTB4i6a
I am going to argue that we do have quite a lot of bits on alignment, and the basic argument can be summarized like this:
Human values are much less complicated, and much more influenced by data, than people thought 15-20 years ago, and thus much, much easier to specify than people thought back then.
That’s the takeaway I have from current LLMs handling human values, and I basically agree with Linch’s summary of Matthew Barnett’s post on the historical value misspecification argument of what that means in practice for alignment:
https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument#N9ManBfJ7ahhnqmu7
It’s not about LLM safety properties, but about what has been revealed about our values.
Another way to say it is that we don’t need to reverse-engineer social instincts for alignment, contra @Steven Byrnes, because we can massively simplify, in code, what the social-instinct parts of our brain that contribute to alignment are doing. While the mechanisms by which humans acquire their morality and avoid becoming psychopaths are complicated, that doesn’t matter, because we can replicate their function with much simpler code and data, and go to a more blank-slate design for AIs:
https://www.lesswrong.com/posts/PTkd8nazvH9HQpwP8/building-brain-inspired-agi-is-infinitely-easier-than#If_some_circuit_in_the_brain_is_doing_something_useful__then_it_s_humanly_feasible_to_understand_what_that_thing_is_and_why_it_s_useful__and_to_write_our_own_CPU_code_that_does_the_same_useful_thing_
(A similar trick is one path to solving robotics for AIs, but note this is only one part; it might be that the solution routes through a different mechanism.)
Really, I’m not mad about his original ideas; they weren’t obviously incorrect and might have turned out right. I’m mad that he didn’t realize he had to update to reality more radically than he did, and that he seems to conflate the bad argument ‘AI will understand our values, therefore it’s safe’ with the better argument that LLMs show it’s easier to specify values without drastically wrong results, which isn’t a complete solution to alignment, but is a big advance on outer alignment in the usual dichotomy.
It’s a plausible argument imho. Time will tell.
To my mind an important dimension, perhaps the most important dimension, is how values evolve under reflection.
It’s quite plausible to me that an AI that starts out with pretty aligned values will self-reflect into evil. This is certainly not unheard of in the real world (let alone fiction!). Of course, it’s a question about the basin of attraction around helpfulness and harmlessness. I guess I have only weak priors on what this might look like under reflection, although plausibly friendliness is magic.
I disagree, but this could be a difference in definition of what “perfectly aligned values” means. E.g. if the AI is dumb (for an AGI) and in a rush, sure. If it’s a superintelligence already, even in a rush, it seems unlikely. [edit:] If we have found an SAE feature which seems to light up for good stuff and down for bad stuff 100% of the time, and then we clamp it, then yeah, that could go away on reflection.
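To make the “[edit:]” scenario a bit more concrete, here is a minimal sketch of what “clamping an SAE feature” means mechanically. Everything in it (the toy SAE, the dimensions, the feature index) is a hypothetical illustration, not code for any real model or interpretability library:

```python
import torch
import torch.nn as nn

# Toy sparse autoencoder (SAE) over a residual-stream-like activation.
# Purely illustrative: dimensions, feature index, and the SAE itself are
# made up for this sketch, not taken from any real trained system.
class ToySAE(nn.Module):
    def __init__(self, d_model: int = 64, d_features: int = 512):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activation: torch.Tensor) -> torch.Tensor:
        features = torch.relu(self.encoder(activation))  # sparse feature activations
        return self.decoder(features)                    # reconstructed activation


def clamp_feature(sae: ToySAE, activation: torch.Tensor,
                  feature_idx: int, value: float = 10.0) -> torch.Tensor:
    """Encode the activation, pin one feature to a fixed value, decode.

    This is the intervention described above: force the "lights up for good
    stuff" feature on, regardless of what the model's own computation
    would have set it to.
    """
    features = torch.relu(sae.encoder(activation))
    features[..., feature_idx] = value  # the clamp
    return sae.decoder(features)


# Usage sketch: compute a clamped reconstruction and patch it back into the
# model's forward pass (e.g. via a forward hook on the relevant layer).
sae = ToySAE()
resid = torch.randn(1, 64)  # stand-in for a residual-stream activation
with torch.no_grad():
    patched_resid = clamp_feature(sae, resid, feature_idx=123)
```

Nothing in this intervention touches the rest of the model’s computation, which is why an effect produced this way could plausibly “go away on reflection”, as the comment notes.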
Another way to say it is to ask how values evolve in OOD (out-of-distribution) situations.
My general prior, albeit a reasonably weak one, is that the single best way to predict how values evolve is to look at their data sources, as well as what data they have received up to now; the second best way is to look at what their algorithms are, especially for social situations; and most of the other factors don’t matter nearly as much.
I think this statement (“Yudkowsky got almost everything else incorrect about how superhuman AIs would work”) is incredibly overconfident, because literally nobody knows how superhuman AI would work.
And, I think, this is the general shape of the problem: an incredible number of people got incredibly overindexed on how LLMs worked in 2022-2023 and drew conclusions which seem plausible, but are not as probable as these people think.
Okay, I talked more about what conclusions we can draw from LLMs that actually generalize to superhuman AI here, so go check that out:
https://www.lesswrong.com/posts/tDkYdyJSqe3DddtK4/alexander-gietelink-oldenziel-s-shortform#mPaBbsfpwgdvoK2Z2
The really short summary is that human values are less complicated and more dependent on data than people thought, and that we can specify our values rather easily without it going drastically wrong:
This is not a property of LLMs, but of us.
Is that supposed to be a link?
I rewrote the comment to put the link immediately below the first sentence.
The link is at the very bottom of the comment.