sort of person who gets taken in by Hanson’s arguments in 2008 and gets caught flatfooted by AlphaGo and GPT-3 and AlphaFold 2
I find this kind of bluster pretty frustrating and condescending. I also feel like the implication is just wrong—if Eliezer and I disagree, I’d guess it’s because he’s worse at predicting ML progress. To me GPT-3 feels much (much) closer to my mainline than to Eliezer’s, and AlphaGo is very unsurprising. But it’s hard to say who was actually “caught flatfooted” unless we are willing to state some of these predictions in advance.
I got pulled into this interaction because I wanted to get Eliezer to make some real predictions, on the record, so that we could have a better version of this discussion in 5 years rather than continuing to both say “yeah, in hindsight this looks like evidence for my view.” I apologize if my tone (both in that discussion and in this comment) is a bit frustrated.
It currently feels from the inside like I’m holding the epistemic high ground on this point, though I expect Eliezer disagrees strongly:
I’m willing to bet on anything Eliezer wants, or to propose my own questions if Eliezer is willing in principle to make forecasts. I expect to outperform Eliezer on these bets and am happy to state in advance that I’d update in his direction if his predictions turned out to be as good as mine. It’s possible that we don’t have disagreements, but I doubt it. (See my other comment.)
I’m not talking this much smack based on “track records” imagined in hindsight. I think that if you want to do this then you should have been making predictions in the past, and you definitely should be willing to make predictions about the future. (I suspect you’ll often find that other people don’t disagree with the predictions that turned out to be reasonable, even if from your perspective it was all part of one coherent story.)
I wish to acknowledge this frustration, and state generally that I think Paul Christiano occupies a distinct and more clueful class than a lot of, like, early EAs who mm-hmmmed along with Robin Hanson on AI—I wouldn’t put, eg, Dario Amodei in that class either, though we disagree about other things.
But again, Paul, it’s not enough to say that you weren’t surprised by GPT-2/3 in retrospect, it kinda is important to say it in advance, ideally where other people can see? Dario picks up some credit for GPT-2/3 because he clearly called it in advance. You don’t need to find exact disagreements with me to start going on the record as a forecaster, if you think the course of the future is generally narrower than my own guesses—if you think that trends stay on course, where I shrug and say that they might stay on course or break. (Except that of course in hindsight somebody will always be able to draw a straight-line graph, once they know which graph to draw, so my statement “it might stay on trend or maybe break” applies only to graphs extrapolating into what is currently the future.)
Suppose your view is “crazy stuff happens all the time” and my view is “crazy stuff happens rarely.” (Of course “crazy” is my word, to you it’s just normal stuff.) Then what am I supposed to do, in your game?
More broadly: if you aren’t making bold predictions about the future, why do you think that other people will? (My predictions all feel boring to me.) And if you do have bold predictions, can we talk about some of them instead?
It seems to me like I want you to say “well I think 20% chance something crazy happens here” and I say “nah, that’s more like 5%” and then we batch up 5 of those and when none of them happen I get a bayes point.
I could just give my forecast. But then if I observe that 2⁄20 of them happen, how exactly does that help me in figuring out whether I should be paying more attention to your views (or help you snap out of it)?
I can list some particular past bets and future forecasts, but it’s really unclear what to do with them without quantitative numbers or a point of comparison.
Like you I’ve predicted that AI is undervalued and will grow in importance, although I think I made a much more specific prediction that investment in AI would go up a lot in the short term. This made me some money, but like you I just don’t care much about money and it’s not a game worth playing. I bet quite explicitly on deep learning by pivoting my career into practical ML and then spending years of my life working on it, despite loving theory and thinking it’s extremely important. We can debate whether the bet is good, but it was certainly a bet and by my lights it looks very reasonable in retrospect.
Over the next 10 years I think powerful ML systems will be trained mostly by imitating human behavior over short horizons, and then fine-tuned using much smaller amounts of long-horizon feedback. This has long been my prediction, and it’s why I’ve been interested in language modeling, and has informed some of my research. I think that’s still basically valid and will hold up in the future. I predict that people will explicitly collect much larger datasets of human behavior as the economic stakes rise. This is in contrast to e.g. theorem-proving working well, although I think that theorem-proving may end up being an important bellwether because it allows you to assess the capabilities of large models without multi-billion-dollar investments in training infrastructure.
I expect to see truly massive training runs in the not that distant future. I think the current rate of scaling won’t be sustained, but that over the next 10-20 years scaling will get us into human-level behavior for “short-horizon” tasks which may or may not suffice for transformative AI. I expect that to happen at model sizes within 2 orders of magnitude of the human brain on one side or the other, i.e. 1e12 to 1e16 parameters.
I could list a lot more, but I don’t think any of it seems bold and it’s not clear what the game is. It’s clearly bold by comparison to market forecasts or broader elite consensus, but so what? I understand much better how to compare one predictor to another. I mostly don’t know what it means to evaluate a predictor on an absolute scale.
I predict that people will explicitly collect much larger datasets of human behavior as the economic stakes rise. This is in contrast to e.g. theorem-proving working well, although I think that theorem-proving may end up being an important bellwether because it allows you to assess the capabilities of large models without multi-billion-dollar investments in training infrastructure.
Well, it sounds like I might possibly be more bullish than you on theorem-proving? Not on it being useful or profitable, but in terms of the underlying technology making progress on non-profitable amazing demo feats. Is there anything you think it shouldn’t be able to do in the next 5 years?
I’m going to make predictions by drawing straight-ish lines through metrics like the ones in the gpt-f paper. Big unknowns are then (i) how many orders of magnitude of “low-hanging fruit” there are before theorem-proving even catches up to the rest of NLP, and (ii) how hard those benchmarks are compared to other tasks we care about. On (i) my guess is maybe 2? On (ii) my guess is “they are pretty easy” / “humans are pretty bad at these tasks,” but it’s somewhat harder to quantify. If you think your methodology is different from that, then we will probably end up disagreeing.
Looking towards more ambitious benchmarks, I think that the IMO grand challenge is currently significantly more than 5 years away. In 5 years’ time my median guess (with almost no thinking about it) is that automated solvers can do 10% of non-geometry, non-3-variable-inequality IMO shortlist problems.
So yeah, I’m happy to play ball in this area, and I expect my predictions to be somewhat more right than yours after the dust settles. Is there some way of measuring such that you are willing to state any prediction?
(I still feel like I’m basically looking for any predictions at all beyond sometimes saying “my model wouldn’t be surprised by <vague thing X>”, whereas I’m pretty constantly throwing out made-up guesses which I’m happy to refine with more effort. Obviously I’m going to look worse in retrospect than you if we keep up this way, though; that particular asymmetry is a lot of the reason people mostly don’t play ball. ETA: that’s a bit unfair, the romantic chatbot vs. self-driving car prediction is one where we’ve both given off-the-cuff takes.)
I have a sense that there’s a lot of latent potential for theorem-proving to advance if more energy gets thrown at it, in part because current algorithms seem a bit weird to me—that we are waiting on the equivalent of neural MCTS as an enabler for AlphaGo, not just a bigger investment, though of course the key trick could already have been published in any of a thousand papers I haven’t read. I feel like I “would not be surprised at all” if we get a bunch of shocking headlines in 2023 about theorem-proving problems falling, after which the IMO challenge falls in 2024 - though of course, as events like this lie in the Future, they are very hard to predict.
Can you say more about why or whether you would, in this case, say that this was an un-Paulian set of events? As I have trouble manipulating my Paul model, it does not exclude Paul saying, “Ah, yes, well, they were using 700M models in that paper, so if you jump to 70B, of course the IMO grand challenge could fall; there wasn’t a lot of money there.” Though I haven’t even glanced at any metrics here, let alone metrics that the IMO grand challenge could be plotted on, so if smooth metrics rule out IMO in 5yrs, I am more interested yet—it legit decrements my belief, but not nearly as much as I imagine it would decrement yours.
(Edit: Also, on the meta-level, is this, like, anywhere at all near the sort of thing you were hoping to hear from me? Am I now being a better epistemic citizen, if maybe not a good one by your lights?)
Yes, IMO challenge falling in 2024 is surprising to me at something like the 1% level or maybe even more extreme (though could also go down if I thought about it a lot or if commenters brought up relevant considerations, e.g. I’d look at IMO problems and gold medal cutoffs and think about what tasks ought to be easy or hard; I’m also happy to make more concrete per-question predictions). I do think that there could be huge amounts of progress from picking the low hanging fruit and scaling up spending by a few orders of magnitude, but I still don’t expect it to get you that far.
I don’t think this is an easy prediction to extract from a trendline, in significant part because you can’t extrapolate trendlines this early that far out. So this is stress-testing different parts of my model, which is fine by me.
At the meta-level, this is the kind of thing I’m looking for, though I’d prefer to have some kind of quantitative measure of how not-surprised you are. If you are only saying 2% then we probably want to talk about things less far in your tails than the IMO challenge.
Okay, then we’ve got at least one Eliezerverse item, because I’ve said below that I think I’m at least 16% for IMO theorem-proving by end of 2025. The drastic difference here causes me to feel nervous, and my second-order estimate has probably shifted some in your direction just from hearing you put 1% on 2024, but that’s irrelevant because it’s first-order estimates we should be comparing here.
So we’ve got huge GDP increases for before-End-days signs of Paulverse and quick IMO proving for before-End-days signs of Eliezerverse? Pretty bare portfolio but it’s at least a start in both directions. If we say 5% instead of 1%, how much further would you extend the time limit out beyond 2024?
I also don’t know at all what part of your model forbids theorem-proving to fall in a shocking headline followed by another headline a year later—it doesn’t sound like it’s from looking at a graph—and I think that explaining reasons behind our predictions in advance, not just making quantitative predictions in advance, will help others a lot here.
EDIT: Though the formal IMO challenge has a barnacle about the AI being open-sourced, which is a separate sociological prediction I’m not taking on.
I think IMO gold medal could be well before massive economic impact, I’m just surprised if it happens in the next 3 years. After a bit more thinking (but not actually looking at IMO problems or the state of theorem proving) I probably want to bump that up a bit, maybe 2%, it’s hard reasoning about the tails.
I’d say <4% on end of 2025.
I think this is the flipside of me having an intuition where I say things like “AlphaGo and GPT-3 aren’t that surprising”—I have a sense for what things are and aren’t surprising, and not many things happen that are so surprising.
If I’m at 4% and you are 12% and we had 8 such bets, then I can get a factor of 2 if they all come out my way, and you get a factor of ~1.5 if one of them comes out your way.
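A quick back-of-the-envelope check of those factors (a sketch only; it uses the hypothetical 4% vs 12% per-bet numbers above and assumes the eight bets resolve independently) comes out at roughly 2.0 and 1.6, consistent with the figures cited above:

```python
# Rough check of the likelihood ratios from batching 8 independent bets,
# with hypothetical per-bet probabilities of 4% vs 12%.
p_low, p_high, n_bets = 0.04, 0.12, 8

# If none of the 8 events happen, each resolution favors the lower forecast.
factor_if_none_happen = ((1 - p_low) / (1 - p_high)) ** n_bets

# If exactly one event happens, that bet favors the higher forecast (0.12/0.04 = 3x),
# while the other 7 still favor the lower one.
factor_if_one_happens = (p_high / p_low) * ((1 - p_high) / (1 - p_low)) ** (n_bets - 1)

print(f"Likelihood ratio toward 4% if 0/8 happen: {factor_if_none_happen:.2f}")   # ~2.0
print(f"Likelihood ratio toward 12% if 1/8 happen: {factor_if_one_happens:.2f}")  # ~1.6
```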
I might think more about this and get a more coherent probability distribution, but unless I say something else by end of 2021 you can consider 4% on end of 2025 as my prediction.
Maybe another way of phrasing this—how much warning do you expect to get, how far out does your Nope Vision extend? Do you expect to be able to say “We’re now in the ‘for all I know the IMO challenge could be won in 4 years’ regime” more than 4 years before it happens, in general? Would it be fair to ask you again at the end of 2022 and every year thereafter if we’ve entered the ‘for all I know, within 4 years’ regime?
Added: This question fits into a larger concern I have about AI soberskeptics in general (not you, the soberskeptics would not consider you one of their own) where they saunter around saying “X will not occur in the next 5 / 10 / 20 years” and they’re often right for the next couple of years, because there’s only one year where X shows up for any particular definition of that, and most years are not that year; but also they’re saying exactly the same thing up until 2 years before X shows up, if there’s any early warning on X at all. It seems to me that 2 years is about as far as Nope Vision extends in real life, for any case that isn’t completely slam-dunk; when I called upon those gathered AI luminaries to say the least impressive thing that definitely couldn’t be done in 2 years, and they all fell silent, and then a single one of them named Winograd schemas, they were right that Winograd schemas at the stated level didn’t fall within 2 years, but very barely so (they fell the year after). So part of what I’m flailingly asking here, is whether you think you have reliable and sensitive Nope Vision that extends out beyond 2 years, in general, such that you can go on saying “Not for 4 years” up until we are actually within 6 years of the thing, and then, you think, your Nope Vision will actually flash an alert and you will change your tune, before you are actually within 4 years of the thing. Or maybe you think you’ve got Nope Vision extending out 6 years? 10 years? Or maybe theorem-proving is just a special case and usually your Nope Vision would be limited to 2 years or 3 years?
This is all an extremely Yudkowskian frame on things, of course, so feel free to reframe.
I think I’ll get less confident as our accomplishments get closer to the IMO grand challenge. Or maybe I’ll get much more confident if we scale up from $1M → $1B and pick the low-hanging fruit without getting fairly close, since at that point further progress gets a lot easier to predict.
There’s not really a constant time horizon for my pessimism; it depends on how long and robust a trend you are extrapolating from. 4 years feels like a relatively short horizon because theorem-proving has not had much investment, so compute can be scaled up several orders of magnitude, there is likely lots of low-hanging fruit to pick, and we just don’t have much to extrapolate from (compared to more mature technologies, or to how I expect AI to look shortly before the end of days). For similar reasons there aren’t really any benchmarks to extrapolate from.
(Also note that it matters a lot whether you know what problems labs will try to take a stab at. For the purpose of all of these forecasts, I am trying insofar as possible to set aside all knowledge about what labs are planning to do, though that’s obviously not incentive-compatible and there’s no particular reason you should trust me to do that.)
I feel like I “would not be surprised at all” if we get a bunch of shocking headlines in 2023 about theorem-proving problems falling, after which the IMO challenge falls in 2024
Possibly helpful: Metaculus currently puts the chances of the IMO grand challenge falling by 2025 at about 8%. Their median is 2039.
I think this would make a great bet, as it would definitely show that your model can strongly outperform a lot of people (and potentially Paul too). And the operationalization for the bet is already there—so little work will be needed to do that part.
Ha! Okay then. My probability is at least 16%, though I’d have to think more and Look into Things, and maybe ask for such sad little metrics as are available before I was confident saying how much more. Paul?
EDIT: I see they want to demand that the AI be open-sourced publicly before the first day of the IMO, which unfortunately sounds like the sort of foolish little real-world obstacle which can prevent a proposition like this from being judged true even where the technical capability exists. I’ll stand by a >16% probability of the technical capability existing by end of 2025, as reported on eg solving a non-trained/heldout dataset of past IMO problems, conditional on such a dataset being available; I frame no separate sociological prediction about whether somebody is willing to open-source the AI model that does it.
I don’t care about whether the AI is open-sourced (I don’t expect anyone to publish the weights even if they describe their method) and I’m not that worried about our ability to arbitrate overfitting.
Ajeya suggested that I clarify: I’m significantly more impressed by an AI getting a gold medal than getting a bronze, and my 4% probability is for getting a gold in particular (as described in the IMO grand challenge). There are some categories of problems that can be solved using easy automation (I’d guess about 5-10% could be done with no deep learning and modest effort). Together with modest progress in deep learning based methods, and a somewhat serious effort, I wouldn’t be surprised by people getting up to 20-40% of problems. The bronze cutoff is usually 3⁄6 problems, and the gold cutoff is usually 5⁄6 (assuming the AI doesn’t get partial credit). The difficulty of problems also increases very rapidly for humans—there are often 3 problems that a human can do more-or-less mechanically.
I could tighten any of these estimates by looking at the distribution more carefully rather than going off of my recollections from 2008, and if this was going to be one of a handful of things we’d bet about I’d probably spend a few hours doing that and some other basic digging.
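To make the bronze-vs-gold gap concrete, here is a small sketch of the kind of per-question decomposition mentioned above. The per-problem solve probabilities are invented for illustration, and the problems are treated as independent, which is generous:

```python
from itertools import product

# Hypothetical probabilities that an automated prover fully solves each of the
# 6 problems on some particular IMO (easier 1/4, medium 2/5, hard 3/6). Made up.
p_solve = [0.7, 0.5, 0.05, 0.7, 0.3, 0.05]

def prob_at_least(k, probs):
    """P(at least k problems solved), brute-forced over all 2^6 outcomes,
    assuming the problems are solved independently."""
    total = 0.0
    for outcome in product([0, 1], repeat=len(probs)):
        if sum(outcome) >= k:
            pr = 1.0
            for solved, p in zip(outcome, probs):
                pr *= p if solved else (1 - p)
            total += pr
    return total

print(f"P(bronze-level, >=3/6): {prob_at_least(3, p_solve):.3f}")
print(f"P(gold-level,   >=5/6): {prob_at_least(5, p_solve):.3f}")
```

With these made-up numbers the 3⁄6 bronze cutoff comes out around 40% while the 5⁄6 gold cutoff is under 1%, which is the shape of the asymmetry being described.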
I looked at a few recent IMOs to get better calibrated. I think the main update is that I significantly underestimated how many years you can get a gold with only 4⁄6 problems.
For example I don’t have the same “this is impossible” reaction about IMO 2012 or IMO 2015 as about most years. That said, I feel like they do have to get reasonably lucky with the IMO content, and someone has to make a serious and mostly-successful effort, but I’m at least a bit scared by that. There’s also quite often a geo problem as 3 or 6.
Might be good to make some side bets:
Conditioned on winning I think it’s only maybe 20% probability to get all 6 problems (whereas I think you might have a higher probability on jumping right past human level, or at least have 50% on 6 vs 5?).
Conditioned on a model getting 3+ problems I feel like we have a pretty good guess about what algorithm will be SOTA on this problem (e.g. I’d give 50% to a pretty narrow class of algorithms with some uncertain bells and whistles, with no inside knowledge). Whereas I’d guess you have a much broader distribution.
But more useful to get other categories of bets. (Maybe in programming, investment in AI, economic impact from robotics, economic impact from chatbots, translation?)
Going through the previous ten IMOs, and imagining a very impressive automated theorem prover, I think:
2020 - unlikely, need 5⁄6 and probably can’t get problems 3 or 6. Also good chance to mess up at 4 or 5
2019 - tough but possible, 3 seems hard but even that is not unimaginable, 5 might be hard but might be straightforward, and it can afford to get one wrong
2018 - tough but possible, 3 is easier for machine than human but probably still hard, 5 may be hard, can afford to miss one
2017 - tough but possible, 3 looks out of reach, 6 looks hard but not sure about that, 5 looks maybe hard, 1 is probably easy. But it can miss 2, which could happen.
2016 - probably not possible, 3 and 6 again look hard, and good chance to fail on 2 and 5, only allowed to miss 1
2015 - seems possible, 3 might be hard but like 50-50 it’s simple for machine, 6 is probably hard, but you can miss 2
2014 - probably not possible, can only miss 1, probably miss one of 2 or 5 and 6
2013 - probably not possible, 6 seems hard, 2 seems very hard, can only miss 1
2012 - tough but possible, 6 and 3 look hard but you can miss 2
2011 - seems possible, allowed to miss two and both 3 and 6 look brute-forceable
Overall this was much easier than I expected. 4⁄10 seem unlikely, 4⁄10 seem tough but possible, 2⁄10 I can imagine a machine doing it. There are a lot of problems that look really hard, but there are a fair number of tests where you can just skip those.
That said, even to get the possible ones you do need to be surprisingly impressive, and that’s getting cut down by like 25-50% for a solvable test. Still, they get to keep trying (assuming they get promising results in early years), and eventually they will hit one of the easier years.
It also looks fairly likely to me that if one of DeepMind or OpenAI tries seriously they will be able to get an honorable mention (HM) with a quite reasonable chance at bronze, and this is maybe enough of a PR coup to motivate work, and then it’s more likely there will be a large effort subsequently to finish the job or to opportunistically take advantage of an easy test.
Overall I’m feeling bad about my 4%, I deserve to lose some points regardless but might think about what my real probability is after looking at tests (though I was also probably moved by other folks in EA systematically giving higher estimates than I did).
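One way to roll those per-year impressions into a headline number is sketched below; every probability in it is an invented placeholder (not a figure stated anywhere above), years are treated as independent, and the output is only as meaningful as those made-up inputs:

```python
# Map the qualitative per-year impressions above onto rough per-year chances
# that a serious effort clears the gold cutoff on that year's test (invented).
p_win_given_year_type = {
    "unlikely": 0.01,
    "tough but possible": 0.03,
    "possible": 0.08,
}
# Frequencies from the ten past tests reviewed above: 4/10, 4/10, 2/10.
type_frequency = {"unlikely": 0.4, "tough but possible": 0.4, "possible": 0.2}

# Expected chance of winning on a single future test drawn from the same mix.
p_win_one_year = sum(p_win_given_year_type[t] * f for t, f in type_frequency.items())

# Chance of winning at least once across four attempts (2022-2025), assuming
# independence across years and a sustained effort every year.
p_win_by_2025 = 1 - (1 - p_win_one_year) ** 4
print(f"Per-year chance: {p_win_one_year:.3f}; chance of a win by 2025: {p_win_by_2025:.2f}")
```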
Based on the other thread I now want to revise this prediction, both because 4% was too low and “IMO gold” has a lot of noise in it based on test difficulty.
I’d put 4% on “For the 2022, 2023, 2024, or 2025 IMO an AI built before the IMO is able to solve the single hardest problem” where “hardest problem” = “usually problem #6, but use problem #3 instead if either: (i) problem 6 is geo or (ii) problem 3 is combinatorics and problem 6 is algebra.” (I’d prefer to just pick the hardest problem after seeing the test, but it seems better to commit to a procedure.)
Maybe I’ll go 8% on “gets gold” instead of “solves hardest problem.”
Would be good to get your updated view on this so that we can treat it as staked out predictions.
(News: OpenAI has built a theorem-prover that solved many AMC12 and AIME competition problems, and 2 IMO problems, and they say they hope this leads to work that wins the IMO Grand Challenge.)
I’ll stand by a >16% probability of the technical capability existing by end of 2025, as reported on eg solving a non-trained/heldout dataset of past IMO problems, conditional on such a dataset being available
It feels like this bet would look a lot better if it were about something that you predict at well over 50% (with people in Paul’s camp still maintaining less than 50%). So, we could perhaps modify the terms such that the bot would only need to surpass a certain rank or percentile-equivalent in the competition (and not necessarily receive the equivalent of a Gold medal).
The relevant question is which rank/percentile you think is likely to be attained by 2025 under your model but you predict would be implausible under Paul’s model. This may be a daunting task, but one way to get started is to put a probability distribution over what you think the state-of-the-art will look like by 2025, and then compare to Paul’s.
I expect it to be hella difficult to pick anything where I’m at 75% that it happens in the next 5 years and Paul is at 25%. Heck, it’s not easy to find things where I’m at over 75% that aren’t just obvious slam dunks; the Future isn’t that easy to predict. Let’s get up to a nice crawl first, and then maybe a small portfolio of crawlings, before we start trying to make single runs that pierce the sound barrier.
I frame no prediction about whether Paul is under 16%. That’s a separate matter. I think a little progress is made toward eventual epistemic virtue if you hand me a Metaculus forecast and I’m like “lol wut” and double their probability, even if it turns out that Paul agrees with me about it.
It feels like this bet would look a lot better if it were about something that you predict at well over 50% (with people in Paul’s camp still maintaining less than 50%).
My model of Eliezer may be wrong, but I’d guess that this isn’t a domain where he has many over-50% predictions of novel events at all? See also ‘I don’t necessarily expect self-driving cars before the apocalypse’.
My Eliezer-model has a more flat prior over what might happen, which therefore includes stuff like ‘maybe we’ll make insane progress on theorem-proving (or whatever) out of the blue’. Again, I may be wrong, but my intuition is that you’re Paul-omorphizing Eliezer when you assume that >16% probability of huge progress in X by year Y implies >50% probability of smaller-but-meaningful progress in X by year Y.
If this task is bad for operationalization reasons, there are other theorem proving benchmarks. Unfortunately it looks like there aren’t a lot of people that are currently trying to improve on the known benchmarks, as far as I’m aware.
The code generation benchmarks are slightly more active. I’m personally partial to Hendrycks et al.’s APPS benchmark, which includes problems that “range in difficulty from introductory to collegiate competition level and measure coding and problem-solving ability.” (Github link).
I think Metaculus is closer to Eliezer here: conditioned on this problem being resolved it seems unlikely for the AI to be either open-sourced or easily reproducible.
To me GPT-3 feels much (much) closer to my mainline than to Eliezer’s
To add to this sentiment, I’ll post the graph from my notebook on language model progress. I refer to the Penn Treebank task a lot when making this point because it seems to have a lot of good data, but you can also look at the other tasks and see basically the same thing.
The last dip in the chart is from GPT-3. It looks like GPT-3 was indeed a discontinuity in progress, but not a very shocking one. It would have taken roughly one or two more years at ordinary progress to get to that point anyway—which I just don’t see as being all that impressive.
I sorta feel like the main reason why lots of people found GPT-3 so impressive was because OpenAI was just good at marketing the results [ETA: sorry, I take back the use of the word “marketing”]. Maybe OpenAI saw an opportunity to dump a lot of compute into language models and have a two year discontinuity ahead of everyone else, and showcase their work. And that strategy seemed to really work well for them.
I admit this is an uncharitable explanation, but is there a better story to tell about why GPT-3 captured so much attention?
The impact of GPT-3 had nothing whatsoever to do with its perplexity on Penn Treebank. I think this is a good example of why focusing on perplexity and ‘straight lines on graph go brr’ is so terrible, such cargo cult mystical thinking, and crippling. There’s something astonishing about seeing someone resort to explaining away GPT-3's impact as ‘OpenAI was just good at marketing the results’. Said marketing consisted of: ‘dropping a paper on Arxiv’. Not even tweeting it! They didn’t even tweet the paper! (Forget an OA blog post, accompanying NYT/TR articles, tweets by everyone at OA, a fancy interactive interface—none of that.) And most of the initial reaction was “GPT-3: A Disappointing Paper”-style. If this is marketing genius, then it is truly 40-d chess, is all I can say.
The impact of GPT-3 was in establishing that trendlines did continue in a way that shocked pretty much everyone who’d written off ‘naive’ scaling strategies. Progress is made out of stacked sigmoids: if the next sigmoid doesn’t show up, progress doesn’t happen. Trends happen, until they stop. Trendlines are not caused by the laws of physics. You can dismiss AlphaGo by saying “oh, that just continues the trendline in ELO I just drew based on MCTS bots”, but the fact remains that MCTS progress had stagnated, and here we are in 2021, and pure MCTS approaches do not approach human champions, much less beat them. (This is also true of SVMs. Notice SVMs solving ImageNet because the trendlines continued? No, of course you did not. It drives me bonkers to see AI Impacts etc make arguments like “deep learning is unimportant because look, ImageNet follows a trendline”. Sheer numerology.) Appealing to trendlines is roughly as informative as “calories in calories out”; ‘the trend continued because the trend continued’. A new sigmoid being discovered is extremely important.
GPT-3 further showed completely unpredicted emergence of capabilities across downstream tasks which are not measured in PTB perplexity. There is nothing obvious about a PTB BPC of 0.80 that causes it to be useful where 0.90 is largely useless and 0.95 is a laughable toy. (OAers may have had faith in scaling, but they could not have told you in 2015 that interesting behavior would start at 𝒪(1b), and it’d get really cool at 𝒪(100b).) That’s why it’s such a useless metric. There’s only one thing that a PTB perplexity can tell you, under the pretraining paradigm: when you have reached human AGI level. (Which is useless for obvious reasons: much like saying that “if you hear the revolver click, the bullet wasn’t in that chamber and it was safe”. Surely true, but a bit late.) It tells you nothing about intermediate levels. I’m reminded of the Steven Kaas line:
Why idly theorize when you can JUST CHECK and find out the ACTUAL ANSWER to a superficially similar-sounding question SCIENTIFICALLY?
Using PTB, and talking only about perplexity, is a precise answer to the wrong question. (This is a much better argument when it comes to AlphaGo/ELO, because at least there, ‘ELO’ is in fact the ultimate objective, and not a proxy pretext. But perplexity is of no interest to anyone except an information theorist. Unfortunately, we lack any ‘take-over-the-world-ELO’ we can benchmark models on and extrapolate there. If we did and there was a smooth curve, I would indeed agree that we should adopt that as the baseline. But the closest things we have to downstream tasks are all wildly jumpy—even superimposing scores of downstream tasks barely gives you a recognizable smooth curve, and certainly nothing remotely as smooth as the perplexity curve. My belief is that this is because the overall perplexity curve comes from hundreds or thousands of stacked sigmoids and plateaus/breakthroughs averaging out in terms of prediction improvements.) It sure would be convenient if the only number that mattered in AI or its real-world impact or risk was also the single easiest one to measure!
I emphasized this poverty of extrapolation in my scaling hypothesis writeup already, but permit me to vent a little more here:
“So, you’re forecasting AI progress using PTB perplexity/BPC. Cool, good work, nice notebook, surely this must be useful for forecasting on substantive AI safety/capability questions of interest to us. I see it’s a pretty straight line on a graph. OK, can you tell me at what BPC a large language model could do stuff like hack computers and escape onto the Internet?”
“No. I can tell you what happens if I draw the line out x units, though.”
“Perhaps that’s an unfairly specific question to ask, as important as it is. OK, can you tell me when we can expect to see well-known benchmarks like Winograd schemas be solved?”
“No. I can draw you a line on PTB to estimate when PTB is solved, though, if you give me a second and define a bound for ‘PTB is solved’.”
“Hm. Can you at least tell me when we can expect to see meta-learning emerge, with good few-shot learning—does the graph predict 0.1b, 1b, 10b, 100b, or what?”
“No idea.”
“Do you know what capabilities will be next to emerge? We got pretty good programming performance in Copilot at 𝒪(100b), what’s next?”
“I don’t know.”
“Can you qualitatively describe what we’d get at 1t, or 10t?”
“No, but I can draw the line in perplexity. It gets pretty low.”
“How about the existence of any increasing returns to scale in downstream tasks? Does it tell us anything about spikes in capabilities (such as we observe in many places, such as text style transfer and inner monologue in LaMDA at 100b; most recently BIG-bench)? Such as whether there are any more spikes past 𝒪(100b), whether we’ll see holdouts like causality suddenly fall at 𝒪(1000b), anything like that?”
“No.”
“How about RL: what sort of world modeling can we get by plugging them into DRL agents?”
“I don’t know.”
“Fine, let’s leave it at tool AIs doing text in text out. Can you tell me how much economic value will be driven by dropping another 0.01 BPC?”
“No. I can tell you how much it’d cost in GPU-time, though, by the awesome power of drawing lines!”
“OK, how about that: how low does it need to go to support a multi-billion dollar company running something like the OA API, to defray the next 0.01 drop and pay for the GPU-time to get more drops?”
“No idea.”
“How do you know BPC is the right metric to use?”
“Oh, we have lots of theories about it, but I’ll level with you: we always have theories for everything, but really, we chose BPC post hoc out of a few thousand metrics littering Arxiv like BLEU, ROUGE, SSA etc after seeing that it worked and better BPC = better models.”
“Can you write down your predictions about any of this?”
“Absolutely not.”
“Can anyone else?”
“Sure. But they’re all terribly busy.”
“Did you write down your predictions before now, then?”
“Oh gosh no, I wasn’t around then.”
“Did… someone… else… write down their predictions before?”
“Not that I’m aware of.”
“Ugh. Fine, what can you tell me about AI safety/risk/capabilities/economics/societal-disruption with these analyses of absolute loss?”
“Lines go straight?”
Seems to me that instead of gradualist narratives it would be preferable to say with Socrates that we are wise about scaling only in that we know we know little & about the least.
And to say it also explicitly, I think this is part of why I have trouble betting with Paul. I have a lot of ? marks on the questions that the Gwern voice is asking above, regarding them as potentially important breaks from trend that just get dumped into my generalized inbox one day. If a gradualist thinks that there ought to be a smooth graph of perplexity with respect to computing power spent, in the future, that’s something I don’t care very much about except insofar as it relates in any known way whatsoever to questions like those the Gwern voice is asking. What does it even mean to be a gradualist about any of the important questions like those of the Gwern-voice, when they don’t relate in known ways to the trend lines that are smooth? Isn’t this sort of a shell game where our surface capabilities do weird jumpy things, we can point to some trend lines that were nonetheless smooth, and then the shells are swapped and we’re told to expect gradualist AGI surface stuff? This is part of the idea that I’m referring to when I say that, even as the world ends, maybe there’ll be a bunch of smooth trendlines underneath it that somebody could look back and point out. (Which you could in fact have used to predict all the key jumpy surface thresholds, if you’d watched it all happen on a few other planets and had any idea of where jumpy surface events were located on the smooth trendlines—but we haven’t watched it happen on other planets so the trends don’t tell us much we want to know.)
It feels to me like you mostly don’t have views about the actual impact of AI as measured by jobs that it does or the $s people pay for them, or performance on any benchmarks that we are currently measuring, while I’m saying I’m totally happy to use gradualist metrics to predict any of those things. If you want to say “what does it mean to be a gradualist” I can just give you predictions on them.
To you this seems reasonable, because e.g. $ and benchmarks are not the right way to measure the kinds of impacts we care about. That’s fine, you can propose something other than $ or measurable benchmarks. If you can’t propose anything, I’m skeptical.
My basic guess is that you probably can’t effectively predict $ or benchmarks or anything else quantitative. If you actually agreed with me on all that stuff, then I might suspect that you are equivocating between a gradualist-like view that you use for making predictions about everything near term and then switching to a more bizarre perspective when talking about the future. But fortunately I think this is more straightforward, because you are basically being honest when you say that you don’t understand how the gradualist perspective makes predictions.
I kind of want to see you fight this out with Gwern (not least for social reasons, so that people would perhaps see that it wasn’t just me, if it wasn’t just me).
But it seems to me that the very obvious GPT-5 continuation of Gwern would say, “Gradualists can predict meaningless benchmarks, but they can’t predict the jumpy surface phenomena we see in real life.” We want to know when humans land on the moon, not whether their brain sizes continued on a smooth trend extrapolated over the last million years.
I think there’s a very real sense in which, yes, what we’re interested in are milestones, and often milestones that aren’t easy to define even after the fact. GPT-2 was shocking, and then GPT-3 carried that shock further in that direction, but how do you talk about that with somebody who thinks that perplexity loss is smooth? I can handwave statements like “GPT-3 started to be useful without retraining via just prompt engineering,” but qualitative statements like those aren’t good for betting, and it’s much, much harder to come up with the right milestone like that in advance, instead of looking back in your rearview mirror afterwards.
But you say—I think?—that you were less shocked by this sort of thing than I am. So, I mean, can you prophesy to us about milestones and headlines in the next five years? I think I kept thinking this during our dialogue, but never saying it, because it seemed like such an unfair demand to make! But it’s also part of the whole point that AGI and superintelligence and the world ending are all qualitative milestones like that. Whereas such trend points as Moravec was readily able to forecast correctly—like 10 teraops / plausibly-human-equivalent-computation being available in a $10 million supercomputer around 2010—are really entirely unanchored from AGI, at least relative to our current knowledge about AGI. (They would be anchored if we’d seen other planets go through this, but we haven’t.)
But it seems to me that the very obvious GPT-5 continuation of Gwern would say, “Gradualists can predict meaningless benchmarks, but they can’t predict the jumpy surface phenomena we see in real life.”
Don’t you think you’re making a falsifiable prediction here?
Name something that you consider part of the “jumpy surface phenomena” that will show up substantially before the world ends (that you think Paul doesn’t expect). Predict a discontinuity. Operationalize everything and then propose the bet.
What does it even mean to be a gradualist about any of the important questions like those of the Gwern-voice, when they don’t relate in known ways to the trend lines that are smooth?
Perplexity is one general “intrinsic” measure of language models, but there are many task-specific measures too. Studying the relationship between perplexity and task-specific measures is an important part of the research process. We shouldn’t speak as if people do not actively try to uncover these relationships.
I would generally be surprised if there were many highly non-linear relationships between perplexity and something like Winograd accuracy, human evaluation, or whatever other concrete measure you can come up with, such that the underlying behavior of the surface phenomenon is best described as a discontinuity with the past even when the latent perplexity changed smoothly. I admit the existence of some measures that exhibit these qualities (such as, potentially, the ability to do arithmetic), but I expect them to be quite a bit harder to find than the reverse.
Furthermore, it seems like if this is the crux — ie. that surface-level qualitative phenomena will experience discontinuities even while latent variables do not — then I do not understand why it’s hard to come up with bet conditions.
Can’t you just pick a surface level phenomenon that’s easy to measure and strongly interpretable in a qualitative sense — like Sensibleness and Specificity Average from the paper on Google’s chatbot — and then predict discontinuities in that metric?
(I should note that the paper shows a highly linear relationship between perplexity and Sensibleness and Specificity Average. Just look at the first plot in the PDF.)
I think that most people who work on models like GPT-3 seem more interested in trendlines than you do here.
That said, it’s not super clear to me what you are saying so I’m not sure I disagree. Your narrative sounds like a strawman since people usually extrapolate performance on downstream tasks they care about rather than on perplexity. But I do agree that the updates from GPT-3 are not from OpenAI’s marketing but instead from people’s legitimate surprise about how smart big language models seem to be.
As you say, I think the interesting claim in GPT-3 was basically that scaling trends would continue, where pessimists incorrectly expected they would break based on weak arguments. I think that looking at all the graphs, both of perplexity and performance on individual tasks, helps establish this as the story. I don’t really think this lines up with Eliezer’s picture of AGI but that’s presumably up for debate.
There are always a lot of people willing to confidently decree that trendlines will break down without much argument. (I do think that eventually the GPT-3 trendline will break if you don’t change the data, but for the boring reason that the entropy of natural language will eventually dominate the gradient noise and so lead to a predictable slowdown.)
“Everyone chose it post hoc after seeing that it worked and better BPC = better models.”
I realize your comment is in the context of a comment I also disagree with, and I also think I agree with most of what you’re saying, but I want to challenge this framing you have at the end.
BPC is at its core a continuous generalization of the Turing Test, a.k.a. the imitation game. It is not an exact translation, but it preserves all the key difficulties, and therefore keeps most of the same strengths, and it does this while extrapolating to weaker models in a useful and modelable way. We might only have started caring viscerally about the numbers that BPC gives, or associating them directly with things of huge importance, around the advent of GPT, but that’s largely just a situational byproduct of our understanding. Turing understood the importance of the imitation game back in 1950, enough to write a paper on it, and certainly that paper didn’t go unnoticed.
Nor can I see the core BPC:Turing Test correspondence as something purely post-hoc. If people didn’t give it much thought, that’s probably because there never was a scaling law then; there never was an expectation that you could just take your hacky grammar-infused Markov chain and extrapolate it out to capture more than just surface-level syntax. Even among earlier neural models, what’s the point of looking at extrapolations of a generalized Turing Test, when the models are still figuring out surface-level syntactic details? Like, is it really an indictment of BPC, to say that when we saw
the meaning of life is that only if an end would be of the whole supplier. widespread rules are regarded as the companies of refuses to deliver. in balance of the nation’s information and loan growth associated with the carrier thrifts are in the process of slowing the seed and commercial paper
we weren’t asking, ‘gee, I wonder how close this is to passing the Turing Test, by some generalized continuous measure’?
I think it’s quite surprising—importantly surprising—how it’s turned out that it actually is a relevant question, that performance on this original datapoint does actually bear some continuous mathematical relationship with models for which mere grammar is a been-there-done-that, and we now regularly test for the strength of their world models. And I get the dismissal, that it’s no proven law that it goes so far before stopping, rather than some other stretch, or that it gives no concrete conclusions for what happens at each 0.01 perplexity increment, but I look at my other passion with a straight line, hardware, and I see exactly the same argument applied to almost the same arrow-straight trendline, and I think, I’d still much rather trust the person willing to look at the plot and say, gosh, those transistors will be absurdly cheap.
Would that person have predicted today’s world, back at the start? Hell no. Knowing transistor scaling laws doesn’t directly tell you all that much about the discontinuous changes in how computation is done. You can’t look at a graph and say “at a transistor density of X, there will be the iPhone, and at a transistor density of Y, microcontrollers will get so cheap that they will start replacing simple physical switches.” It certainly will not tell you when people will start using the technology to print out tiny displays they will stick inside your glasses, or build MEMS accelerometers, nor can it tell you all of the discrete and independent innovations that overcame the challenges that got us here.
But yet, but yet, lines go straight. Moore’s Law pushed computing forward not because of these concrete individual predictions, but because it told us there was more of the same surprising progress to come, and that the well has yet to run dry. That too is why I think seeing GPT-3's perplexity is so important. I agree with you, it’s not that we need the perplexity to tell us what GPT-3 can do. GPT-3 will happily tell us that itself. And I think you will agree with me when I say that what’s most important about these trends is that they’re saying there’s more to come, that the next jump will be just as surprising as the last.
Where we maybe disagree is that I’m willing to say these lines can stand by themselves; that you don’t need to actually see anything more of GPT-3 than its perplexity to know that its capabilities must be so impressive, even if you might need to see it to feel it emotionally. You don’t even need to know anything about neural networks or their output samples to see a straight line of bits-per-character that threatens to go so low in order to forecast that something big is going on. You didn’t need to know anything about CPU microarchitecture to imagine that having ten billion transistors per square centimeter would have massive societal impacts either, as long as you knew what a transistor was and understood its fundamental relations to computation.
There’s something astonishing about seeing someone resort to explaining away GPT-3's impact as ‘OpenAI was just good at marketing the results’. Said marketing consisted of: ‘dropping a paper on Arxiv’. Not even tweeting it!
Yeah, my phrasing there was not ideal. I regret using the word “marketing”, but to be fair, I mostly meant what I said in the next few sentences: “Maybe OpenAI saw an opportunity to dump a lot of compute into language models and have a two year discontinuity ahead of everyone else, and showcase their work. And that strategy seemed to really work well for them.”
Of course, seeing that such an opportunity exists is itself laudable and I give them Bayes points for realizing that scaling laws are important. At the same time, don’t you think we would have expected similar results in like two more years at ordinary progress?
I do agree that it’s extremely interesting to know why the lines go straight. I feel like I wasn’t trying to say that GPT-3 wasn’t intrinsically interesting. I was more saying it wasn’t unpredictable, in the sense that Paul Christiano would have strongly said “no I do not expect that to happen” in 2018.
Again, the fact that it is a straight line on a metric which is, if not meaningless, extremely difficult to interpret, is irrelevant. Maybe OA moved up by 2 years. Why would anyone care in the slightest bit? That is, before they knew about how interesting the consequences would be of that small change in BPC?
At the same time, don’t you think we would have expected similar results in like two more years at ordinary progress?
Who’s ‘we’, exactly? Who are these people who expected all of this to happen, and are going around saying “ah yes, these BIG-Bench results are exactly as I calculated back in 2018, the capabilities are all emerging like clockwork, each at their assigned BPC; next is capability Z, obviously”? And what are they saying about 500b, 1000b, and so on?
I was more saying it wasn’t unpredictable, in the sense that Paul Christiano would have strongly said “no I do not expect that to happen” in 2018.
OK. So can you link me to someone saying in 2018 that we’d see GPT-2-1.5b’s behavior at ~1.5b parameters, and that we’d get few-shot metalearning and instructability past that with another OOM? And while you’re at it, if it’s so predictable, please answer all the other questions I gave, even if only the ones about scale. After all, you’re claiming it’s so easy to predict based on straight lines on convenient metrics like BPC and that there’s nothing special or unpredictable about jumping 2 years. So, please jump merely 2 years ahead and tell me what I can look forward to as the SOTA in Nov 2023; I’m dying of excitement here.
I’m confused why you think looking at the rate and lumpiness of historical progress on narrowly circumscribed performance metrics is not meaningful, because you do seem to think that drawing straight lines is fine when compute is on the x-axis—which seems like a similar exercise. What’s going on there?
Again, the fact that it is a straight line on a metric which is, if not meaningless, extremely difficult to interpret, is irrelevant. Maybe OA moved up by 2 years. Why would anyone care in the slightest bit?
Because the point I was trying to make was that the result was relatively predictable? I’m genuinely confused what you’re asking. I get a slight sense that you’re interpreting me as saying something about the inherent dullness of GPT-3 or that it doesn’t teach us anything interesting about AI, but I don’t see myself as saying anything like that. I actually really enjoy reading the output from it, your commentary on it, and what it reveals about the nature of intelligence.
I am making purely a point about predictability, and whether the result was a “discontinuity” from past progress, in the sense meant by Paul Christiano (in the way I think he means these things).
Who’s ‘we’, exactly
“We” refers in that sentence to competent observers in 2018 who predict when we’ll get ML milestones mostly by using the outside view, i.e. by extrapolating trends on charts.
OK. So can you link me to someone saying in 2018 that we’d see GPT-2-1.5b’s behavior at ~1.5b parameters, and that we’d get few-shot metalearning and instructability past that with another OOM?
No, but:
That seems like a different and far more specific question than whether we’d have language models that perform at roughly the same measured level as GPT-3.
In general, people make very few specific predictions about what they expect to happen in the future about these sorts of things (though, if I may add, I’ve been making modest progress trying to fix this broad problem by writing lots of specific questions on Metaculus).
I think what gwern is trying to say is that continuous progress on a benchmark like PTB appears (from what we’ve seen so far) to map to discontinuous progress in qualitative capabilities, in a surprising way which nobody seems to have predicted in advance. Qualitative capabilities are more relevant to safety than benchmark performance is, because while qualitative capabilities include things like “code a simple video game” and “summarize movies with emojis”, they also include things like “break out of confinement and kill everyone”. It’s the latter capability, and not PTB performance, that you’d need to predict if you wanted to reliably stay out of the x-risk regime — and the fact that we can’t currently do so is, I imagine, what brought to mind the analogy between scaling and Russian roulette.
I.e., a straight line in domain X is indeed not surprising; what’s surprising is the way in which that straight line maps to the things we care about more than X.
(Usual caveats apply here that I may be misinterpreting folks, but that is my best read of the argument.)
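As a toy illustration of that shape (everything here is invented: the linear loss trend, the logistic link, and its parameters are assumptions, not anything measured), a latent metric that improves smoothly can still drive a downstream capability that looks like a switch being flicked:

```python
import math

# Hypothetical smooth trend: loss (BPC-like) falls linearly with log-compute.
log_compute = [i / 2 for i in range(0, 13)]        # 0.0 .. 6.0, arbitrary units
loss = [1.2 - 0.1 * c for c in log_compute]        # smooth, boring straight line

# Hypothetical downstream task: accuracy is a steep logistic in the loss, so it
# sits near 0 for a long time and then jumps across a narrow band of loss values.
def task_accuracy(bpc, midpoint=0.85, steepness=40.0):
    return 1.0 / (1.0 + math.exp(steepness * (bpc - midpoint)))

for c, l in zip(log_compute, loss):
    print(f"log-compute {c:.1f}  loss {l:.2f}  task accuracy {task_accuracy(l):.3f}")
```

In this sketch the task accuracy stays near zero for half the range and then climbs from roughly 10% to roughly 90% within about one unit of log-compute, even though the loss curve never does anything interesting.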
I think what gwern is trying to say is that continuous progress on a benchmark like PTB appears (from what we’ve seen so far) to map to discontinuous progress in qualitative capabilities, in a surprising way which nobody seems to have predicted in advance.
This is a reasonable thesis, and if indeed it’s the one Gwern intended, then I apologize for missing it!
That said, I have a few objections:
Isn’t it a bit suspicious that the thing-that’s-discontinuous is hard to measure, but the-thing-that’s-continuous isn’t? I mean, this isn’t totally suspicious, because subjective experiences are often hard to pin down and explain using numbers and statistics. I can understand that, but the suspicion is still there.
“No one predicted X in advance” is only damning to a theory if people who believed that theory were making predictions about it at all. If people who generally align with Paul Christiano were indeed making predictions to the effect of GPT-3 capabilities being impossible or very unlikely within a narrow future time window, then I agree that would be damning to Paul’s worldview. But—and maybe I missed something—I didn’t see that. Did you?
There seems to be an implicit claim that Paul Christiano’s theory was falsified via failure to retrodict the data. But that’s weird, because much of the evidence being presented is mainly that the previous trends were upheld (for example, with Gwern saying, “The impact of GPT-3 was in establishing that trendlines did continue...”). But if Paul’s worldview is that “we should extrapolate trends, generally” then that piece of evidence seems like a remarkable confirmation of his theory, not a disconfirmation.
Do we actually have strong evidence that the qualitative things being mentioned were discontinuous with respect to time? I can certainly see some things being discontinuous with past progress (like the ability for GPT-3 to do arithmetic). But overall I feel like I’m being asked to believe something quite strong about GPT-3 breaking trends without actual references to what progress really looked like in the past.
I don’t deny that you can find quite a few discontinuities on a variety of metrics, especially if you search for them post-hoc. I think it would be fairly strawmanish to say that people in Paul Christiano’s camp don’t expect those at all. My impression is that they just don’t expect those to be overwhelming in a way that makes reliable reference class forecasting qualitatively useless; it seems like extrapolating from the past still gives you a lot better of a model than most available alternatives.
it seems like extrapolating from the past still gives you a lot better of a model than most available alternatives.
My impression is that some people are impressed by GPT-3′s capabilities, whereas your response is “ok, but it’s part of the straight-line trend on Penn Treebank; maybe it’s a little ahead of schedule, but nothing to write home about.” But clearly you and they are focused on different metrics!
That is, suppose it’s the case that GPT-3 is the first successfully commercialized language model. (I think in order to make this literally true you have to throw on additional qualifiers that I’m not going to look up; pretend I did that.) So on a graph of “language model of type X revenue over time”, total revenue is static at 0 for a long time and then shortly after GPT-3’s creation departs from 0.
It seems like the fact that GPT-3 could be commercialized in this way when GPT-2 couldn’t is a result of something that Penn Treebank perplexity is sort of pointing at. (That is, it’d be hard to get a model with GPT-3’s commercializability but GPT-2’s Penn Treebank score.) But what we need in order for the straight line on PTB to be useful as a model for predicting revenue is to know ahead of time what PTB threshold you need for commercialization.
And so this is where the charge of irrelevancy is coming from: yes, you can draw straight lines, but they’re straight lines in the wrong variables. In the interesting variables (from the “what’s the broader situation?” worldview), we do see discontinuities, even if there are continuities in different variables.
[As an example of the sort of story that I’d want, imagine we drew the straight line of ELO ratings for Go-bots, had a horizontal line of “human professionals” on that graph, and were able to forecast the discontinuity in “number of AI wins against human grandmasters” by looking at straight-line forecasts in ELO.]
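As a minimal sketch of that exercise (every ELO number below is made up purely for illustration; none are real Go-bot ratings):

```python
import numpy as np

# Entirely made-up ELO numbers, for illustration only -- not real Go-bot data.
years = np.array([2008.0, 2010.0, 2012.0, 2014.0])
bot_elo = np.array([2100.0, 2400.0, 2700.0, 3000.0])   # pretend straight-line progress
human_pro_elo = 3600.0                                  # pretend "human professional" line

# Fit the straight line and solve for where it crosses the human-pro level.
slope, intercept = np.polyfit(years, bot_elo, 1)
crossing_year = (human_pro_elo - intercept) / slope
print(f"Extrapolated crossing of the human-pro line: ~{crossing_year:.0f}")
```

The point of the sketch is only that the discontinuous-looking event (“bot beats human grandmaster”) is forecastable from a smooth line plus a known threshold; the hard part in practice is knowing the threshold in advance.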
That is, suppose it’s the case that GPT-3 is the first successfully commercialized language model. (I think in order to make this literally true you have to throw on additional qualifiers that I’m not going to look up; pretend I did that.) So on a graph of “language model of type X revenue over time”, total revenue is static at 0 for a long time and then shortly after GPT-3’s creation departs from 0.
I think it’s the nature of every product that comes on the market that it will experience a discontinuity from having zero revenue to having some revenue at some point. It’s an interesting question of when that will happen, and maybe your point is simply that it’s hard to predict when that will happen when you just look at the Penn Treebank trend.
However, I suspect that the revenue curve will look pretty continuous, now that it’s gone from zero to one. Do you disagree?
In a world with continuous, gradual progress across a ton of metrics, you’re going to get discontinuities from zero to one. I don’t think anyone from the Paul camp disagrees with that (in fact, Katja Grace talked about this in her article). From the continuous takeoff perspective, these discontinuities don’t seem very relevant unless going from zero to one is very important in a qualitative sense. But I would contend that going from “no revenue” to “some revenue” is not actually that meaningful in the sense of distinguishing AI from the large class of other economic products that have gradual development curves.
your point is simply that it’s hard to predict when that will happen when you just look at the Penn Treebank trend.
This is a big part of my point; a smaller elaboration is that it can be easy to trick yourself into thinking that, because you understand what will happen with PTB, you’ll understand what will happen with economics/security/etc., when in fact you don’t have much understanding of the connection between those, and there might be significant discontinuities. [To be clear, I don’t have much understanding of this either; I wish I did!]
For example, I imagine that, by thirty years from now, we’ll have language/code models that can do significant security analysis of the code that was available in 2020, and that this would have been highly relevant/valuable to people in 2020 interested in computer security. But when will this happen in the 2020-2050 range that seems likely to me? I’m pretty uncertain, and I expect this to look a lot like ‘flicking a switch’ in retrospect, even tho the leadup to flicking that switch will probably look like smoothly increasing capabilities on ‘toy’ problems.
[My current guess is that Paul / people in “Paul’s camp” would mostly agree with the previous paragraph, except for thinking that it’s sort of weird to focus on specifically AI computer security productivity, rather than the overall productivity of the computer security ecosystem, and this misplaced focus will generate the ‘flipping the switch’ impression. I think most of the disagreements are about ‘where to place the focus’, and this is one of the reasons it’s hard to find bets; it seems to me like Eliezer doesn’t care much about the lines Paul is drawing, and Paul doesn’t care much about the lines Eliezer is drawing.]
However, I suspect that the revenue curve will look pretty continuous, now that it’s gone from zero to one. Do you disagree?
I think I agree in a narrow sense and disagree in a broad sense. For this particular example, I expect OpenAI’s revenues from GPT-3 to look roughly continuous now that they’re selling/licensing it at all (until another major change happens; like, the introduction of a competitor would likely cause the revenue trend to change).
More generally, suppose we looked at something like “the total economic value of horses over the course of human history”. I think we would see mostly smooth trends plus some implied starting and stopping points for those trends. (Like, “first domestication of a horse” probably starts a positive trend, “invention of stirrups” probably starts another positive trend, “introduction of horses to America” starts another positive trend, “invention of the automobile” probably starts a negative trend that ends with “last horse that gets replaced by a tractor/car”.)
In my view, ‘understanding the world’ looks like having a causal model that you can imagine variations on (and have those imaginations be meaningfully grounded in reality), and I think the bits that are most useful for building that causal model are the starts and stops of the trends, rather than the smooth adoption curves or mostly steady equilibria in between. So it seems sort of backwards to me to say that for most of the time, most of the changes in the graph are smooth, because what I want out of the graph is to figure out the underlying generator, where the non-smooth bits are the most informative. The graph itself only seems useful as a means to that end, rather than an end in itself.
Isn’t it a bit suspicious that the thing-that’s-discontinuous is hard to measure, but the-thing-that’s-continuous isn’t? I mean, this isn’t totally suspicious, because subjective experiences are often hard to pin down and explain using numbers and statistics. I can understand that, but the suspicion is still there.
I sympathize with this view, and I agree there is some element of truth to it that may point to a fundamental gap in our understanding (or at least in mine). But I’m not sure I entirely agree that discontinuous capabilities are necessarily hard to measure: for example, there are benchmarks available for things like arithmetic, which one can train on and make quantitative statements about.
I think the key to the discontinuity question is rather that 1) it’s the jumps in model scaling that are happening in discrete increments; and 2) everything is S-curves, and a discontinuity always has a linear regime if you zoom in enough. Those two things together mean that, while a capability like arithmetic might have a continuous performance regime on some domain, in reality you can find yourself halfway up the performance curve in a single scaling jump (and this is in fact what happened with arithmetic and GPT-3). So the risk, as I understand it, is that you end up surprisingly far up the scale of “world-ending” capability from one generation to the next, with no detectable warning shot beforehand.
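To make point 2 concrete, here is a toy logistic capability curve sampled at discrete scaling jumps; the midpoint, width, and jump sizes are all invented for illustration:

```python
import math

def capability(log10_params, midpoint=10.5, width=0.3):
    """Toy logistic curve for some downstream capability vs. log10(parameter count).
    The midpoint and width are invented numbers, purely for illustration."""
    return 1.0 / (1.0 + math.exp(-(log10_params - midpoint) / width))

# Sampled finely, the curve is perfectly continuous...
for x in (10.0, 10.25, 10.5, 10.75, 11.0):
    print(f"log10(N)={x:.2f}  performance={capability(x):.2f}")

# ...but if models only get built at ~10x (one order of magnitude) jumps,
# a single generation can step over most of the transition region.
print("jump:", round(capability(9.0), 3), "->", round(capability(11.0), 3))
```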
“No one predicted X in advance” is only damning to a theory if people who believed that theory were making predictions about it at all. If people who generally align with Paul Christiano were indeed making predictions to the effect of GPT-3 capabilities being impossible or very unlikely within a narrow future time window, then I agree that would be damning to Paul’s worldview. But—and maybe I missed something—I didn’t see that. Did you?
No, you’re right as far as I know; at least I’m not aware of any such attempted predictions. And in fact, the very absence of such prediction attempts is interesting in itself. One would imagine that correctly predicting the capabilities of an AI from its scale ought to be a phenomenally valuable skill — not just from a safety standpoint, but from an economic one too. So why, indeed, didn’t we see people make such predictions, or at least try to?
There could be several reasons. For example, perhaps Paul (and other folks who subscribe to the “continuum” world-model) could have done it, but they were unaware of the enormous value of their predictive abilities. That seems implausible, so let’s assume they knew the value of such predictions would be huge. But if you know the value of doing something is huge, why aren’t you doing it? Well, if you’re rational, there’s only one reason: you aren’t doing it because it’s too hard, or otherwise too expensive compared to your alternatives. So we are forced to conclude that this world-model — by its own implied self-assessment — has, so far, proved inadequate to generate predictions about the kinds of capabilities we really care about.
(Note: you could make the argument that OpenAI did make such a prediction, in the approximate yet very strong sense that they bet big on a meaningful increase in aggregate capabilities from scale, and won. You could also make the argument that Paul, having been at OpenAI during the critical period, deserves some credit for that decision. I’m not aware of Paul ever making this argument, but if made, it would be a point in favor of such a view and against my argument above.)
Can I try to parse out what you’re saying about stacked sigmoids? Because it seems weird to me. Like, in that view, it still seems like showing a trendline is some evidence that it’s not “interesting”. I feel like this because I expect the asymptote of the AlphaGo sigmoid to be independent of MCTS bots, so surely you should see some trends where AlphaGo (or equivalent) was invented first, and jumped the trendline up really fast. So not seeing jumps should indicate that it is more a gradual progression, because otherwise, if they were independent, about half the time the more powerful technique should come first.
The “what counterargument can I come up with” part of me says, tho, that how quickly the sigmoid grows likely depends on lots of external factors (like compute available or something). So instead of sometimes seeing a sigmoid that grows twice as fast as the previous ones, you should expect one that’s not just twice as tall, but twice as wide, too. And if you have that case, you should expect the “AlphaGo was invented first” sigmoid to be under the MCTS bots sigmoid for some parts of the graph, where it then reaches the same asymptote as AlphaGo in the mainline. So, if we’re in the world where AlphaGo is invented first, you can make gains by inventing MCTS bots, which will also set the trendline. And so, seeing a jump would be less “AlphaGo was invented first” and more “MCTS bots were never invented during the long time when they would’ve outcompeted AlphaGo version −1”.
Does that seem accurate, or am I still missing something?
I find this kind of bluster pretty frustrating and condescending. I also feel like the implication is just wrong—if Eliezer and I disagree, I’d guess it’s because he’s worse at predicting ML progress. To me GPT-3 feels much (much) closer to my mainline than to Eliezer’s, and AlphaGo is very unsurprising. But it’s hard to say who was actually “caught flatfooted” unless we are willing to state some of these predictions in advance.
I got pulled into this interaction because I wanted to get Eliezer to make some real predictions, on the record, so that we could have a better version of this discussion in 5 years rather than continuing to both say “yeah, in hindsight this looks like evidence for my view.” I apologize if my tone (both in that discussion and in this comment) is a bit frustrated.
It currently feels from the inside like I’m holding the epistemic high ground on this point, though I expect Eliezer disagrees strongly:
I’m willing to bet on anything Eliezer wants, or to propose my own questions if Eliezer is willing in principle to make forecasts. I expect to outperform Eliezer on these bets and am happy to state in advance that I’d update in his direction if his predictions turned out to be as good as mine. It’s possible that we don’t have disagreements, but I doubt it. (See my other comment.)
I’m not talking this much smack based on “track records” imagined in hindsight. I think that if you want to do this then you should have been making predictions in the past, and you definitely should be willing to make predictions about the future. (I suspect you’ll often find that other people don’t disagree with the predictions that turned out to be reasonable, even if from your perspective it was all part of one coherent story.)
I wish to acknowledge this frustration, and state generally that I think Paul Christiano occupies a distinct and more clueful class than a lot of, like, early EAs who mm-hmmmed along with Robin Hanson on AI—I wouldn’t put, eg, Dario Amodei in that class either, though we disagree about other things.
But again, Paul, it’s not enough to say that you weren’t surprised by GPT-2/3 in retrospect, it kinda is important to say it in advance, ideally where other people can see? Dario picks up some credit for GPT-2/3 because he clearly called it in advance. You don’t need to find exact disagreements with me to start going on the record as a forecaster, if you think the course of the future is generally narrower than my own guesses—if you think that trends stay on course, where I shrug and say that they might stay on course or break. (Except that of course in hindsight somebody will always be able to draw a straight-line graph, once they know which graph to draw, so my statement “it might stay on trend or maybe break” applies only to graphs extrapolating into what is currently the future.)
Suppose your view is “crazy stuff happens all the time” and my view is “crazy stuff happens rarely.” (Of course “crazy” is my word, to you it’s just normal stuff.) Then what am I supposed to do, in your game?
More broadly: if you aren’t making bold predictions about the future, why do you think that other people will? (My predictions all feel boring to me.) And if you do have bold predictions, can we talk about some of them instead?
It seems to me like I want you to say “well I think 20% chance something crazy happens here” and I say “nah, that’s more like 5%” and then we batch up 5 of those and when none of them happen I get a bayes point.
I could just give my forecast. But then if I observe that 2⁄20 of them happen, how exactly does that help me in figuring out whether I should be paying more attention to your views (or help you snap out of it)?
I can list some particular past bets and future forecasts, but it’s really unclear what to do with them without quantitative numbers or a point of comparison.
Like you I’ve predicted that AI is undervalued and will grow in importance, although I think I made a much more specific prediction that investment in AI would go up a lot in the short term. This made me some money, but like you I just don’t care much about money and it’s not a game worth playing. I bet quite explicitly on deep learning by pivoting my career into practical ML and then spending years of my life working on it, despite loving theory and thinking it’s extremely important. We can debate whether the bet is good, but it was certainly a bet and by my lights it looks very reasonable in retrospect.
Over the next 10 years I think powerful ML systems will be trained mostly by imitating human behavior over short horizons, and then fine-tuned using much smaller amounts of long-horizon feedback. This has long been my prediction, and it’s why I’ve been interested in language modeling, and has informed some of my research. I think that’s still basically valid and will hold up in the future. I predict that people will explicitly collect much larger datasets of human behavior as the economic stakes rise. This is in contrast to e.g. theorem-proving working well, although I think that theorem-proving may end up being an important bellwether because it allows you to assess the capabilities of large models without multi-billion-dollar investments in training infrastructure.
I expect to see truly massive training runs in the not that distant future. I think the current rate of scaling won’t be sustained, but that over the next 10-20 years scaling will get us into human-level behavior for “short-horizon” tasks which may or may not suffice for transformative AI. I expect that to happen at model sizes within 2 orders of magnitude of the human brain on one side or the other, i.e. 1e12 to 1e16 parameters.
I could list a lot more, but I don’t think any of it seems bold and it’s not clear what the game is. It’s clearly bold by comparison to market forecasts or broader elite consensus, but so what? I understand much better how to compare one predictor to another. I mostly don’t know what it means to evaluate a predictor on an absolute scale.
Well, it sounds like I might be more bullish than you on theorem-proving, possibly. Not on it being useful or profitable, but in terms of the underlying technology making progress on non-profitable amazing demo feats. Is there anything you think it shouldn’t be able to do in the next 5 years?
I’m going to make predictions by drawing straight-ish lines through metrics like the ones in the gpt-f paper. Big unknowns are then (i) how many orders of magnitude of “low-hanging fruit” are there before theorem-proving even catches up to the rest of NLP? (ii) how hard their benchmarks are compared to other tasks we care about. On (i) my guess is maybe 2? On (ii) my guess is “they are pretty easy” / “humans are pretty bad at these tasks,” but it’s somewhat harder to quantify. If you think your methodology is different from that then we will probably end up disagreeing.
Looking towards more ambitious benchmarks, I think that the IMO grand challenge is currently significantly more than 5 years away. In 5 years’ time my median guess (with almost no thinking about it) is that automated solvers can do 10% of non-geometry, non-3-variable-inequality IMO shortlist problems.
So yeah, I’m happy to play ball in this area, and I expect my predictions to be somewhat more right than yours after the dust settles. Is there some way of measuring such that you are willing to state any prediction?
(I still feel like I’m basically looking for any predictions at all beyond sometimes saying “my model wouldn’t be surprised by <vague thing X>”, whereas I’m pretty constantly throwing out made-up guesses which I’m happy to refine with more effort. Obviously I’m going to look worse in retrospect than you if we keep up this way, though; that particular asymmetry is a lot of the reason people mostly don’t play ball. ETA: that’s a bit unfair, the romantic chatbot vs self-driving car prediction is one where we’ve both given off-the-cuff takes.)
I have a sense that there’s a lot of latent potential for theorem-proving to advance if more energy gets thrown at it, in part because current algorithms seem a bit weird to me—that we are waiting on the equivalent of neural MCTS as an enabler for AlphaGo, not just a bigger investment, though of course the key trick could already have been published in any of a thousand papers I haven’t read. I feel like I “would not be surprised at all” if we get a bunch of shocking headlines in 2023 about theorem-proving problems falling, after which the IMO challenge falls in 2024 - though of course, as events like this lie in the Future, they are very hard to predict.
Can you say more about why or whether you would, in this case, say that this was an un-Paulian set of events? As I have trouble manipulating my Paul model, it does not exclude Paul saying, “Ah, yes, well, they were using 700M models in that paper, so if you jump to 70B, of course the IMO grand challenge could fall; there wasn’t a lot of money there.” Though I haven’t even glanced at any metrics here, let alone metrics that the IMO grand challenge could be plotted on, so if smooth metrics rule out IMO in 5yrs, I am more interested yet—it legit decrements my belief, but not nearly as much as I imagine it would decrement yours.
(Edit: Also, on the meta-level, is this, like, anywhere at all near the sort of thing you were hoping to hear from me? Am I now being a better epistemic citizen, if maybe not a good one by your lights?)
Yes, IMO challenge falling in 2024 is surprising to me at something like the 1% level or maybe even more extreme (though could also go down if I thought about it a lot or if commenters brought up relevant considerations, e.g. I’d look at IMO problems and gold medal cutoffs and think about what tasks ought to be easy or hard; I’m also happy to make more concrete per-question predictions). I do think that there could be huge amounts of progress from picking the low hanging fruit and scaling up spending by a few orders of magnitude, but I still don’t expect it to get you that far.
I don’t think this is an easy prediction to extract from a trendline, in significant part because you can’t extrapolate trendlines this early that far out. So this is stress-testing different parts of my model, which is fine by me.
At the meta-level, this is the kind of thing I’m looking for, though I’d prefer to have some kind of quantitative measure of how not-surprised you are. If you are only saying 2% then we probably want to talk about things less far in your tails than the IMO challenge.
Okay, then we’ve got at least one Eliezerverse item, because I’ve said below that I think I’m at least 16% for IMO theorem-proving by end of 2025. The drastic difference here causes me to feel nervous, and my second-order estimate has probably shifted some in your direction just from hearing you put 1% on 2024, but that’s irrelevant because it’s first-order estimates we should be comparing here.
So we’ve got huge GDP increases for before-End-days signs of Paulverse and quick IMO proving for before-End-days signs of Eliezerverse? Pretty bare portfolio but it’s at least a start in both directions. If we say 5% instead of 1%, how much further would you extend the time limit out beyond 2024?
I also don’t know at all what part of your model forbids theorem-proving to fall in a shocking headline followed by another headline a year later—it doesn’t sound like it’s from looking at a graph—and I think that explaining reasons behind our predictions in advance, not just making quantitative predictions in advance, will help others a lot here.
EDIT: Though the formal IMO challenge has a barnacle about the AI being open-sourced, which is a separate sociological prediction I’m not taking on.
I think IMO gold medal could be well before massive economic impact, I’m just surprised if it happens in the next 3 years. After a bit more thinking (but not actually looking at IMO problems or the state of theorem proving) I probably want to bump that up a bit, maybe 2%, it’s hard reasoning about the tails.
I’d say <4% on end of 2025.
I think this is the flipside of me having an intuition where I say things like “AlphaGo and GPT-3 aren’t that surprising”—I have a sense for what things are and aren’t surprising, and not many things happen that are so surprising.
If I’m at 4% and you are 12% and we had 8 such bets, then I can get a factor of 2 if they all come out my way, and you get a factor of ~1.5 if one of them comes out your way.
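For anyone checking the arithmetic, here is a quick sketch of the calculation behind those factors, using the 4% / 12% / 8-bets numbers above:

```python
# Quick check of the arithmetic above: one forecaster at 4%, the other at 12%, 8 such bets.
p_low, p_high, n = 0.04, 0.12, 8

# If none of the 8 events happen, each miss favors the 4% forecaster by (1-0.04)/(1-0.12).
none_happen = ((1 - p_low) / (1 - p_high)) ** n
print(f"all 8 misses: factor of ~{none_happen:.2f} toward the 4% forecaster")   # ~2.0

# If exactly one happens, that hit favors the 12% forecaster by 0.12/0.04 = 3,
# partly cancelled by the 7 misses going the other way.
one_happens = (p_high / p_low) * ((1 - p_high) / (1 - p_low)) ** (n - 1)
print(f"one hit, seven misses: factor of ~{one_happens:.2f} toward the 12% forecaster")  # ~1.6
```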
I might think more about this and get a more coherent probability distribution, but unless I say something else by end of 2021 you can consider 4% on end of 2025 to be my prediction.
Maybe another way of phrasing this—how much warning do you expect to get, how far out does your Nope Vision extend? Do you expect to be able to say “We’re now in the ‘for all I know the IMO challenge could be won in 4 years’ regime” more than 4 years before it happens, in general? Would it be fair to ask you again at the end of 2022 and every year thereafter if we’ve entered the ‘for all I know, within 4 years’ regime?
Added: This question fits into a larger concern I have about AI soberskeptics in general (not you, the soberskeptics would not consider you one of their own) where they saunter around saying “X will not occur in the next 5 / 10 / 20 years” and they’re often right for the next couple of years, because there’s only one year where X shows up for any particular definition of that, and most years are not that year; but also they’re saying exactly the same thing up until 2 years before X shows up, if there’s any early warning on X at all. It seems to me that 2 years is about as far as Nope Vision extends in real life, for any case that isn’t completely slam-dunk; when I called upon those gathered AI luminaries to say the least impressive thing that definitely couldn’t be done in 2 years, and they all fell silent, and then a single one of them named Winograd schemas, they were right that Winograd schemas at the stated level didn’t fall within 2 years, but very barely so (they fell the year after). So part of what I’m flailingly asking here, is whether you think you have reliable and sensitive Nope Vision that extends out beyond 2 years, in general, such that you can go on saying “Not for 4 years” up until we are actually within 6 years of the thing, and then, you think, your Nope Vision will actually flash an alert and you will change your tune, before you are actually within 4 years of the thing. Or maybe you think you’ve got Nope Vision extending out 6 years? 10 years? Or maybe theorem-proving is just a special case and usually your Nope Vision would be limited to 2 years or 3 years?
This is all an extremely Yudkowskian frame on things, of course, so feel free to reframe.
I think I’ll get less confident as our accomplishments get closer to the IMO grand challenge. Or maybe I’ll get much more confident if we scale up from $1M → $1B and pick the low hanging fruit without getting fairly close, since at that point further progress gets a lot easier to predict.
There’s not really a constant time horizon for my pessimism, it depends on how long and robust a trend you are extrapolating from. 4 years feels like a relatively short horizon, because theorem-proving has not had much investment so compute can be scaled up several orders of magnitude, and there is likely lots of low-hanging fruit to pick, and we just don’t have much to extrapolate from (compared to more mature technologies, or how I expect AI will be shortly before the end of days), and for similar reasons there aren’t really any benchmarks to extrapolate.
(Also note that it matters a lot whether you know what problems labs will try to take a stab at. For the purpose of all of these forecasts, I am trying insofar as possible to set aside all knowledge about what labs are planning to do, though that’s obviously not incentive-compatible and there’s no particular reason you should trust me to do that.)
Possibly helpful: Metaculus currently puts the chances of the IMO grand challenge falling by 2025 at about 8%. Their median is 2039.
I think this would make a great bet, as it would definitely show that your model can strongly outperform a lot of people (and potentially Paul too). And the operationalization for the bet is already there—so little work will be needed to do that part.
Ha! Okay then. My probability is at least 16%, though I’d have to think more and Look into Things, and maybe ask for such sad little metrics as are available before I was confident saying how much more. Paul?
EDIT: I see they want to demand that the AI be open-sourced publicly before the first day of the IMO, which unfortunately sounds like the sort of foolish little real-world obstacle which can prevent a proposition like this from being judged true even where the technical capability exists. I’ll stand by a >16% probability of the technical capability existing by end of 2025, as reported on eg solving a non-trained/heldout dataset of past IMO problems, conditional on such a dataset being available; I frame no separate sociological prediction about whether somebody is willing to open-source the AI model that does it.
I don’t care about whether the AI is open-sourced (I don’t expect anyone to publish the weights even if they describe their method) and I’m not that worried about our ability to arbitrate overfitting.
Ajeya suggested that I clarify: I’m significantly more impressed by an AI getting a gold medal than getting a bronze, and my 4% probability is for getting a gold in particular (as described in the IMO grand challenge). There are some categories of problems that can be solved using easy automation (I’d guess about 5-10% could be done with no deep learning and modest effort). Together with modest progress in deep learning based methods, and a somewhat serious effort, I wouldn’t be surprised by people getting up to 20-40% of problems. The bronze cutoff is usually 3⁄6 problems, and the gold cutoff is usually 5⁄6 (assuming the AI doesn’t get partial credit). The difficulty of problems also increases very rapidly for humans—there are often 3 problems that a human can do more-or-less mechanically.
I could tighten any of these estimates by looking at the distribution more carefully rather than going off of my recollections from 2008, and if this was going to be one of a handful of things we’d bet about I’d probably spend a few hours doing that and some other basic digging.
I looked at a few recent IMOs to get better calibrated. I think the main update is that I significantly underestimated how many years you can get a gold with only 4⁄6 problems.
For example I don’t have the same “this is impossible” reaction about IMO 2012 or IMO 2015 as about most years. That said, I feel like they do have to get reasonably lucky with the IMO content, and someone has to make a serious and mostly-successful effort, but I’m at least a bit scared by that. There’s also quite often a geo problem as 3 or 6.
Might be good to make some side bets:
Conditioned on winning I think it’s only maybe 20% probability to get all 6 problems (whereas I think you might have a higher probability on jumping right past human level, or at least have 50% on 6 vs 5?).
Conditioned on a model getting 3+ problems I feel like we have a pretty good guess about what algorithm will be SOTA on this problem (e.g. I’d give 50% to a pretty narrow class of algorithms with some uncertain bells and whistles, with no inside knowledge). Whereas I’d guess you have a much broader distribution.
But more useful to get other categories of bets. (Maybe in programming, investment in AI, economic impact from robotics, economic impact from chatbots, translation?)
Going through the previous ten IMOs, and imagining a very impressive automated theorem prover, I think:
2020 - unlikely, need 5⁄6 and probably can’t get problems 3 or 6. Also good chance to mess up at 4 or 5
2019 - tough but possible, 3 seems hard but even that is not unimaginable, 5 might be hard but might be straightforward, and it can afford to get one wrong
2018 - tough but possible, 3 is easier for machine than human but probably still hard, 5 may be hard, can afford to miss one
2017 - tough but possible, 3 looks out of reach, 6 looks hard but not sure about that, 5 looks maybe hard, 1 is probably easy. But it can miss 2, which could happen.
2016 - probably not possible, 3 and 6 again look hard, and good chance to fail on 2 and 5, only allowed to miss 1
2015 - seems possible, 3 might be hard but like 50-50 it’s simple for machine, 6 is probably hard, but you can miss 2
2014 - probably not possible, can only miss 1, probably miss one of 2 or 5 and 6
2013 - probably not possible, 6 seems hard, 2 seems very hard, can only miss 1
2012 - tough but possible, 6 and 3 look hard but you can miss 2
2011 - seems possible, allowed to miss two and both 3 and 6 look brute-forceable
Overall this was much easier than I expected. 4⁄10 seem unlikely, 4⁄10 seem tough but possible, 2⁄10 I can imagine a machine doing it. There are a lot of problems that look really hard, but there are a fair number of tests where you can just skip those.
That said, even to get the possible ones you do need to be surprisingly impressive, and that’s getting cut down by like 25-50% for a solvable test. Still, they get to keep trying (assuming they get promising results in early years) and eventually they will hit one of the easier years.
It also looks fairly likely to me that if one of DeepMind or OpenAI tries seriously they will be able to get an HM with a quite reasonable chance at bronze, and this is maybe enough of a PR coup to motivate work, and then it’s more likely there will be a large effort subsequently to finish the job or to opportunistically take advantage of an easy test.
Overall I’m feeling bad about my 4%, I deserve to lose some points regardless but might think about what my real probability is after looking at tests (though I was also probably moved by other folks in EA systematically giving higher estimates than I did).
What do you think of Deepmind’s new whoop-de-doo about doing research-level math assisted by GNNs?
Not surprising in any of the ways that good IMO performance would be surprising.
Based on the other thread I now want to revise this prediction, both because 4% was too low and “IMO gold” has a lot of noise in it based on test difficulty.
I’d put 4% on “For the 2022, 2023, 2024, or 2025 IMO an AI built before the IMO is able to solve the single hardest problem” where “hardest problem” = “usually problem #6, but use problem #3 instead if either: (i) problem 6 is geo or (ii) problem 3 is combinatorics and problem 6 is algebra.” (Would prefer just pick the hardest problem after seeing the test but seems better to commit to a procedure.)
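For concreteness, here is the stated selection rule written out as a small hypothetical helper function; the topic labels passed in are purely illustrative:

```python
def hardest_problem(topics):
    """Hypothetical helper encoding the selection rule above.
    `topics` maps problem number -> topic string, e.g. {3: "combinatorics", 6: "algebra"}."""
    if topics.get(6) == "geometry":
        return 3
    if topics.get(3) == "combinatorics" and topics.get(6) == "algebra":
        return 3
    return 6

print(hardest_problem({3: "number theory", 6: "geometry"}))   # -> 3 (condition i)
print(hardest_problem({3: "combinatorics", 6: "algebra"}))    # -> 3 (condition ii)
print(hardest_problem({3: "geometry", 6: "number theory"}))   # -> 6 (default)
```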
Maybe I’ll go 8% on “gets gold” instead of “solves hardest problem.”
Would be good to get your updated view on this so that we can treat it as staked out predictions.
(News: OpenAI has built a theorem-prover that solved many AMC12 and AIME competition problems, and 2 IMO problems, and they say they hope this leads to work that wins the IMO Grand Challenge.)
It feels like this bet would look a lot better if it were about something that you predict at well over 50% (with people in Paul’s camp still maintaining less than 50%). So, we could perhaps modify the terms such that the bot would only need to surpass a certain rank or percentile-equivalent in the competition (and not necessarily receive the equivalent of a Gold medal).
The relevant question is which rank/percentile you think is likely to be attained by 2025 under your model but you predict would be implausible under Paul’s model. This may be a daunting task, but one way to get started is to put a probability distribution over what you think the state-of-the-art will look like by 2025, and then compare to Paul’s.
Edit: Here are, for example, the individual rankings for 2021: https://www.imo-official.org/year_individual_r.aspx?year=2021
I expect it to be hella difficult to pick anything where I’m at 75% that it happens in the next 5 years and Paul is at 25%. Heck, it’s not easy to find things where I’m at over 75% that aren’t just obvious slam dunks; the Future isn’t that easy to predict. Let’s get up to a nice crawl first, and then maybe a small portfolio of crawlings, before we start trying to make single runs that pierce the sound barrier.
I frame no prediction about whether Paul is under 16%. That’s a separate matter. I think a little progress is made toward eventual epistemic virtue if you hand me a Metaculus forecast and I’m like “lol wut” and double their probability, even if it turns out that Paul agrees with me about it.
My model of Eliezer may be wrong, but I’d guess that this isn’t a domain where he has many over-50% predictions of novel events at all? See also ‘I don’t necessarily expect self-driving cars before the apocalypse’.
My Eliezer-model has a more flat prior over what might happen, which therefore includes stuff like ‘maybe we’ll make insane progress on theorem-proving (or whatever) out of the blue’. Again, I may be wrong, but my intuition is that you’re Paul-omorphizing Eliezer when you assume that >16% probability of huge progress in X by year Y implies >50% probability of smaller-but-meaningful progress in X by year Y.
(Ah, EY already replied.)
If this task is bad for operationalization reasons, there are other theorem proving benchmarks. Unfortunately it looks like there aren’t a lot of people that are currently trying to improve on the known benchmarks, as far as I’m aware.
The code generation benchmarks are slightly more active. I’m personally partial to Hendrycks et al.’s APPS benchmark, which includes problems that “range in difficulty from introductory to collegiate competition level and measure coding and problem-solving ability.” (Github link).
I think Metaculus is closer to Eliezer here: conditioned on this problem being resolved it seems unlikely for the AI to be either open-sourced or easily reproducible.
My honest guess is that most predictors didn’t see that condition and the distribution would shift right if someone pointed that out in the comments.
To add to this sentiment, I’ll post the graph from my notebook on language model progress. I refer to the Penn Treebank task a lot when making this point because it seems to have a lot of good data, but you can also look at the other tasks and see basically the same thing.
The last dip in the chart is from GPT-3. It looks like GPT-3 was indeed a discontinuity in progress but not a very shocking one. It would have taken roughly one or two more years at ordinary progress to get to that point anyway—which I just don’t see as being all that impressive.
I sorta feel like the main reason why lots of people found GPT-3 so impressive was because OpenAI was just good at marketing the results [ETA: sorry, I take back the use of the word “marketing”]. Maybe OpenAI saw an opportunity to dump a lot of compute into language models and have a two year discontinuity ahead of everyone else, and showcase their work. And that strategy seemed to work really well for them. I admit this is an uncharitable explanation, but is there a better story to tell about why GPT-3 captured so much attention?
The impact of GPT-3 had nothing whatsoever to do with its perplexity on Penn Treebank. I think this is a good example of why focusing on perplexity and ‘straight lines on graph go brr’ is so terrible, such cargo cult mystical thinking, and crippling. There’s something astonishing about seeing someone resort to explaining away GPT-3’s impact as ‘OpenAI was just good at marketing the results’. Said marketing consisted of: ‘dropping a paper on Arxiv’. Not even tweeting it! They didn’t even tweet the paper! (Forget an OA blog post, accompanying NYT/TR articles, tweets by everyone at OA, a fancy interactive interface—none of that.) And most of the initial reaction was “GPT-3: A Disappointing Paper”-style. If this is marketing genius, then it is truly 40-d chess, is all I can say.
The impact of GPT-3 was in establishing that trendlines did continue in a way that shocked pretty much everyone who’d written off ‘naive’ scaling strategies. Progress is made out of stacked sigmoids: if the next sigmoid doesn’t show up, progress doesn’t happen. Trends happen, until they stop. Trendlines are not caused by the laws of physics. You can dismiss AlphaGo by saying “oh, that just continues the trendline in ELO I just drew based on MCTS bots”, but the fact remains that MCTS progress had stagnated, and here we are in 2021, and pure MCTS approaches do not approach human champions, much less beat them. (This is also true of SVMs. Notice SVMs solving ImageNet because the trendlines continued? No, of course you did not. It drives me bonkers to see AI Impacts etc make arguments like “deep learning is unimportant because look, ImageNet follows a trendline”. Sheer numerology.) Appealing to trendlines is roughly as informative as “calories in calories out”; ‘the trend continued because the trend continued’. A new sigmoid being discovered is extremely important.
GPT-3 further showed completely unpredicted emergence of capabilities across downstream tasks which are not measured in PTB perplexity. There is nothing obvious about a PTB BPC of 0.80 that causes it to be useful where 0.90 is largely useless and 0.95 is a laughable toy. (OAers may have had faith in scaling, but they could not have told you in 2015 that interesting behavior would start at 𝒪(1b), and it’d get really cool at 𝒪(100b).) That’s why it’s such a useless metric. There’s only one thing that a PTB perplexity can tell you, under the pretraining paradigm: when you have reached human AGI level. (Which is useless for obvious reasons: much like saying that “if you hear the revolver click, the bullet wasn’t in that chamber and it was safe”. Surely true, but a bit late.) It tells you nothing about intermediate levels. I’m reminded of the Steven Kaas line:
Using PTB, and talking only about perplexity, is a precise answer to the wrong question. (This is a much better argument when it comes to AlphaGo/ELO, because at least there, ‘ELO’ is in fact the ultimate objective, and not a proxy pretext. But perplexity is of no interest to anyone except an information theorist. Unfortunately, we lack any ‘take-over-the-world-ELO’ we can benchmark models on and extrapolate there. If we did and there was a smooth curve, I would indeed agree that we should adopt that as the baseline. But the closest things we have to downstream tasks are all wildly jumpy—even superimposing scores of downstream tasks barely gives you a recognizable smooth curve, and certainly nothing remotely as smooth as the perplexity curve. My belief is that this is because the overall perplexity curve comes from hundreds or thousands of stacked sigmoids and plateau/breakthroughs averaging out in terms of prediction improvements.) It sure would be convenient if the only number that mattered in AI or its real-world impact or risk was also the single easiest one to measure!
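As a toy illustration of that stacked-sigmoids picture (every constant below is made up): averaging many capability sigmoids with scattered onsets gives a smooth, nearly straight aggregate curve, even though each individual component is jumpy.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)            # stand-in for log-compute (arbitrary units)

# 500 individual "capabilities", each a sigmoid with a random onset and sharpness.
# All numbers are invented; only the qualitative shape is the point.
onsets = rng.uniform(0.0, 10.0, size=500)
widths = rng.uniform(0.05, 0.5, size=500)
sigmoids = 1.0 / (1.0 + np.exp(-(x[:, None] - onsets) / widths))

aggregate = sigmoids.mean(axis=1)   # smooth, nearly straight through the middle
single = sigmoids[:, 0]             # any one capability still looks like a jump

print("aggregate:", np.round(aggregate[::40], 2))
print("one capability:", np.round(single[::40], 2))
```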
I emphasized this poverty of extrapolation in my scaling hypothesis writeup already, but permit me to vent a little more here:
“So, you’re forecasting AI progress using PTB perplexity/BPC. Cool, good work, nice notebook, surely this must be useful for forecasting on substantive AI safety/capability questions of interest to us. I see it’s a pretty straight line on a graph. OK, can you tell me at what BPC a large language model could do stuff like hack computers and escape onto the Internet?”
“No. I can tell you what happens if I draw the line out x units, though.”
“Perhaps that’s an unfairly specific question to ask, as important as it is. OK, can you tell me when we can expect to see well-known benchmarks like Winograd schemas be solved?”
“No. I can draw you a line on PTB to estimate when PTB is solved, though, if you give me a second and define a bound for ‘PTB is solved’.”
“Hm. Can you at least tell me when we can expect to see meta-learning emerge, with good few-shot learning—does the graph predict 0.1b, 1b, 10b, 100b, or what?”
“No idea.”
“Do you know what capabilities will be next to emerge? We got pretty good programming performance in Copilot at 𝒪(100b), what’s next?”
“I don’t know.”
“Can you qualitatively describe what we’d get at 1t, or 10t?”
“No, but I can draw the line in perplexity. It gets pretty low.”
“How about the existence of any increasing returns to scale in downstream tasks? Does it tell us anything about spikes in capabilities (such as we observe in many places, such as text style transfer and inner monologue in LaMDA at 100b; most recently BIG-bench)? Such as whether there are any more spikes past 𝒪(100b), whether we’ll see holdouts like causality suddenly fall at 𝒪(1000b), anything like that?”
“No.”
“How about RL: what sort of world modeling can we get by plugging them into DRL agents?”
“I don’t know.”
“Fine, let’s leave it at tool AIs doing text in text out. Can you tell me how much economic value will be driven by dropping another 0.01 BPC?”
“No. I can tell you how much it’d cost in GPU-time, though, by the awesome power of drawing lines!”
“OK, how about that: how low does it need to go to support a multi-billion dollar company running something like the OA API, to defray the next 0.01 drop and pay for the GPU-time to get more drops?”
“No idea.”
“How do you know BPC is the right metric to use?”
“Oh, we have lots of theories about it, but I’ll level with you: we always have theories for everything, but really, we chose BPC post hoc out of a few thousand metrics littering Arxiv like BLEU, ROUGE, SSA etc after seeing that it worked and better BPC = better models.”
“Can you write down your predictions about any of this?”
“Absolutely not.”
“Can anyone else?”
“Sure. But they’re all terribly busy.”
“Did you write down your predictions before now, then?”
“Oh gosh no, I wasn’t around then.”
“Did… someone… else… write down their predictions before?”
“Not that I’m aware of.”
“Ugh. Fine, what can you tell me about AI safety/risk/capabilities/economics/societal-disruption with these analyses of absolute loss?”
“Lines go straight?”
Seems to me that instead of gradualist narratives it would be preferable to say with Socrates that we are wise about scaling only in that we know we know little & about the least.
And to say it also explicitly, I think this is part of why I have trouble betting with Paul. I have a lot of ? marks on the questions that the Gwern voice is asking above, regarding them as potentially important breaks from trend that just get dumped into my generalized inbox one day. If a gradualist thinks that there ought to be a smooth graph of perplexity with respect to computing power spent, in the future, that’s something I don’t care very much about except insofar as it relates in any known way whatsoever to questions like those the Gwern voice is asking. What does it even mean to be a gradualist about any of the important questions like those of the Gwern-voice, when they don’t relate in known ways to the trend lines that are smooth? Isn’t this sort of a shell game where our surface capabilities do weird jumpy things, we can point to some trend lines that were nonetheless smooth, and then the shells are swapped and we’re told to expect gradualist AGI surface stuff? This is part of the idea that I’m referring to when I say that, even as the world ends, maybe there’ll be a bunch of smooth trendlines underneath it that somebody could look back and point out. (Which you could in fact have used to predict all the key jumpy surface thresholds, if you’d watched it all happen on a few other planets and had any idea of where jumpy surface events were located on the smooth trendlines—but we haven’t watched it happen on other planets so the trends don’t tell us much we want to know.)
This seems totally bogus to me.
It feels to me like you mostly don’t have views about the actual impact of AI as measured by jobs that it does or the $s people pay for them, or performance on any benchmarks that we are currently measuring, while I’m saying I’m totally happy to use gradualist metrics to predict any of those things. If you want to say “what does it mean to be a gradualist” I can just give you predictions on them.
To you this seems reasonable, because e.g. $ and benchmarks are not the right way to measure the kinds of impacts we care about. That’s fine, you can propose something other than $ or measurable benchmarks. If you can’t propose anything, I’m skeptical.
My basic guess is that you probably can’t effectively predict $ or benchmarks or anything else quantitative. If you actually agreed with me on all that stuff, then I might suspect that you are equivocating between a gradualist-like view that you use for making predictions about everything near term and then switching to a more bizarre perspective when talking about the future. But fortunately I think this is more straightforward, because you are basically being honest when you say that you don’t understand how the gradualist perspective makes predictions.
I kind of want to see you fight this out with Gwern (not least for social reasons, so that people would perhaps see that it wasn’t just me, if it wasn’t just me).
But it seems to me that the very obvious GPT-5 continuation of Gwern would say, “Gradualists can predict meaningless benchmarks, but they can’t predict the jumpy surface phenomena we see in real life.” We want to know when humans land on the moon, not whether their brain sizes continued on a smooth trend extrapolated over the last million years.
I think there’s a very real sense in which, yes, what we’re interested in are milestones, and often milestones that aren’t easy to define even after the fact. GPT-2 was shocking, and then GPT-3 carried that shock further in that direction, but how do you talk about that with somebody who thinks that perplexity loss is smooth? I can handwave statements like “GPT-3 started to be useful without retraining via just prompt engineering” but qualitative statements like those aren’t good for betting, and it’s much much harder to come up with the right milestone like that in advance, instead of looking back in your rearview mirror afterwards.
But you say—I think?—that you were less shocked by this sort of thing than I am. So, I mean, can you prophesy to us about milestones and headlines in the next five years? I think I kept thinking this during our dialogue, but never saying it, because it seemed like such an unfair demand to make! But it’s also part of the whole point that AGI and superintelligence and the world ending are all qualitative milestones like that. Whereas such trend points as Moravec was readily able to forecast correctly—like 10 teraops / plausibly-human-equivalent-computation being available in a $10 million supercomputer around 2010—are really entirely unanchored from AGI, at least relative to our current knowledge about AGI. (They would be anchored if we’d seen other planets go through this, but we haven’t.)
Don’t you think you’re making a falsifiable prediction here?
Name something that you consider part of the “jumpy surface phenomena” that will show up substantially before the world ends (that you think Paul doesn’t expect). Predict a discontinuity. Operationalize everything and then propose the bet.
(I’m currently slightly hopeful about the theorem-proving thread, elsewhere and upthread.)
Perplexity is one general “intrinsic” measure of language models, but there are many task-specific measures too. Studying the relationship between perplexity and task-specific measures is an important part of the research process. We shouldn’t speak as if people do not actively try to uncover these relationships.
I would generally be surprised if there were many highly non-linear relationships between perplexity and something like Winograd accuracy, human evaluation, or whatever other concrete measure you can come up with, such that the underlying behavior of the surface phenomenon is best described as a discontinuity with the past even when the latent perplexity changed smoothly. I admit the existence of some measures that exhibit these qualities (such as, potentially, the ability to do arithmetic), but I expect them to be quite a bit harder to find than the reverse.
Furthermore, it seems like if this is the crux — i.e. that surface-level qualitative phenomena will experience discontinuities even while latent variables do not — then I do not understand why it’s hard to come up with bet conditions.
Can’t you just pick a surface level phenomenon that’s easy to measure and strongly interpretable in a qualitative sense — like Sensibleness and Specificity Average from the paper on Google’s chatbot — and then predict discontinuities in that metric?
(I should note that the paper shows a highly linear relationship between perplexity and Sensibleness and Specificity Average. Just look at the first plot in the PDF.)
Well put / endorsed / +1.
I think that most people who work on models like GPT-3 seem more interested in trendlines than you do here.
That said, it’s not super clear to me what you are saying so I’m not sure I disagree. Your narrative sounds like a strawman since people usually extrapolate performance on downstream tasks they care about rather than on perplexity. But I do agree that the updates from GPT-3 are not from OpenAI’s marketing but instead from people’s legitimate surprise about how smart big language models seem to be.
As you say, I think the interesting claim in GPT-3 was basically that scaling trends would continue, where pessimists incorrectly expected they would break based on weak arguments. I think that looking at all the graphs, both of perplexity and performance on individual tasks, helps establish this as the story. I don’t really think this lines up with Eliezer’s picture of AGI but that’s presumably up for debate.
There are always a lot of people willing to confidently decree that trendlines will break down without much argument. (I do think that eventually the GPT-3 trendline will break if you don’t change the data, but for the boring reason that the entropy of natural language will eventually dominate the gradient noise and so lead to a predictable slowdown.)
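For concreteness, here is a toy sketch of the usual scaling-law shape with an irreducible entropy term; the constants are made up and only the qualitative behavior matters:

```python
def toy_loss(compute, irreducible=0.6, coeff=5.0, alpha=0.05):
    """Toy scaling-law shape L(C) = E + A * C**(-alpha).
    All constants here are invented; only the qualitative behavior matters."""
    return irreducible + coeff * compute ** (-alpha)

for c in (1e3, 1e6, 1e9, 1e12, 1e15):
    print(f"compute={c:.0e}  loss={toy_loss(c):.3f}")
# The reducible part keeps falling as a straight line on a log-log plot,
# but the raw loss flattens out as it approaches the entropy floor E.
```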
I realize your comment is in the context of a comment I also disagree with, and I also think I agree with most of what you’re saying, but I want to challenge this framing you have at the end.
BPC is at its core a continuous generalization of the Turing Test, aka. the imitation game. It is not an exact translation, but it preserves all the key difficulties, and therefore keeps most of its same strengths, and it does this while extrapolating to weaker models in a useful and modelable way. We might only have started caring viscerally about the numbers that BPC gives, or associating them directly to things of huge importance, around the advent of GPT, but that’s largely just a situational byproduct of our understanding. Turing understood the importance of the imitation game back in 1950, enough to write a paper on it, and certainly that paper didn’t go unnoticed.
Nor can I see the core BPC:Turing Test correspondence as something purely post-hoc. If people didn’t give it much thought, that’s probably because there never was a scaling law then, there never was an expectation that you could just take your hacky grammar-infused Markov chain and extrapolate it out to capture more than just surface level syntax. Even among earlier neural models, what’s the point of looking at extrapolations of a generalized Turing Test, when the models are still figuring out surface level syntactic details? Like, is it really an indictment of BPC, to say that when we saw
we weren’t asking, ‘gee, I wonder how close this is to passing the Turing Test, by some generalized continuous measure’?
I think it’s quite surprising—importantly surprising—how it’s turned out that it actually is a relevant question, that performance on this original datapoint does actually bear some continuous mathematical relationship with models for which mere grammar is a been-there-done-that, and we now regularly test for the strength of their world models. And I get the dismissal, that it’s no proven law that it goes so far before stopping, rather than some other stretch, or that it gives no concrete conclusions for what happens at each 0.01 perplexity increment, but I look at my other passion with a straight line, hardware, and I see exactly the same argument applied to almost the same arrow-straight trendline, and I think, I’d still much rather trust the person willing to look at the plot and say, gosh, those transistors will be absurdly cheap.
Would that person have predicted today, back at the start? Hell no. Knowing transistor scaling laws doesn’t directly tell you all that much about the discontinuous changes in how computation is done. You can’t look at a graph and say “at a transistor density of X, there will be the iPhone, and at a transistor density of Y, microcontrollers will get so cheap that they will start replacing simple physical switches.” It certainly will not tell you when people will start using the technology to print out tiny displays they will stick inside your glasses, or build MEMS accelerometers, nor can it tell you all of the discrete and independent innovations that overcame the challenges that got us here.
But yet, but yet, lines go straight. Moore’s Law pushed computing forward not because of these concrete individual predictions, but because it told us there was more of the same surprising progress to come, and that the well has yet to run dry. That too is why I think seeing GPT-3’s perplexity is so important. I agree with you, it’s not that we need the perplexity to tell us what GPT-3 can do. GPT-3 will happily tell us that itself. And I think you will agree with me when I say that what’s most important about these trends is that they’re saying there’s more to come, that the next jump will be just as surprising as the last.
Where we maybe disagree is that I’m willing to say these lines can stand by themselves; that you don’t need to actually see anything more of GPT-3 than its perplexity to know that its capabilities must be so impressive, even if you might need to see it to feel it emotionally. You don’t even need to know anything about neural networks or their output samples to see a straight line of bits-per-character that threatens to go so low in order to forecast that something big is going on. You didn’t need to know anything about CPU microarchitecture to imagine that having ten billion transistors per square centimeter would have massive societal impacts either, as long as you knew what a transistor was and understood its fundamental relations to computation.
Yeah, my phrasing there was not ideal. I regret using the word “marketing”, but to be fair, I mostly meant what I said in the next few sentences: “Maybe OpenAI saw an opportunity to dump a lot of compute into language models and have a two year discontinuity ahead of everyone else, and showcase their work. And that strategy seemed to work really well for them.”
Of course, seeing that such an opportunity exists is itself laudable and I give them Bayes points for realizing that scaling laws are important. At the same time, don’t you think we would have expected similar results in like two more years at ordinary progress?
I do agree that it’s extremely interesting to know why the lines go straight. I feel like I wasn’t trying to say that GPT-3 wasn’t intrinsically interesting. I was more saying it wasn’t unpredictable, in the sense that Paul Christiano would have strongly said “no I do not expect that to happen” in 2018.
Again, the fact that it is a straight line on a metric which is, if not meaningless, then extremely difficult to interpret, is irrelevant. Maybe OA moved things up by 2 years. Why would anyone care in the slightest? That is, before they knew how interesting the consequences of that small change in BPC would be?
Who’s ‘we’, exactly? Who are these people who expected all of this to happen, and are going around saying “ah yes, these BIG-Bench results are exactly as I calculated back in 2018, the capabilities are all emerging like clockwork, each at their assigned BPC; next is capability Z, obviously”? And what are they saying about 500b, 1000b, and so on?
OK. So can you link me to someone saying in 2018 that we’d see GPT-2-1.5b’s behavior at ~1.5b parameters, and that we’d get few-shot metalearning and instructability past that with another OOM? And while you’re at it, if it’s so predictable, please answer all the other questions I gave, even if only the ones about scale. After all, you’re claiming it’s so easy to predict based on straight lines on convenient metrics like BPC, and that there’s nothing special or unpredictable about jumping 2 years ahead. So please jump merely 2 years ahead and tell me what I can look forward to as the SOTA in Nov 2023; I’m dying of excitement here.
I’m confused why you think looking at the rate and lumpiness of historical progress on narrowly circumscribed performance metrics is not meaningful, because you do seem to think that drawing straight lines is fine when compute is on the x-axis, which seems like a similar exercise. What’s going on there?
Because the point I was trying to make was that the result was relatively predictable? I’m genuinely confused what you’re asking. I get a slight sense that you’re interpreting me as saying something about the inherent dullness of GPT-3 or that it doesn’t teach us anything interesting about AI, but I don’t see myself as saying anything like that. I actually really enjoy reading the output from it, your commentary on it, and what it reveals about the nature of intelligence.
I am making purely a point about predictability, and whether the result was a “discontinuity” from past progress, in the sense meant by Paul Christiano (in the way I think he means these things).
“We” refers in that sentence to competent observers in 2018 who predict when we’ll get ML milestones mostly by using the outside view, i.e. by extrapolating trends on charts.
No, but that seems like a different and far more specific question than whether we’d have language models that perform at roughly the same measured level as GPT-3.
In general, people make very few specific predictions about what they expect to happen in the future about these sorts of things (though, if I may add, I’ve been making modest progress trying to fix this broad problem by writing lots of specific questions on Metaculus).
I think what gwern is trying to say is that continuous progress on a benchmark like PTB appears (from what we’ve seen so far) to map to discontinuous progress in qualitative capabilities, in a surprising way which nobody seems to have predicted in advance. Qualitative capabilities are more relevant to safety than benchmark performance is, because while qualitative capabilities include things like “code a simple video game” and “summarize movies with emojis”, they also include things like “break out of confinement and kill everyone”. It’s the latter capability, and not PTB performance, that you’d need to predict if you wanted to reliably stay out of the x-risk regime — and the fact that we can’t currently do so is, I imagine, what brought to mind the analogy between scaling and Russian roulette.
I.e., a straight line in domain X is indeed not surprising; what’s surprising is the way in which that straight line maps to the things we care about more than X.
(Usual caveats apply here that I may be misinterpreting folks, but that is my best read of the argument.)
This is a reasonable thesis, and if indeed it’s the one Gwern intended, then I apologize for missing it!
That said, I have a few objections:
Isn’t it a bit suspicious that the thing-that’s-discontinuous is hard to measure, but the-thing-that’s-continuous isn’t? I mean, this isn’t totally suspicious, because subjective experiences are often hard to pin down and explain using numbers and statistics. I can understand that, but the suspicion is still there.
“No one predicted X in advance” is only damning to a theory if people who believed that theory were making predictions about it at all. If people who generally align with Paul Christiano were indeed making predictions to the effect of GPT-3 capabilities being impossible or very unlikely within a narrow future time window, then I agree that would be damning to Paul’s worldview. But—and maybe I missed something—I didn’t see that. Did you?
There seems to be an implicit claim that Paul Christiano’s theory was falsified via failure to retrodict the data. But that’s weird, because much of the evidence being presented is mainly that the previous trends were upheld (for example, with Gwern saying, “The impact of GPT-3 was in establishing that trendlines did continue...”). But if Paul’s worldview is that “we should extrapolate trends, generally” then that piece of evidence seems like a remarkable confirmation of his theory, not a disconfirmation.
Do we actually have strong evidence that the qualitative things being mentioned were discontinuous with respect to time? I can certainly see some things being discontinuous with past progress (like the ability for GPT-3 to do arithmetic). But overall I feel like I’m being asked to believe something quite strong about GPT-3 breaking trends without actual references to what progress really looked like in the past.
I don’t deny that you can find quite a few discontinuities on a variety of metrics, especially if you search for them post-hoc. I think it would be fairly strawmanish to say that people in Paul Christiano’s camp don’t expect those at all. My impression is that they just don’t expect those to be overwhelming in a way that makes reliable reference class forecasting qualitatively useless; it seems like extrapolating from the past still gives you a lot better of a model than most available alternatives.
My impression is that some people are impressed by GPT-3’s capabilities, whereas your response is “ok, but it’s part of the straight-line trend on Penn Treebank; maybe it’s a little ahead of schedule, but nothing to write home about.” But clearly you and they are focused on different metrics!
That is, suppose it’s the case that GPT-3 is the first successfully commercialized language model. (I think in order to make this literally true you have to throw on additional qualifiers that I’m not going to look up; pretend I did that.) So on a graph of “language model of type X revenue over time”, total revenue is static at 0 for a long time and then shortly after GPT-3’s creation departs from 0.
It seems like the fact that GPT-3 could be commercialized in this way when GPT-2 couldn’t is a result of something that Penn Treebank perplexity is sort of pointing at. (That is, it’d be hard to get a model with GPT-3’s commercializability but GPT-2’s Penn Treebank score.) But what we need in order for the straight line on PTB to be useful as a model for predicting revenue is to know ahead of time what PTB threshold you need for commercialization.
And so this is where the charge of irrelevancy is coming from: yes, you can draw straight lines, but they’re straight lines in the wrong variables. In the interesting variables (from the “what’s the broader situation?” worldview), we do see discontinuities, even if there are continuities in different variables.
[As an example of the sort of story that I’d want, imagine we drew the straight line of ELO ratings for Go-bots, had a horizontal line of “human professionals” on that graph, and were able to forecast the discontinuity in “number of AI wins against human grandmasters” by looking at straight-line forecasts in ELO.]
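[Continuing the hypothetical, the exercise would look something like the sketch below; every Elo number is invented purely to illustrate reading a crossing date off a straight-line fit, and none of it reflects the actual history of Go engines.]

```python
import numpy as np

# Made-up yearly Elo ratings for the best Go program; the numbers are not real,
# they only illustrate the forecasting exercise described above.
years = np.array([2008, 2009, 2010, 2011, 2012, 2013, 2014])
elos  = np.array([2100, 2250, 2380, 2540, 2660, 2810, 2950])

HUMAN_PRO_ELO = 3600  # assumed rating of top human professionals

# Fit a straight line to the continuous metric (Elo over time)...
slope, intercept = np.polyfit(years, elos, 1)

# ...and read off when it crosses the human line, i.e. forecast the discontinuity
# in "AI wins against grandmasters" from the continuous trend.
crossing_year = (HUMAN_PRO_ELO - intercept) / slope
print(f"Elo gain ≈ {slope:.0f}/year; projected crossing ≈ {crossing_year:.1f}")
```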
I think it’s the nature of every product that comes on the market that it will experience a discontinuity from having zero revenue to having some revenue at some point. It’s an interesting question of when that will happen, and maybe your point is simply that it’s hard to predict when that will happen when you just look at the Penn Treebank trend.
However, I suspect that the revenue curve will look pretty continuous, now that it’s gone from zero to one. Do you disagree?
In a world with continuous, gradual progress across a ton of metrics, you’re going to get discontinuities from zero to one. I don’t think anyone from the Paul camp disagrees with that (in fact, Katja Grace talked about this in her article). From the continuous takeoff perspective, these discontinuities don’t seem very relevant unless going from zero to one is very important in a qualitative sense. But I would contend that going from “no revenue” to “some revenue” is not actually that meaningful in the sense of distinguishing AI from the large class of other economic products that have gradual development curves.
This is a big part of my point; a smaller elaboration is that it can be easy to trick yourself into thinking that, because you understand what will happen with PTB, you’ll understand what will happen with economics/security/etc., when in fact you don’t have much understanding of the connection between those, and there might be significant discontinuities. [To be clear, I don’t have much understanding of this either; I wish I did!]
For example, I imagine that, by thirty years from now, we’ll have language/code models that can do significant security analysis of the code that was available in 2020, and that this would have been highly relevant/valuable to people in 2020 interested in computer security. But when will this happen in the 2020-2050 range that seems likely to me? I’m pretty uncertain, and I expect this to look a lot like ‘flicking a switch’ in retrospect, even tho the leadup to flicking that switch will probably look like smoothly increasing capabilities on ‘toy’ problems.
[My current guess is that Paul / people in “Paul’s camp” would mostly agree with the previous paragraph, except for thinking that it’s sort of weird to focus on specifically AI computer security productivity, rather than the overall productivity of the computer security ecosystem, and this misplaced focus will generate the ‘flipping the switch’ impression. I think most of the disagreements are about ‘where to place the focus’, and this is one of the reasons it’s hard to find bets; it seems to me like Eliezer doesn’t care much about the lines Paul is drawing, and Paul doesn’t care much about the lines Eliezer is drawing.]
I think I agree in a narrow sense and disagree in a broad sense. For this particular example, I expect OpenAI’s revenues from GPT-3 to look roughly continuous now that they’re selling/licensing it at all (until another major change happens; like, the introduction of a competitor would likely cause the revenue trend to change).
More generally, suppose we looked at something like “the total economic value of horses over the course of human history”. I think we would see mostly smooth trends plus some implied starting and stopping points for those trends. (Like, “first domestication of a horse” probably starts a positive trend, “invention of stirrups” probably starts another positive trend, “introduction of horses to America” starts another positive trend, “invention of the automobile” probably starts a negative trend that ends with “last horse that gets replaced by a tractor/car”.)
In my view, ‘understanding the world’ looks like having a causal model that you can imagine variations on (and have those imaginations be meaningfully grounded in reality), and I think the bits that are most useful for building that causal model are the starts and stops of the trends, rather than the smooth adoption curves or mostly steady equilibria in between. So it seems sort of backwards to me to say that for most of the time, most of the changes in the graph are smooth, because what I want out of the graph is to figure out the underlying generator, where the non-smooth bits are the most informative. The graph itself only seems useful as a means to that end, rather than an end in itself.
Yeah, these are interesting points.
I sympathize with this view, and I agree there is some element of truth to it that may point to a fundamental gap in our understanding (or at least in mine). But I’m not sure I entirely agree that discontinuous capabilities are necessarily hard to measure: for example, there are benchmarks available for things like arithmetic, which one can train on and make quantitative statements about.
I think the key to the discontinuity question is rather that 1) it’s the jumps in model scaling that are happening in discrete increments; and 2) everything is S-curves, and a discontinuity always has a linear regime if you zoom in enough. Those two things together mean that, while a capability like arithmetic might have a continuous performance regime on some domain, in reality you can find yourself halfway up the performance curve in a single scaling jump (and this is in fact what happened with arithmetic and GPT-3). So the risk, as I understand it, is that you end up surprisingly far up the scale of “world-ending” capability from one generation to the next, with no detectable warning shot beforehand.
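As a toy illustration of 1) and 2) together (the curve shape and all its constants are invented, not fitted to anything real): if you sample a continuous S-curve only at discrete ~10x scaling jumps, a single jump can cover most of the transition.

```python
import math

def capability(log10_params: float) -> float:
    """Toy logistic 'capability' score as a function of log10(parameter count).
    Midpoint and steepness are invented purely for illustration."""
    midpoint, steepness = 10.5, 3.0   # assumed: transition centered near 3e10 params
    return 1.0 / (1.0 + math.exp(-steepness * (log10_params - midpoint)))

# Model generations arrive in discrete ~10x jumps, so we only ever observe the
# underlying continuous S-curve at a few widely spaced points.
for params in (1.5e8, 1.5e9, 1.5e10, 1.5e11):
    print(f"{params:.1e} params -> capability {capability(math.log10(params)):.2f}")
# The last 10x jump lands most of the way up the curve: continuous underneath,
# but discontinuous-looking at the granularity we actually get to observe.
```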
No, you’re right as far as I know; at least I’m not aware of any such attempted predictions. And in fact, the very absence of such prediction attempts is interesting in itself. One would imagine that correctly predicting the capabilities of an AI from its scale ought to be a phenomenally valuable skill — not just from a safety standpoint, but from an economic one too. So why, indeed, didn’t we see people make such predictions, or at least try to?
There could be several reasons. For example, perhaps Paul (and other folks who subscribe to the “continuum” world-model) could have done it, but they were unaware of the enormous value of their predictive abilities. That seems implausible, so let’s assume they knew the value of such predictions would be huge. But if you know the value of doing something is huge, why aren’t you doing it? Well, if you’re rational, there’s only one reason: you aren’t doing it because it’s too hard, or otherwise too expensive compared to your alternatives. So we are forced to conclude that this world-model — by its own implied self-assessment — has, so far, proved inadequate to generate predictions about the kinds of capabilities we really care about.
(Note: you could make the argument that OpenAI did make such a prediction, in the approximate yet very strong sense that they bet big on a meaningful increase in aggregate capabilities from scale, and won. You could also make the argument that Paul, having been at OpenAI during the critical period, deserves some credit for that decision. I’m not aware of Paul ever making this argument, but if made, it would be a point in favor of such a view and against my argument above.)
Can I try to parse out what you’re saying about stacked sigmoids? Because it seems weird to me. Like, in that view, it still seems like showing a trendline is some evidence that it’s not “interesting”. I feel like this because I expect the asymptote of the AlphaGo sigmoid to be independent of MCTS bots, so surely you should see some trends where AlphaGo (or equivalent) was invented first, and jumped the trendline up really fast. So not seeing jumps should indicate that it is more a gradual progression, because otherwise, if they were independent, about half the time the more powerful technique should come first.
The “what counterargument can I come up with” part of me says, tho, that how quickly the sigmoid grows likely depends on lots of external factors (like compute available or something). So instead of sometimes seeing a sigmoid that grows twice as fast as the previous ones, you should expect one that’s not just twice as tall, but twice as wide, too. And in that case, you should expect the “AlphaGo was invented first” sigmoid to be under the MCTS-bots sigmoid for some parts of the graph, before it eventually reaches the same asymptote AlphaGo reaches in the mainline. So, if we’re in the world where AlphaGo is invented first, you can still make gains by inventing MCTS bots, which will also set the trendline. And so, seeing a jump would be less “AlphaGo was invented first” and more “MCTS bots were never invented during the long time when they would’ve outcompeted AlphaGo version −1”.
Does that seem accurate, or am I still missing something?