Are Emergent Abilities of Large Language Models a Mirage? [linkpost]

Matthew Barnett 2 May 2023 21:01 UTC

53 points

Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intriguing is two-fold: their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales. Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, one can choose a metric which leads to the inference of an emergent ability or another metric which does not. Thus, our alternative suggests that existing claims of emergent abilities are creations of the researcher’s analyses, not fundamental changes in model behavior on specific tasks with scale. We present our explanation in a simple mathematical model, then test it in three complementary ways: we (1) make, test and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with claimed emergent abilities, (2) make, test and confirm two predictions about metric choices in a meta-analysis of emergent abilities on BIG-Bench; and (3) show how similar metric decisions suggest apparent emergent abilities on vision tasks in diverse deep network architectures (convolutional, autoencoder, transformers). In all three analyses, we find strong supporting evidence that emergent abilities may not be a fundamental property of scaling AI models.

This result seems important for two reasons:

1. If AI abilities are predictable, then we can forecast when we’ll get dangerous capabilities ahead of time, rather than being taken by surprise. This result strengthens the case for a research program of devising a ton of interesting benchmarks to measure how capabilities are improving as a function of scale.
2. It provides some evidence against the idea that “understanding is discontinuous”, and that important AI abilities will suddenly click together at some level, which is a very loose description of what I understood to be one of the primary intuitions behind AI foom.



Tags: Emergent Behavior (Emergence), AI

habryka 2 May 2023 22:57 UTC
22 points
12
I think there is something valuable in this kind of work, but also, my reaction to this continues to be pretty similar to Gwern’s and Eliezer’s reaction to similar discussions of forecasting AI progress:
The impact of GPT-3 had nothing whatsoever to do with its perplexity on Penn Treebank. I think this is a good example of why focusing on perplexity and ‘straight lines on graph go brr’ is so terrible, such cargo cult mystical thinking, and crippling. There’s something astonishing to see someone resort to explaining away GPT-3’s impact as ‘OpenAI was just good at marketing the results’. Said marketing consisted of: ‘dropping a paper on Arxiv’. Not even tweeting it! They didn’t even tweet the paper! (Forget an OA blog post, accompanying NYT/TR articles, tweets by everyone at OA, a fancy interactive interface—none of that.) And most of the initial reaction was “GPT-3: A Disappointing Paper”-style. If this is marketing genius, then it is truly 40-d chess, is all I can say.
The impact of GPT-3 was in establishing that trendlines did continue in a way that shocked pretty much everyone who’d written off ‘naive’ scaling strategies. Progress is made out of stacked sigmoids: if the next sigmoid doesn’t show up, progress doesn’t happen. Trends happen, until they stop. Trendlines are not caused by the laws of physics. You can dismiss AlphaGo by saying “oh, that just continues the trendline in ELO I just drew based on MCTS bots”, but the fact remains that MCTS progress had stagnated, and here we are in 2021, and pure MCTS approaches do not approach human champions, much less beat them. (This is also true of SVMs. Notice SVMs solving ImageNet because the trendlines continued? No, of course you did not. It drives me bonkers to see AI Impacts etc make arguments like “deep learning is unimportant because look, ImageNet follows a trendline”. Sheer numerology.) Appealing to trendlines is roughly as informative as “calories in calories out”; ‘the trend continued because the trend continued’. A new sigmoid being discovered is extremely important.
GPT-3 further showed completely unpredicted emergence of capabilities across downstream tasks which are not measured in PTB perplexity. There is nothing obvious about a PTB BPC of 0.80 that causes it to be useful where 0.90 is largely useless and 0.95 is a laughable toy. (OAers may have had faith in scaling, but they could not have told you in 2015 that interesting behavior would start at 𝒪(1b), and it’d get really cool at 𝒪(100b).) That’s why it’s such a useless metric. There’s only one thing that a PTB perplexity can tell you, under the pretraining paradigm: when you have reached human AGI level. (Which is useless for obvious reasons: much like saying that “if you hear the revolver click, the bullet wasn’t in that chamber and it was safe”. Surely true, but a bit late.) It tells you nothing about intermediate levels. I’m reminded of the Steven Kaas line:
Why idly theorize when you can JUST CHECK and find out the ACTUAL ANSWER to a superficially similar-sounding question SCIENTIFICALLY?
Using PTB, and talking only about perplexity, is a precise answer to the wrong question. (This is a much better argument when it comes to AlphaGo/ELO, because at least there, ‘ELO’ is in fact the ultimate objective, and not a proxy pretext. But perplexity is of no interest to anyone except an information theorist. Unfortunately, we lack any ‘take-over-the-world-ELO’ we can benchmark models on and extrapolate there. If we did and there was a smooth curve, I would indeed agree that we should adopt that as the baseline. But the closest things we have to downstream tasks are all wildly jumpy—even superimposing scores of downstream tasks barely gives you a recognizable smooth curve, and certainly nothing remotely as smooth as the perplexity curve. My belief is that this is because the overall perplexity curve comes from hundreds or thousands of stacked sigmoids and plateau/breakthroughs averaging out in terms of prediction improvements.) It sure would be convenient if the only number that mattered in AI or its real-world impact or risk was also the single easiest one to measure!
I emphasized this poverty of extrapolation in my scaling hypothesis writeup already, but permit me to vent a little more here:
“So, you’re forecasting AI progress using PTB perplexity/BPC. Cool, good work, nice notebook, surely this must be useful for forecasting on substantive AI safety/capability questions of interest to us. I see it’s a pretty straight line on a graph. OK, can you tell me at what BPC a large language model could do stuff like hack computers and escape onto the Internet?”
“No. I can tell you what happens if I draw the line out x units, though.”
“Perhaps that’s an unfairly specific question to ask, as important as it is. OK, can you tell me when we can expect to see well-known benchmarks like Winograd schemas be solved?”
“No. I can draw you a line on PTB to estimate when PTB is solved, though, if you give me a second and define a bound for ‘PTB is solved’.”
“Hm. Can you at least tell me when we can expect to see meta-learning emerge, with good few-shot learning—does the graph predict 0.1b, 1b, 10b, 100b, or what?”
“No idea.”
“Do you know what capabilities will be next to emerge? We got pretty good programming performance in Copilot at 𝒪(100b), what’s next?”
“I don’t know.”
“Can you qualitatively describe what we’d get at 1t, or 10t?”
“No, but I can draw the line in perplexity. It gets pretty low.”
“How about the existence of any increasing returns to scale in downstream tasks? Does it tell us anything about spikes in capabilities (such as we observe in many places, such as text style transfer and inner monologue in LaMDA at 100b; most recently BIG-bench)? Such as whether there are any more spikes past 𝒪(100b), whether we’ll see holdouts like causality suddenly fall at 𝒪(1000b), anything like that?”
“No.”
“How about RL: what sort of world modeling can we get by plugging them into DRL agents?”
“I don’t know.”
“Fine, let’s leave it at tool AIs doing text in text out. Can you tell me how much economic value will be driven by dropping another 0.01 BPC?”
“No. I can tell you how much it’d cost in GPU-time, though, by the awesome power of drawing lines!”
“OK, how about that: how low does it need to go to support a multi-billion dollar company running something like the OA API, to defray the next 0.01 drop and pay for the GPU-time to get more drops?”
“No idea.”
“How do you know BPC is the right metric to use?”
“Oh, we have lots of theories about it, but I’ll level with you: we always have theories for everything, but really, we chose BPC post hoc out of a few thousand metrics littering Arxiv like BLEU, ROUGE, SSA etc after seeing that it worked and better BPC = better models.”
“Can you write down your predictions about any of this?”
“Absolutely not.”
“Can anyone else?”
“Sure. But they’re all terribly busy.”
“Did you write down your predictions before now, then?”
“Oh gosh no, I wasn’t around then.”
“Did… someone… else… write down their predictions before?”
“Not that I’m aware of.”
“Ugh. Fine, what can you tell me about AI safety/risk/capabilities/economics/societal-disruption with these analyses of absolute loss?”
“Lines go straight?”
Seems to me that instead of gradualist narratives it would be preferable to say with Socrates that we are wise about scaling only in that we know we know little & about the least.
I don’t know what to do with some kind of abstract graphs that continue in a straight line, if I don’t know how performance on that abstract graph is related to actual concrete tasks whose performance I care a lot about.
I don’t know at what level of perplexity you can refactor codebases autonomously. I don’t know at what level of perplexity you can do novel biology research and develop novel pathogens. I don’t know at what level of perplexity I get a system that can meaningfully recursively improve itself and its training process.
It is still interesting that there might exist metrics on which progress over time is stable, though in the absence of finding how those metrics relate to the real world outcomes I care about, I don’t really know what to do with that.
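To make the "stacked sigmoids" picture from the quoted comment concrete, here is a toy sketch. It assumes, purely for illustration, that an aggregate metric behaves like the mean of many logistic curves whose midpoints are spread evenly over log-scale. The task count, sharpness, and ranges are made-up parameters, not measurements, and `task_curve` and `aggregate_curve` are hypothetical names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def task_curve(log_scale, midpoint, sharpness=8.0):
    """One hypothetical capability: near 0 below its midpoint, near 1 above it."""
    return sigmoid(sharpness * (log_scale - midpoint))

def aggregate_curve(log_scale, midpoints):
    """Mean over many tasks; stands in for a smooth aggregate metric."""
    return sum(task_curve(log_scale, m) for m in midpoints) / len(midpoints)

def max_step(values):
    """Largest change between neighbouring grid points."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))

# Assumed setup: 200 tasks, midpoints spread evenly over log10(scale) in [0, 10].
midpoints = [10 * (i + 0.5) / 200 for i in range(200)]
grid = [i * 0.25 for i in range(41)]  # log-scale 0.0 .. 10.0 in steps of 0.25

single = [task_curve(x, midpoints[100]) for x in grid]
total = [aggregate_curve(x, midpoints) for x in grid]

single_jump = max_step(single)  # a single task moves a lot in one grid step
total_jump = max_step(total)    # the aggregate moves only a little per step
```

Under these assumptions the aggregate is close to a straight line while each individual task still switches on abruptly, which is the gap between a smooth curve and the concrete tasks that the comment above is pointing at. Knowing the aggregate tells you nothing about which midpoint belongs to which task.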
- TurnTrout 3 May 2023 16:44 UTC
  14 points
  5
  Parent
  though in the absence of finding how those metrics relate to the real world outcomes I care about, I don’t really know what to do with that.
  I wrote a shortform on this:
  Are there convergently-ordered developmental milestones for AI? I suspect there may be convergent orderings in which AI capabilities emerge. For example, it seems that LMs develop syntax before semantics, but maybe there’s an even more detailed ordering relative to a fixed dataset. And in embodied tasks with spatial navigation and recurrent memory, there may be an order in which enduring spatial awareness emerges (i.e. “object permanence”).
  [A bunch of evidence...]
  We might even end up in the world where AI also follows the crow/human/animal developmental milestone ordering, at least roughly up until general intelligence. If so, we could better estimate timelines to AGI by watching how far the AI progresses on the known developmental ordering.
  ETA: I think that this is seriously dependent on the training data modalities; GPT-4 does not have spatial awareness. I think the informativeness of convergently ordered developmental milestones is seriously reduced because we seem to be in the “spam LLM progress” world, and not the “train multiagent RL setups in simulated 3D environments” world.
  - jacob_cannell 4 May 2023 16:17 UTC
    2 points
    0
    Parent
    
    I think the informativeness of convergently ordered developmental milestones is seriously reduced because we seem to be in the “spam LLM progress” world, and not the “train multiagent RL setups in simulated 3D environments” world.
    
    Deepmind was very much on that latter path.
    - TurnTrout 8 May 2023 21:43 UTC
      2 points
      0
      Parent
      Agreed, but that path is far less successful right now.
      - Mo Putera 6 Jun 2023 17:41 UTC
        1 point
        0
        Parent
        What can I read to learn more about why that path was less successful?
- Matthew Barnett 2 May 2023 23:38 UTC
  11 points
  1
  Parent
  I don’t know what to do with some kind of abstract graphs that continue in a straight line, if I don’t know how performance on that abstract graph is related to actual concrete tasks whose performance I care a lot about.
  If you have some task like “ability to do hacking” and you think it’s well measured by some benchmark (which seems like something we could plausibly design), then this result seems to indicate that performance on this task will scale predictably with scale, as long as you know how to do the right measurement to adjust for non-linear scaling.
  In other words, as long as you know how performance will increase with scale, you could fairly precisely predict what scale is necessary to obtain some arbitrary level of performance on a well-measured metric, before you’ve actually reached that level of scale. That seems like a useful thing to know for many of the same reasons found in your comment.
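A minimal sketch of the kind of extrapolation described above, assuming you already have a continuous metric that is roughly linear in log10(training FLOP). The data points are invented for illustration, and `fit_line` and `scale_needed` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def scale_needed(slope, intercept, target):
    """Invert the fitted line: log10(FLOP) at which the metric reaches target."""
    return (target - intercept) / slope

# Invented observations: (log10 training FLOP, score on a continuous metric).
log_flop = [20.0, 21.0, 22.0, 23.0]
score = [0.10, 0.20, 0.30, 0.40]

slope, intercept = fit_line(log_flop, score)
predicted_log_flop = scale_needed(slope, intercept, target=0.90)
```

The prediction is only as good as the assumption that the line keeps going beyond the observed range, and that the metric still tracks the task you care about once it gets there.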
  - habryka 3 May 2023 1:29 UTC
    10 points
    7
    Parent
    If you have some task like “ability to do hacking” and you think it’s well measured by some benchmark (which seems like something we could plausibly design), then this result seems to indicate that performance on this task will scale predictably with scale, as long as you know how to do the right measurement to adjust for non-linear scaling.
    Yes, but I think that’s exactly what I haven’t seen. When I’ve seen benchmarks that try to do this, I’ve seen either:
    That specific benchmark is not actually very smooth OR
    The relationship of that benchmark to the task at hand came apart at an unexpected time
    Though to be clear, I also haven’t really seen anyone try this very hard (and the data I’ve seen has come more from trying to forecast things like videogames and Go performance, which haven’t seen much data in recent years where things are maybe more stable).
    As far as I can tell this paper doesn’t really talk about this though. Though maybe I’ve missed something. I’ve only skimmed it.
    - Matthew Barnett 3 May 2023 2:29 UTC
      17 points
      3
      Parent
      
      Yes, but I think that’s exactly what I haven’t seen. When I’ve seen benchmarks that try to do this, I’ve seen either:
      
      That specific benchmark is not actually very smooth OR The relationship of that benchmark to the task at hand came apart at an unexpected time
      
      Can you give some examples?
      
      I don’t think people have created good benchmarks for things like “ability to hack into computers” but I suspect this is partly because relatively little effort has gone into making good benchmarks IMO. Even for relatively basic things like mathematical problem solving, we have very few high quality benchmarks, and this doesn’t seem explained by people trying hard but failing. I suspect we just don’t have that much effort going into creating good benchmarks.
      
      But we do have lots of benchmarks for non-useful things, and the paper is just saying that these benchmarks show smooth performance.
      
      Insofar as you’re saying that progress on existing benchmarks doesn’t actually look smooth, it sounds like you’re not responding to the contribution of the paper, which was that you can perform a simple modification to the performance metric to make performance look smooth as a function of scale (e.g. rather than looking at accuracy you can look at edit distance). Perhaps you disagree, but I think the results in this paper straightforwardly undermine the idea that progress has been non-smooth as measured by benchmarks.
      
      I’d particularly like to see a specific example of “relationship of that benchmark to the task at hand came apart at an unexpected time”.
      - habryka 4 May 2023 22:10 UTC
        5 points
        0
        Parent
        Can you give some examples?
        Sorry for not responding to this. Examples do seem great, though digging up the exact charts I remember has turned out to be a bit of a longer time investment.
        Some quick things I remembered feeling not that informative:
        Go performance measured in ELO felt pretty hard to forecast from this kind of graph
        Things like “When does chain-of-thought reasoning work?” for LLMs
        LLM performance on various arithmetic tasks
        Things like AlphaFold, where I feel like there was basically no precursor. I remember there being forecasts about DL and protein folding, and I feel like none of them were very informative about when it would actually fall.
        Sorry again for not linking to things. I might get around to writing a post on this, since I do think it really deserves more exploration, but time is short these days.
      - Amalthea 4 May 2023 7:42 UTC
        2 points
        1
        Parent
        I think asking for non-smoothness to call something an emergent property is unreasonable. If a performance graph is precisely an S-curve along a reasonable metric, it is reasonable to call that emergent, even though it is perfectly smooth and you can reparametrize it to seem linear, etc.

        I haven’t looked at the paper to see what its substance is, but from the description alone it could be a mathematical sleight of hand.
        Matthew Barnett 5 May 2023 19:57 UTC
        3 points
        0
        Parent
        Couldn’t the opposite critique easily be made? If some metric looks linear, then you could easily reparameterize it to make it look non-linear, and then call it emergent. That makes any claim about emergence trivial, if all you mean by emergence is that it arises non-linearly.
        The central claim about emergent abilities, as I understood it, was that such abilities cannot be predicted ahead of time. But the fact that you can reparameterize any metric to make it linear, and then predict when it will reach some threshold seems like an extremely important fact, if true.
        Compare two possible claims about some emergent ability:
        1. “At the 10^28 training FLOP level, LLMs will suddenly get the ability to hack into computers competently.”
        2. “At some training FLOP level—which cannot be predicted ahead of time—LLMs will suddenly get the ability to hack into computers competently.”
        Both claims are worrisome, since both imply that at some point we will go from having LLMs that can’t hack into other computers, to LLMs that can. But I would be way more worried if the second claim is true, compared to the first.
        gwern 6 May 2023 2:25 UTC
        12 points
        0
        Parent
        
        The central claim about emergent abilities, as I understood it, was that such abilities cannot be predicted ahead of time. But the fact that you can reparameterize any metric to make it linear, and then predict when it will reach some threshold seems like an extremely important fact, if true.
        
        Of course you can pick a reparameterization in hindsight, but without the benefit of hindsight, which reparameterization, exactly...?
        
        What is interesting about emergence is that it happens on ‘natural’ parameterizations of metrics, the ones people come up with in advance of knowing the results from scaling, as opposed to retrodicting/curve-fitting ad hoc measures to make an emergence go away. No one designed any of these Big-Bench or other tasks to display emergence, and most of the initial dozen or so examples weren’t even particularly highlighted by the original authors back when I was collecting them to try to convince people that this was an actual thing which actually happened and was worth trying to understand (particularly connections to inner-monologue, hidden scaling, and U-shaped scaling).
        
        When emergence happens on an obvious natural metric like accuracy, chosen independently of any scaling considerations at all, which often maps onto real world rewards and loss functions, then I am surprised. When un-emergence is retrodicted by the choice of metrics like… [checks notes]… ‘arithmetic accuracy expressed as a function of edit distance on BPE tokens’ (and a different one for each un-emergence) in order to explain away previously observed emergence and this retrodiction is being advertised to all and sundry as evidence of ‘predicting emergence’, then I am surprised in an entirely different way.
        Matthew Barnett 8 May 2023 18:13 UTC
        0 points
        −2
        Parent
        What is interesting about emergence is that it happens on ‘natural’ parameterizations of metrics, the ones people come up with in advance of knowing the results from scaling, as opposed to retrodicting/curve-fitting ad hoc measures to make an emergence go away.
        It’s not clear to me that edit distance or Brier score are much less natural metrics than accuracy or multiple choice grade. I agree that we should have a presumption here since accuracy and multiple choice grade were chosen first, but the presumption seems pretty weak to me.
        I find it easy to imagine wanting to give a model partial credit for giving answers that are close to correct even before knowing anything about emergence. One plausible theory is that awarding partial credit might not have been salient to researchers because it’s not normally how we evaluate human students. But, our choice for how we evaluate human students seems more a function of evaluation costs and lack of access to output probabilities than anything deep about measuring performance.
        For these reasons, I don’t really find the metrics used in the papers ad hoc, except to the extent that “award partial credit for answers that are close to correct” is ad hoc. One prediction I’d probably make is that if we continue to use the same measures (token edit distance and Brier score) then we’ll continue to see non-discontinuous progress on most benchmarks, by these measures. If true, that would at least partially falsify the claim that we were merely doing post-hoc curve fitting.
        ETA: the paper says that in >92% of cases, emergence is only observed on two metrics: (1) “Multiple Choice Grade”, and (2) “Exact String Match”. I agree that Multiple Choice Grade is a fairly “natural” metric, but “Exact String Match” is less natural, and it doesn’t seem very interesting to me that we see emergence under that choice.
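For readers who have not seen the two metrics side by side, here is a small sketch of why a thresholded metric can jump while a probability-based one moves gradually. It uses a single invented two-option question and an invented sequence of probabilities; the function names are hypothetical and are not taken from BIG-Bench code.

```python
def multiple_choice_grade(probs, correct_index):
    """1.0 if the highest-probability option is the correct one, else 0.0."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return 1.0 if best == correct_index else 0.0

def brier_score(probs, correct_index):
    """Multi-class Brier score: squared error against the one-hot answer (lower is better)."""
    return sum((p - (1.0 if i == correct_index else 0.0)) ** 2
               for i, p in enumerate(probs))

# Invented model family: probability on the correct option rises steadily with scale.
p_correct_by_scale = [0.30, 0.40, 0.45, 0.55, 0.60, 0.70]

grades = []
briers = []
for p in p_correct_by_scale:
    probs = [p, 1.0 - p]  # two options; index 0 is the correct one
    grades.append(multiple_choice_grade(probs, 0))
    briers.append(brier_score(probs, 0))
```

The grade flips from 0 to 1 the moment the correct option overtakes the distractor, while the Brier score improves at every step. Whether that makes Brier score the more "natural" choice is exactly what is being argued in this thread; the sketch only shows the mechanical difference.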
        gwern 16 Aug 2025 4:08 UTC
        2 points
        0
        Parent
        
        For these reasons, I don’t really find the metrics used in the papers ad hoc, except to the extent that “award partial credit for answers that are close to correct” is ad hoc.
        
        If they are not ad hoc, what have they successfully predicted in the two and a half years since they concocted their original reparameterizations like ‘BPE edit distance’ to ‘explain’ past emergence?
        Amalthea 6 May 2023 7:01 UTC
        1 point
        0
        Parent
        1. You can reparametrize any monotonic function to make it linear.
        2. This can be used to predict the function.

        These are wildly different claims. The point is that it’s always easy to do 1. in retrospect, and this has no bearing whatsoever on 2.
        
        I think we would agree that (Log-) Flops or parameters or some mild combination of those would count as a reasonable metric?
        
        I’m not a statistician, but from what I know it should be extremely hard to predict S-curves before their inflection point, in particular if there’s no guarantee that what you’re predicting is literally a logistic function.
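A toy illustration of the point about S-curves, assuming exact logistic curves with growth rate 1 (which real benchmark curves need not be). The two curves below are invented: they saturate at levels ten times apart, yet their early portions nearly coincide, so data from before the inflection point cannot tell them apart.

```python
import math

def logistic(x, ceiling, midpoint):
    """Logistic curve with growth rate 1 that saturates at ceiling."""
    return ceiling / (1.0 + math.exp(-(x - midpoint)))

def curve_a(x):
    # Saturates at 1.
    return logistic(x, ceiling=1.0, midpoint=10.0)

def curve_b(x):
    # Saturates at 10; midpoint shifted so that early on both look like exp(x - 10).
    return logistic(x, ceiling=10.0, midpoint=10.0 + math.log(10.0))

early = [i * 0.5 for i in range(11)]  # x = 0.0 .. 5.0, well before either midpoint

# Largest relative difference between the two curves on the early range.
max_early_gap = max(abs(curve_a(x) - curve_b(x)) / curve_a(x) for x in early)

# Ratio of the two curves long after both have saturated.
late_ratio = curve_b(20.0) / curve_a(20.0)
```

On the early range the curves differ by well under one percent, while far past the inflection point they differ by roughly a factor of ten. Any realistic measurement noise would swamp the early difference.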
        
        That being said, trying to create benchmarks for all kinds of tasks seems like a reasonable thing to do in any case.
Neel Nanda 3 May 2023 21:20 UTC
16 points
9
I’m so torn on this paper. I think it makes a reasonable point that many claims of emergence are overrated and that it’s easy to massage metrics into a single narrative. But I also think the title and abstract are overclaiming clickbait: obviously models have emergent abilities!! Chain of thought and few-shot learning are just not things smaller models can do, accuracy is sometimes the right metric, etc. Emergence is often overhyped, but this paper way overclaims.
jacob_cannell 4 May 2023 16:26 UTC
8 points
0
In the Quanta Theory of Neural Scaling, individual token tasks (quanta) occupy some continuum between monogenic (non-linear/emergent) and polygenic (smooth linear). Seems reasonable that some tasks have circuit solution dependencies that work out to being more multiplicative/combinatoric than additive, i.e. circuit Z requires both X and Y, rather than X or Y.
Aaron_Scher 4 May 2023 2:14 UTC
6 points
2
Strong upvote because I want to signal boost this paper, though I think “It provides some evidence against the idea that ‘understanding is discontinuous’” is too strong and this is actually very weak evidence.
Main ideas:
Emergent abilities, defined as being sharp and unpredictable, sometimes go away when we adopt different measurement techniques, or at least they become meaningfully less sharp and unpredictable.
Changing from non-linear/discontinuous metrics (e.g., Accuracy, Multiple Choice Grade) to linear/continuous metrics (e.g., Token Edit Distance, Brier Score) can cause lots of emergent abilities to disappear; Figure 3, much of the paper.
The authors find support for this hypothesis via:
- Using different metrics for GPT math performance and observing the results, finding that performance can look much less sharp/unpredictable with different metrics
- Meta-analysis: Understanding alleged emergent abilities in BIG-Bench, finding that there is not very much of it and 92% of emergent abilities appear when the metric is Multiple Choice Grade or Exact String Match; these are metrics we would expect to behave discontinuously; Figure 5. Additionally, taking the BIG-Bench tasks LaMDA displays emergence on and switching from Multiple Choice Grade to Brier Score causes emergence to disappear
- Inducing emergence: Taking models and tasks which do not typically exhibit emergence and modifying the metric to elicit emergence. Figures 7, 8.
Sometimes emergent abilities go away when you use a larger test set (the small models were bad enough that their performance was rounding to zero on small test sets); Figure 4 compared to Figure 3 top. This may work even if you are still using a non-linear metric like Accuracy.
Observed emergent abilities may be in part due to sparsely sampling from models with lots of parameters (because it’s costly to train multiple); Figure 7.
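The test-set-size point above comes down to measurement resolution, which a few lines of arithmetic can show. This is a sketch under the assumption that examples are answered independently; the accuracy value and test-set sizes are invented, and the function names are hypothetical.

```python
def prob_all_wrong(true_accuracy, n_examples):
    """Chance that a model with this per-example accuracy scores exactly 0 on n examples."""
    return (1.0 - true_accuracy) ** n_examples

def smallest_nonzero_score(n_examples):
    """Resolution of an accuracy measurement: one correct answer out of n."""
    return 1.0 / n_examples

true_accuracy = 0.001  # invented small model: right about once in a thousand tries

small_test = prob_all_wrong(true_accuracy, 100)      # most runs report exactly 0
large_test = prob_all_wrong(true_accuracy, 10_000)   # a 0 score is now very unlikely
```

With 100 examples the small model reports zero accuracy about nine times in ten, so it looks like it has no ability at all. With 10,000 examples its small but nonzero accuracy becomes visible, and the curve across model sizes no longer starts from a flat zero.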
What I’m taking away besides the above:
I think this paper should give hope to those trying to detect deception and other dangerous model capabilities. While the downstream tasks we care about might be quite discontinuous in nature (we might be fine with an AI that can design up to 90% of a pathogen, but very dead at 100%), there is hope in identifying continuous metrics that we can measure which are correlated. It’s likely pretty hard to design such metrics, but we would be shooting ourselves in the foot to just go “oh deception will be emergent so there’s no way to predict it ahead of time.” This paper gives a couple ideas of approaches we might take to preventing that problem: designing more continuous and linear metrics, creating larger test sets, and sampling more large models.
The paper doesn’t say “emergence isn’t a thing, nothing to worry about here,” despite the provocative title. Instead, it gestures toward approaches we can take to make the unpredictable thing more predictable, and indicates that the current unpredictability is largely resolved through different metrics, which is exactly what we should be trying to do when we want to avoid dangerous capabilities.
Charlie Steiner 2 May 2023 22:56 UTC
6 points
0
Interesting stuff. The nonlinearity of requiring long sequences of tokens doesn’t seem to be a fatal objection to measuring long sequences, because often we’re interested in capabilities that really do require getting long sequences all correct. But from the perspective of predicting capabilities, this is definitely a point for team straight lines on graphs.
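The long-sequence nonlinearity mentioned here can be sketched in a few lines. The assumptions are loud ones: per-token errors are treated as independent, the sequence length is fixed at 10 tokens, and the per-token accuracies are invented. The expected number of wrong tokens is used as a rough stand-in for an edit-distance-style metric.

```python
def exact_match_rate(per_token_accuracy, length):
    """Probability that all tokens are right, assuming independent per-token errors."""
    return per_token_accuracy ** length

def expected_token_errors(per_token_accuracy, length):
    """Expected number of wrong tokens; a stand-in for an edit-distance-style metric."""
    return length * (1.0 - per_token_accuracy)

length = 10  # assumed target length in tokens
per_token = [0.5, 0.6, 0.7, 0.8, 0.9, 0.99]  # invented, rising smoothly with scale

exact = [exact_match_rate(p, length) for p in per_token]
errors = [expected_token_errors(p, length) for p in per_token]
```

Exact match stays near zero for most of the range and then climbs steeply, while expected token errors fall in a straight line. As the comment says, if the capability you care about really does need the whole sequence right, the sharp curve is still the one that matters; the linear one is just easier to extrapolate.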
Daniel Paleka 4 May 2023 10:06 UTC
5 points
3
Jason Wei responded at https://www.jasonwei.net/blog/common-arguments-regarding-emergent-abilities.
My thoughts: It is true that some metrics increase smoothly and some don’t. The issue is that some important capabilities are inherently all-or-nothing, and we haven’t yet found surrogate metrics which increase smoothly and correlate with things we care about.
What we want is: for a given capability, predicting whether this capability happens in the model that is being trained.
If extrapolating a smoothly increasing surrogate metric can do that, then emergence of that capability is indeed a mirage. Otherwise, Betteridge’s law of headlines applies.
Noosphere89 15 Aug 2025 20:08 UTC
2 points
0
One area where I’ve changed my mind on emergent capabilities is that I now think most emergent capabilities really were us not realizing how large the Internet truly was, and not realizing how much data GPT-3 and GPT-4 had.
The even more deflationary hypothesis is that most of the emergent capabilities were basically data-contamination.
Here’s an example of how easy it is to data-contaminate LLMs, where it’s very easy to give models near-perfect replications of questions in the test set:
AIME I 2025: A Cautionary Tale About Math Benchmarks and Data Contamination
AIME 2025 part I was conducted yesterday, and the scores of some language models are available here:
https://matharena.ai thanks to @mbalunovic, @ni_jovanovic et al.
I have to say I was impressed, as I predicted the smaller distilled models would crash and burn, but they actually scored at a reasonable 25-50%.
That was surprising to me! Since these are new problems, not seen during training, right? I expected smaller models to barely score above 0%. It’s really hard to believe that a 1.5B model can solve pre-math olympiad problems when it can’t multiply 3-digit numbers. I was wrong, I guess.
I then used OpenAI’s Deep Research to see if similar problems to those in AIME 2025 exist on the internet. And guess what? An identical problem to Q1 of AIME 2025 exists on Quora:
https://quora.com/In-what-bases-b-does-b-7-divide-into-9b-7-without-any-remainder
I thought maybe it was just coincidence, and used Deep Research again on Problem 3. And guess what? A very similar question was on math.stackexchange:
https://math.stackexchange.com/questions/3548821/
Still skeptical, I used Deep Research on Problem 5, and a near identical problem appears again on math.stackexchange:
https://math.stackexchange.com/questions/3146556/how-many-five-digit-numbers-formed-from-digits-1-2-3-4-5-used-exactly-once-a#:~:text=,are%20divisible%20by%20%2412
I haven’t checked beyond that because the freaking p-value is too low already. Problems near identical to the test set can be found online.
So, what—if anything—does this imply for Math benchmarks? And what does it imply for all the sudden hill climbing due to RL?
I’m not certain, and there is a reasonable argument that even if something in the train-set contains near-identical but not exact copies of test data, it’s still generalization. I am sympathetic to that. But, I also wouldn’t rule out that GRPO is amazing at sharpening memories along with math skills.
At the very least, the above shows that data decontamination is hard.
Never ever underestimate the amount of stuff you can find online. Practically everything exists online.
Unfortunately, the fact that companies won’t open their datasets makes it way too hard to actually study the issue of data contamination systematically.