Tamay

Karma: 734

I’m interested in the economics of computing and big-picture trends in machine learning. https://www.tamaybesiroglu.com/

Tamay Mar 4, 2025, 1:02 AM
1 point
0
in reply to: Daniel Kokotajlo’s comment on: Tamay’s Shortform
I had in mind just your original counterparty. Requiring you to become a public market maker seems like quite the commitment.

Tamay Jan 19, 2025, 2:45 AM
132 points
−4
in reply to: meemi’s comment on: meemi’s Shortform
Tamay from Epoch AI here.
We made a mistake in not being more transparent about OpenAI’s involvement. We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent to the benchmark contributors as soon as possible. Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset. We own this error and are committed to doing better in the future.
For future collaborations, we will strive to improve transparency wherever possible, ensuring contributors have clearer information about funding sources, data access, and usage purposes at the outset. While we did communicate that we received lab funding to some mathematicians, we didn’t do this systematically and did not name the lab we worked with. This inconsistent communication was a mistake. We should have pushed harder for the ability to be transparent about this partnership from the start, particularly with the mathematicians creating the problems.
Getting permission to disclose OpenAI’s involvement only around the o3 launch wasn’t good enough. Our mathematicians deserved to know who might have access to their work. Even though we were contractually limited in what we could say, we should have made transparency with our contributors a non-negotiable part of our agreement with OpenAI.
Regarding training usage: We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of an unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities. However, we have a verbal agreement that these materials will not be used in model training.
Relevant OpenAI employees’ public communications have described FrontierMath as a ‘strongly held out’ evaluation set. While this public positioning aligns with our understanding, I would also emphasize more broadly that labs benefit greatly from having truly uncontaminated test sets.
OpenAI has also been fully supportive of our decision to maintain a separate, unseen holdout set—an extra safeguard to prevent overfitting and ensure accurate progress measurement. From day one, FrontierMath was conceived and presented as an evaluation tool, and we believe these arrangements reflect that purpose.
[Edit: Clarified OpenAI’s data access—they do not have access to a separate holdout set that serves as an additional safeguard for independent verification.]

Tamay Nov 17, 2024, 3:53 PM
12 points
−3
on: Tamay’s Shortform
Short version: The claim that AI automation of software engineering will erase NVIDIA’s software advantage misunderstands that as markets expand, the rewards for further software improvements grow substantially. While AI may lower the cost of matching existing software capabilities, overall software project costs are likely to keep increasing as returns on optimization rise. Matching the frontier of performance in the future will still be expensive and technically challenging, and access to AI does not necessarily equalize production costs or eliminate NVIDIA’s moat.
I often see the argument that, since NVIDIA is largely software, when AI automates software, NVIDIA will have no moat, and therefore NVIDIA is a bad AI bet. The argument goes something like: AI drives down the cost of software, so the barriers to entry will be much lower. Competitors can “hire” AI to generate the required software by, for example, tasking LLMs with porting application-level code into appropriate low-level instructions, which would eliminate NVIDIA’s competitive advantage stemming from CUDA.
However, while the cost of matching existing software capabilities will decline, the overall costs of software projects are likely to continue increasing, as is the usual pattern. This is because, with software, the returns to optimization increase with the size of the addressable market. As the market expands, companies have greater incentives to invest intensely because even small improvements in performance or efficiency can yield substantial overall benefits. These improvements impact a large number of users, and the costs are amortized across this extensive user base.
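A toy model makes the scaling explicit. Assume, purely for illustration, that spending `s` on optimization yields a per-user gain of `gain_per_user * ln(1 + s)`, so net value is `users * gain_per_user * ln(1 + s) - s`. Setting the derivative to zero gives an optimal spend of `users * gain_per_user - 1`. The logarithmic form and all names and numbers below are assumptions chosen for the sketch, not estimates of any real market.

```python
import math

def net_value(spend, users, gain_per_user):
    """Total benefit summed over all users, minus the one-off optimization spend."""
    return users * gain_per_user * math.log(1.0 + spend) - spend

def optimal_spend(users, gain_per_user):
    """Closed-form maximizer of net_value: users * gain / (1 + s) - 1 = 0."""
    return max(0.0, users * gain_per_user - 1.0)

# Hypothetical market sizes; gain_per_user is an arbitrary illustrative constant.
for users in (1_000, 10_000, 100_000):
    s = optimal_spend(users, gain_per_user=0.01)
    print(f"{users:>7} users -> optimal spend {s:,.0f}")
```

Under these assumptions the optimal spend rises roughly tenfold with each tenfold increase in users, even though the cost of reaching any fixed level of performance stays the same.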
Consider web browsers and operating systems: while matching 2000s-era capabilities now takes >1000x fewer developer hours using modern frameworks, the investments that Google makes in Chrome and Microsoft in Windows vastly exceed what tech companies spent in the 2000s. Similarly, as AI becomes a larger part of the overall economy, I expect the investments needed for state-of-the-art GPU firmware and libraries to be greater than those today.
When software development is mostly AI-driven, there will be opportunities to optimize software with more spending, such as by spending on AI inference, building better scaffolding, or producing better ways of testing and verifying potential improvements. This just seems to match our understanding of inference scaling for other complex reasoning tasks, such as programming or mathematics.
It’s also unlikely that the relative cost of producing the same software will be much more equalized, i.e., that anyone can hire the same “AI” to do the engineering. Just having access to the raw models is often not sufficient for getting state-of-the-art results (good luck matching AlphaProof’s IMO performance with the Gemini API).
To be clear, I am personally not too optimistic about NVIDIA’s long-term future. There are good reasons to expect their moat won’t persist:
- Dethroning NVIDIA is now a trillion-dollar proposition, and their key customers are all trying to produce GPU substitutes
- Rapid technological progress tends to erode competitive advantages by enabling substitute technologies
- NVIDIA has had issues adopting new technologies, such as CoWoS-L packaging, and therefore appears less competent at staying ahead of its competition.
My claim is narrower: the argument that “when AI can automate software engineering, companies whose moat involves software will be outcompeted” seems incorrect.

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

Tamay Nov 14, 2024, 6:13 AM

33 points

0 comments 3 min read LW link

(epoch.ai)

Tamay May 31, 2024, 11:52 PM
25 points
0
on: Algorithmic Improvement Is Probably Faster Than Scaling Now
My guess is that compute scaling is probably more important when looking just at pre-training and upstream performance. When looking at innovations both pre- and post-training and at measures of downstream performance, the relative contributions are probably roughly evenly matched.

Compute for training runs is increasing at a rate of around 4-5x/year, which amounts to a doubling every 5-6 months, rather than every 10 months. This is what we found in the 2022 paper, and something we recently confirmed using 3x more data up to today.
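The conversion from an annual growth factor to a doubling time is just 12 · ln 2 / ln(growth). A minimal sketch of that arithmetic, with `doubling_time_months` being a helper name introduced here for illustration:

```python
import math

def doubling_time_months(growth_per_year: float) -> float:
    """Months needed to double, given a multiplicative growth factor per year."""
    return 12 * math.log(2) / math.log(growth_per_year)

for growth in (4.0, 5.0):
    months = doubling_time_months(growth)
    print(f"{growth:.0f}x/year -> doubling every {months:.1f} months")
```

Growth of 4x/year corresponds to a 6.0-month doubling time and 5x/year to about 5.2 months, which is where the 5-6 month range comes from.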
Algorithms and training techniques for language models seem to improve at a rate that amounts to a doubling of ‘effective compute’ about every 8 months, though, like our work on vision, this estimate has large error bars. Still, it’s likely to be slower than the 5-6 month doubling time for actual compute. These estimates suggest that compute scaling has been responsible for perhaps 2/3rds of performance gains over the 2014-2023 period, with algorithms + insights about optimal scaling + better data, etc. explaining the remaining 1/3rd.
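One crude way to sanity-check the split is to compare doublings per year from each source. This is a back-of-envelope sketch under the assumption that contributions add in log space; it is not the decomposition used to produce the figure above, and the function names are introduced here for illustration.

```python
def doublings_per_year(doubling_time_months: float) -> float:
    """Number of doublings that fit into twelve months."""
    return 12 / doubling_time_months

def compute_share(compute_doubling_months: float, algo_doubling_months: float) -> float:
    """Fraction of total effective-compute doublings attributable to physical compute."""
    c = doublings_per_year(compute_doubling_months)
    a = doublings_per_year(algo_doubling_months)
    return c / (c + a)

# Compute doubling every 5-6 months versus algorithms doubling every 8 months.
for months in (5.0, 6.0):
    print(f"compute doubling every {months:.0f} months -> share {compute_share(months, 8.0):.0%}")
```

With these inputs the ratio comes out at roughly 57% to 62%, a little below 2/3rds. The result is sensitive to the algorithmic doubling time, which has large error bars: a 10-month doubling for algorithms against 5 months for compute gives exactly 2/3rds.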

The estimates mentioned only account for the performance gains from pre-training, and do not consider the impact of post-training innovations. Some key post-training techniques, such as prompting, scaffolding, and finetuning, have been estimated to provide performance improvements ranging from 2 to 50 times in units of compute-equivalents, as shown in the plot below. However, these estimates vary substantially depending on the specific technique and domain, and are somewhat unreliable due to their scale-dependence.
Naively combining these with the pre-training estimates suggests that compute scaling likely still accounts for most of the performance gains, though the two look more evenly matched.
… and that was just in vision nets. I haven’t seen careful analysis of LLMs (probably because they’re newer, so harder to fit a trend), but eyeballing it… Chinchilla by itself must have been a factor-of-4 compute-equivalent improvement at least.
Incidentally, I looked into the claim about Chinchilla scaling. It turns out that Chinchilla was actually more like a factor 1.6 to 2 in compute-equivalent gain over Kaplan at the scale of models today (at least if you use the version of the scaling law that corrects a mistake the Chinchilla paper made when doing the estimation).

Tamay Apr 19, 2024, 3:57 AM
6 points
0
on: AI #60: Oh the Humanity
Sebastian Borgeaud, one of the lead authors of the Chinchilla scaling paper, admits there was a bug in their code. https://twitter.com/borgeaud_s/status/1780988694163321250
Claim that the Chinchilla paper calculated the implied scaling laws incorrectly. Yes, it seems entirely plausible that there was a mistake, tons of huge training runs relied on the incorrect result, and only now did someone realize this. Why do you ask?

Tamay Mar 14, 2024, 3:30 AM
3 points
0
in reply to: Ted Sanders’s comment on: Transformative AGI by 2043 is <1% likely
I’m interested. What bets would you offer?

Announcing Epoch’s newly expanded Parameters, Compute and Data Trends in Machine Learning database

Robi Rahman, Jaime Sevilla Molina, Tamay, Ege Erdil, Pablo Villalobos, Ben Cottier and Matthew Barnett

Oct 25, 2023, 2:55 AM

18 points

0 comments 1 min read LW link

(epochai.org)

Tamay Feb 5, 2023, 4:19 AM
7 points
on: Tamay’s Shortform
There is an insightful literature that documents and tries to explain why large incumbent tech firms fail to invest appropriately in disruptive technologies, even when they played an important role in their invention. I speculatively think this sheds some light on why we see new firms such as OpenAI rather than incumbents such as Google and Meta leading the deployment of recent innovations in AI, notably LLMs.
Disruptive technologies—technologies that initially fail to satisfy existing demands but later surpass the dominant technology—are often underinvested in by incumbents, even when these incumbents played a major role in their invention. Henderson and Clark, 1990 discuss examples of this phenomenon, such as Xerox’s failure to exploit their technology and transition from larger to smaller copiers:
Xerox, the pioneer of plain-paper copiers, was confronted in the mid-1970s with competitors offering copiers that were much smaller and more reliable than the traditional product. The new products required little new scientific or engineering knowledge, but despite the fact that Xerox had invented the core technologies and had enormous experience in the industry, it took the company almost eight years of missteps and false starts to introduce a competitive product into the market. In that time Xerox lost half of its market share and suffered serious financial problems
and RCA’s failure to embrace the small transistorized radio during the 1950s:
In the mid-1950s engineers at RCA’s corporate research and development center developed a prototype of a portable, transistorized radio receiver. The new product used technology in which RCA was accomplished (transistors, radio circuits, speakers, tuning devices), but RCA saw little reason to pursue such an apparently inferior technology. In contrast, Sony, a small, relatively new company, used the small transistorized radio to gain entry into the US market. Even after Sony’s success was apparent, RCA remained a follower in the market as Sony introduced successive models with improved sound quality and FM capability. The irony of the situation was not lost on the R&D engineers: for many years Sony’s radios were produced with technology licensed from RCA, yet RCA had great difficulty matching Sony’s product in the marketplace
A few explanations of this “Innovator’s curse” are given in the literature:
- Christensen (1997) suggests this is due to, among other things:
  - Incumbents focus on innovations that address existing customer needs rather than serving small markets. Customer bases usually ask for incremental improvements rather than radical innovations.
  - Disruptive products are simpler and cheaper; they generally promise lower margins, not greater profits
  - Incumbents’ most important customers usually don’t want radically new technologies, as they can’t immediately use these
- Reinganum (1983) shows that under conditions of uncertainty, incumbent monopolists will rationally invest less in innovation than entrants will, for fear of cannibalizing the stream of rents from their existing products
- Leonard-Barton (1992) suggests that the same competencies that have driven incumbents’ commercial success may produce ‘competency traps’ (engrained habits, procedures, equipment or expertise that make change difficult); see also Henderson, 2006
- Henderson, 1993 highlights that entrants have greater strategic incentives to invest in radical innovation, and incumbents fall prey to inertia and complacency
After skimming a few papers on this, I’m inclined to draw an analogue here for AI: Google produced the Transformer; labs at Google, Meta, and Microsoft have long been key players in AI research, and yet, the creation of explicitly disruptive LLM products that aim to do much more than existing technologies has been led mostly by relative newcomers (such as OpenAI, Anthropic, and Cohere for LLMs and StabilityAI for generative image models).
The same literature also suggests how to avoid the “innovator’s curse”, such as through establishing independent sub-organizations focused on disruptive innovations (see Christensen, 1997 and Christensen, 2003), which is clearly what companies like Google have done, as its AI labs have a large degree of independence. And yet this seems not to have been sufficient to establish the dominance of these firms when it comes to the frontiers of LLMs and the like.

Predicting GPU performance

Marius Hobbhahn and Tamay

Dec 14, 2022, 4:27 PM

60 points

26 comments 1 min read LW link

(epochai.org)

Revisiting algorithmic progress

Tamay and Ege Erdil

Dec 13, 2022, 1:39 AM

95 points

15 comments 2 min read LW link 1 review

(arxiv.org)

Tamay Nov 14, 2022, 6:49 PM
8 points
5
in reply to: Dave Orr’s comment on: Will we run out of ML data? Evidence from projecting dataset size trends
If the data is low-quality and easily distinguishable from human-generated text, it should be simple to train a classifier to spot LM-generated text and exclude this from the training set. If it’s not possible to distinguish, then it should be high-enough quality so that including it is not a problem.

ETA: As people point out below, this comment was glib and glosses over some key details; I don’t endorse this take anymore.

Tamay Aug 18, 2022, 1:17 AM
30 points
23
in reply to: NunoSempere’s comment on: The longest training run
Good question. Some thoughts on why do this:
- Our results suggest we won’t be caught off-guard by highly capable models that were trained for years in secret, which seems strategically relevant for those concerned with risks
- We looked at whether there was any ‘alpha’ in these results by investigating the training durations of ML training runs, and found that models are typically trained for durations that aren’t far off from what our analysis suggests might be optimal (see a snapshot of the data here)
- It independently seems highly likely that large training runs would already be optimized in this dimension, which further suggests that this has little to no action-relevance for advancing the frontier

The longest training run

Jsevillamol, Tamay, Owen D and anson.ho

Aug 17, 2022, 5:18 PM

71 points

12 comments 9 min read LW link

(epochai.org)

Trends in GPU price-performance

Marius Hobbhahn and Tamay

Jul 1, 2022, 3:51 PM

85 points

13 comments 1 min read LW link 1 review

(epochai.org)

Announcing Epoch: A research organization investigating the road to Transformative AI

Jsevillamol, Pablo Villalobos, Tamay, lennart, Marius Hobbhahn and anson.ho

Jun 27, 2022, 1:55 PM

97 points

2 comments 2 min read LW link

(epochai.org)

Tamay May 15, 2022, 9:19 PM
1 point
in reply to: alyssavance’s comment on: Is AI Progress Impossible To Predict?
I’m not sure what you mean; I’m not looking at log-odds. Maybe the correlation is an artefact from noise being amplified in log-space (I’m not sure), but it’s not obvious to me that this isn’t the correct way to analyse the data.

Tamay May 15, 2022, 8:30 PM
25 points
in reply to: alyssavance’s comment on: Is AI Progress Impossible To Predict?
Thanks! At least for Gopher, if you look at correlations between reductions in log-error (which I think the scaling laws literature suggests would be the more natural framing) you find a tighter relationship, particularly when looking at the relatively smaller models.

Tamay May 15, 2022, 7:17 PM
1 point
in reply to: alyssavance’s comment on: Is AI Progress Impossible To Predict?
Thanks, though I was hoping for something like a Google Sheet containing the data.

Tamay May 15, 2022, 6:49 PM
10 points
on: Is AI Progress Impossible To Predict?
This is super interesting. Are you able to share the underlying data?
