N = #params, D = #data
Training compute = const. * N * D
Forward pass cost = c * N, and each forward pass reveals R bits about the weights; assume R = Ω(1) on average
Now, thinking purely information-theoretically:
Model stealing compute = (c * N per forward pass) * (16 * N / R forward passes to recover the fp16 weights) ~ const. * c * N^2
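As a sanity check on the arithmetic, here is a minimal sketch of this estimate in Python; the 16 bits per weight comes from fp16 above, while c ≈ 2 FLOPs per parameter per token and R = 1 bit per token are illustrative assumptions, not numbers from the argument itself.

```python
def stealing_flops(n_params, bits_per_weight=16, r_bits_per_token=1.0, c=2.0):
    """Rough FLOPs to extract the fp16 weights, purely information-theoretically.

    Each generated token costs ~c * N FLOPs and leaks ~R bits about the weights,
    so recovering bits_per_weight * N bits takes bits_per_weight * N / R forward passes.
    """
    queries = bits_per_weight * n_params / r_bits_per_token
    return queries * c * n_params  # ~ const. * c * N^2 / R

# e.g. a 70B-parameter model at R = 1 bit/token: ~1.6e23 FLOPs
print(f"{stealing_flops(70e9):.1e}")
```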
If training is compute-optimal and α = β in the Chinchilla scaling law (so D ~ N and training compute ~ const. * N^2):
Model stealing compute ~ Training compute
For significantly overtrained models:
Model stealing << Training compute
Typically:
Total inference compute ~ Training compute
=> Model stealing << Total inference compute
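To make the "~" and "<<" above concrete, here is a hedged numerical comparison, using the standard 6ND approximation for training FLOPs and D ≈ 20N for the Chinchilla-optimal case; the 200N overtraining ratio and R = 1 bit/token are illustrative assumptions, not figures from the argument.

```python
# Illustrative comparison; all constants are assumptions except fp16 = 16 bits.
N = 70e9  # parameters


def train_flops(d_tokens):
    return 6 * N * d_tokens  # standard 6*N*D training-FLOPs approximation


steal = (16 * N / 1.0) * (2 * N)  # (16N bits / R = 1 bit per token) * ~2N FLOPs per token

print(f"stealing:           {steal:.1e} FLOPs")
print(f"Chinchilla-optimal: {train_flops(20 * N):.1e} FLOPs")   # ~4x stealing: same order
print(f"overtrained:        {train_flops(200 * N):.1e} FLOPs")  # ~40x stealing
```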
Caveats:
- A prior on the weights reduces the stealing compute; the same holds if you only want to recover partial information about the model (e.g. enough to create an equally capable one)
- Of course, if the model reveals much less than 1 bit per forward pass (R << 1), then model stealing compute becomes very large
I think the labs might well be rational in focusing on this sort of “handheld automation”, just to enable their researchers to code experiments faster and in smaller teams.
My mental model of AI R&D is that it is bottlenecked by roughly three things: compute, engineering time, and the “dark matter” of research taste and feedback loops on messy research results. I can certainly imagine a model of lab productivity in which the best way to accelerate, for the entirety of 2025, is improving handheld automation. Say the core paradigm is fixed; within that paradigm, the research team has more promising ideas than time to implement and try out in smaller-scale experiments, and they really do not want to hire more people.
If you consider the AI lab as the fundamental unit that wants to increase its velocity, and that works on things which make its models faster, it’s plausible that the labs are aware of how bad model performance is on research taste and are still not making a mistake by ignoring your “dark matter” right now. They will work on it once they are faster.