Researcher incentives cause smoother progress on benchmarks
(Epistemic status: likely. That said, this post isn’t thorough; I wanted to write quickly.)
Let’s look at the state of the art in ImageNet.[1] The curve looks pretty smooth, especially over the last 7 years. However, there don’t seem to be that many advances that actually improve current results.
Here’s a list which should include most of the important factors (a toy code sketch combining several of these follows the list):
Batch norm
Better LR schedules
Residual connections
MBConv
NAS
ReLU
Attention
Augmentation schemes
Some other tweaks to the training process
Moar compute
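To make a few of these concrete, here’s a minimal PyTorch sketch (my own illustration, not any particular published model) that combines batch norm, ReLU, a residual connection, and a cosine LR schedule:

```python
# Toy residual block: illustrative only, not a specific SOTA architecture.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)    # batch norm
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)      # ReLU nonlinearity

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)              # residual (skip) connection

block = ResidualBlock(64)
x = torch.randn(8, 64, 32, 32)
print(block(x).shape)  # torch.Size([8, 64, 32, 32])

# A "better LR schedule": cosine annealing over 90 epochs.
opt = torch.optim.SGD(block.parameters(), lr=0.1, momentum=0.9)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=90)
```

None of these pieces is exotic on its own; the point is that the list of genuinely important ingredients is short.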
Part of the smoothness comes from compute scaling, but I think another important factor is the control system of researchers trying to achieve SOTA (compare to ‘does reality drive straight lines on graphs, or do straight lines on graphs drive reality?’).
For instance, consider the batch norm paper. Despite batch norm being a relatively large advance (removing it would greatly harm performance with current models, even after retuning), the improvement in top-5 SOTA error from this paper is only from 4.94% to 4.82%. This is likely because the researchers only bothered to improve performance until the SOTA threshold was reached. When surpassing SOTA by a large amount is easy, this situation likely differs, but that seems uncommon (it does seem to have been the case for ResNet).
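As a toy illustration of this ‘stop once you beat SOTA’ dynamic (my own oversimplified model, not something from the post or the paper): even if the ideas behind individual papers vary a lot in how much they could improve the benchmark, a norm of ‘tune only until you barely exceed the current SOTA’ yields a published curve made of small, similar-sized steps.

```python
# Toy model: papers stop tuning once they barely beat SOTA, so the published
# curve advances in small, similar-sized steps even though the underlying
# ideas vary a lot in potential. Purely illustrative numbers.
import random

random.seed(0)
sota_error = 10.0       # current top-5 error (%), hypothetical starting point
stop_margin = 0.2       # researchers stop tuning once SOTA is beaten by this much

steps = []
for paper in range(20):
    full_potential = random.uniform(0.1, 3.0)     # how much this idea *could* cut error
    published_gain = min(full_potential, stop_margin)
    steps.append(published_gain)
    sota_error -= published_gain

print("published step sizes:", [round(s, 2) for s in steps])
print("final SOTA error:", round(sota_error, 2), "%")
```

Under this model, ideas with large potential (like batch norm) show up as roughly the same small benchmark bump as minor tweaks, which is one way a smooth SOTA curve can hide lumpy underlying progress.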
This presents a reason to be wary of generalizing from smooth progress on benchmarks to smooth AI progress in future high-investment scenarios where research incentives could differ greatly.
(I’m also planning on writing a post on gears-level models of where smooth AI progress could come from, but I wanted to write this first as a standalone post. Edit: here is the post.)
[1] Yes, ImageNet SOTA is mostly meaningless garbage. This post is actually trying to increase the rate at which the fully automatic nail gun fires into the coffin containing that particular dead horse.
You’re also looking at percentages. Once you hit >90%, does it make sense to argue about, say, Florence getting ‘only’ >99% top-5 being disappointing? (Plus, surely the label error alone is more than zero.) The real progress is elsewhere: Florence is getting 97% zero-shot. Dang.
I think this basically matches my take. In particular, I agree that researcher incentives (including things like ‘getting SOTA on a benchmark’) add more points to the curve of continuous progress.
However, it doesn’t seem like this sort of thing prevents discontinuous progress.
In particular, when considering “what happens just before AGI”, it’s not clear that this makes the final steps less likely to be discontinuous.
With that in mind, I don’t think this ‘filling in of the progress curve’ does much to change the possibility of a discontinuity right before AGI.
Another factor I’m considering here is that benchmarks quickly saturate in their utility. (This is basically a direct result of Goodharting.)
In the ImageNet case, the early progress tracked generalized deep neural network progress for a few years, but now most of that progress is happening elsewhere, and the benchmark has ceased to be a good metric.
In the development of technologies, the first few key innovations tend to be more discontinuous than innovations made once the technology is already mature. For example, the steps required to make the first plane that flies more than a few kilometers were discontinuous, whereas these days, year-to-year improvements to airliners are quite modest.
As I understand it, the basic argument for discontinuities around AGI is that AGI will be “at the beginning” of its development curve, as it will be the result of a few key innovations, as opposed to a side effect of modest progress on an already mature technology. In other words, the metaphorical “first key steps” will happen right before AGI is developed, as opposed to in the distant past, such as when we first developed backpropagation or alpha-beta pruning.
The basic case against discontinuities is that we have some reason to think that AI is already maturing as a technology. If, for example, we could simply scale a deep learning model to produce AGI, then the main reason to expect a discontinuity would be if there is some other weird discontinuity elsewhere, such as big tech corporations suddenly deciding to dump a bunch of money into scaling neural networks (but why wouldn’t they do that earlier?).
I’m not sure I understood Ryan Greenblatt’s argument, and your point here, but I don’t see a huge difference between the type of incentives that produced continuous progress on these benchmarks, and the incentives that will produce AGI. Generally, I expect before AGI arrives, a ton of people will be trying really hard to make even tiny improvements on an already somewhat-mature tech, on whatever general measure they’re trying to target.
This discontinuity could lie in the space of AI discoveries. The discovery space is not guaranteed to be efficiently explored: there could be simple, high-impact discoveries which only occur later on. I’m not sure how much credence I put in this idea. Empirically it does seem like the discovery space is explored efficiently in most fields with high investment, but generalizing this to AI seems non-trivial. Possible exceptions include relativity in physics.
Edit: I’m using the term ‘efficiency’ somewhat loosely here. There could be discoveries which are very difficult to think of but which are considerably simpler than current approaches. I’m referring to the failure to find these discoveries as ‘inefficiency’, but there isn’t a concrete action which can/should be taken to resolve this.
Rob Bensinger examines this idea in more detail in this discussion.