Two scenarios:
I take a vision or language model which was cutting edge in 2000, and run it with a similar amount of compute/data to what’s typically used today.
I take a modern vision or language model, calculate how much money it costs to train, estimate the amount of compute I could have bought for that much money in 2000, and then train it with that much compute.
In both cases, assume that the number of parameters is scaled to available compute as needed (if possible), and we generally adjust the code to reflect scalability requirements (while keeping the algorithm itself the same).
Which of the two would perform better?
CLARIFICATION: my goal here is to compare the relative importance of insights vs compute. “More compute is actually really important” is itself an insight, which is why the modern-algorithm scenario talks about compute cost, rather than amount of compute actually used in 2000. Likewise, for the 2000-algorithm scenario, it’s important that the model only leverage insights which were already known in 2000.
This recent post by OpenAI tries to shed some light on this question: https://openai.com/blog/ai-and-efficiency/
LW linkpost here.
Until 2017, the best-performing language models were LSTMs, which have been around since 1997. However, late-era LSTMs were distinguished from early LSTMs by experiments with attention and a few other mechanisms, though it's unclear to me how much this boosted their performance.
The paper that unseated LSTMs for language models reported a gain of 2.0 BLEU (on a scale from 0 to 100) from switching to the new architecture, and this is likely an underestimate of the gain from switching to Transformers, given that the old state-of-the-art models had been tweaked very carefully.
My guess is that the 2000 model using 2020 compute would beat the 2020 model using 2000 compute easily, though I would love to see someone do a deeper dive into this question.
I think they would both suck honestly. Many things have changed in 20 years. Datasets, metrics, and architectures have all changed significantly.
I think the relationship between algorithms and compute looks something like this.
For instance, look at language models. In 2000, LSTMs had been introduced only three years prior, and people mainly used n-gram Markov models for language modelling. N-grams don't really scale, and training a Transformer with only as much compute as you need to build an n-gram model would probably not work at all. In fact, I don't think you even really "train" an n-gram model in the usual sense; it's mostly just counting.
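To make that last point concrete, here is a minimal sketch of what "building" a bigram language model amounts to: counting co-occurrences and normalizing, with no gradient-based training loop at all. (The corpus and names below are made up purely for illustration.)

```python
from collections import defaultdict

def build_bigram_model(corpus):
    """'Training' a bigram model is just counting pairs and normalizing."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    # Turn raw counts into conditional probabilities P(curr | prev).
    model = {}
    for prev, next_counts in counts.items():
        total = sum(next_counts.values())
        model[prev] = {w: c / total for w, c in next_counts.items()}
    return model

# Toy corpus purely for illustration.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
model = build_bigram_model(corpus)
print(model["the"])  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
```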
The same goes for computer vision. SVMs using the kernel trick have terrible scaling properties (roughly O(N^3) in the number of datapoints), but until compute increased, they worked better. See the last slide here.
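To make the scaling concrete, here's a rough numpy sketch (RBF kernel, made-up sizes) of the bottleneck in naive kernel methods: the Gram matrix alone is N x N, and the dense solve at the heart of training costs on the order of N^3 operations, so doubling the dataset multiplies the work by roughly eight. (Actual SVM training solves a constrained QP rather than a plain linear system, but the cubic Gram-matrix bottleneck is the same.)

```python
import numpy as np

def rbf_gram_matrix(X, gamma=1.0):
    """N x N RBF kernel (Gram) matrix: O(N^2) memory before training even starts."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

for n in [1_000, 2_000, 4_000]:
    X = np.random.randn(n, 10)  # toy data with 10 features
    K = rbf_gram_matrix(X)
    # A dense N x N solve (the core linear-algebra step behind naive kernel
    # methods) is ~N^3 flops: each doubling of N costs ~8x more compute.
    alpha = np.linalg.solve(K + 1e-3 * np.eye(n), np.random.randn(n))
    print(f"N={n}: Gram matrix uses {K.nbytes / 1e6:.0f} MB")
```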
You often hear the complaint that the algorithms we use were invented 50 years ago, and many NN techniques fall in and out of fashion.
I think this is all because of the interactions between algorithms and compute/data. The best algorithm for the job changes as a function of compute, so as compute grows, new methods that previously weren’t competitive suddenly start to outperform older methods.
I think this is a general trend in much of CS. Look at matrix multiplication. The naive algorithm has a small constant overhead but O(N^3) scaling. You can use group theory (or Strassen-style recursion) to come up with algorithms that scale better but have a larger overhead. As compute grows, the best matmul algorithm changes.
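Strassen's algorithm is the classic illustration of this trade-off (the group-theoretic constructions push the exponent further but are not practical to sketch here): it does roughly N^2.81 scalar multiplications instead of N^3, at the cost of many extra additions and recursion bookkeeping, so below some crossover size the naive method is faster. A toy sketch, assuming power-of-two matrix sizes and an illustrative cutoff:

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen matrix multiply: ~N^2.81 scalar multiplications, but a larger
    constant overhead from the extra additions and recursion bookkeeping."""
    n = A.shape[0]
    if n <= cutoff:
        # Below the crossover point, the plain cubic algorithm wins,
        # so fall back to it (numpy's @ here for brevity).
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive multiplications instead of the naive eight.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty((n, n))
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

# Sanity check on a power-of-two size (this sketch only handles n = 2^k).
A, B = np.random.randn(256, 256), np.random.randn(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```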
matmul?
Matrix Multiplication
Weak evidence for compute: apparently the original TD-Gammon code from 1992 performed quite well when run with a modern amount of compute (source).
My view: although I think it is a neat thought experiment, my intuition is that separating compute from algorithms is a false dichotomy. I think so because narrowing the path dependence of a domain that needs multiple ingredients to evolve optimally down to an "either/or" choice usually leads to deadlocks that can be paradoxical, like the one above (not all deadlocks have to remain paradoxical; pre-emption/non-blocking synchronization is a way out).
My answer: not much difference, because a twenty-year timescale doesn't seem very significant to me, and because there has been neither a fundamental revolution in the semiconductor/compute-manufacturing industry that benefits us in ways other than cost, nor any revolutionary algorithm that couldn't be run on old hardware scaled up to today's standards. (But in complex systems, which ML is, interactions matter more than anything else, so I might be way off here.)