I think they would both suck, honestly. Many things have changed in 20 years. Datasets, metrics, and architectures have all changed significantly.
I think the relationship between algorithms and compute looks something like this.
For instance, look at language models. LSTMs had been introduced 3 years prior. People mainly used n-gram Markov models for language modelling. N-grams don't really scale, and training a transformer with only the resources you'd need for an n-gram model probably wouldn't work at all. In fact, I don't think you even really "train" an n-gram model.
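To make that concrete, here is a minimal sketch of what "training" a bigram model amounts to: counting pairs and normalising, in one pass, with no gradients. The function name and toy corpus are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_lm(tokens):
    """'Training' a bigram model is just counting adjacent pairs
    and normalising the counts into conditional probabilities."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    # P(next | prev) = count(prev, next) / count(prev, *)
    return {
        prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
        for prev, nxt in counts.items()
    }

corpus = "the cat sat on the mat the cat ate".split()
model = train_bigram_lm(corpus)
print(model["the"])  # {'cat': 0.666..., 'mat': 0.333...}
```

That whole procedure is a single counting pass over the corpus, which is why it takes a negligible fraction of the compute a transformer needs.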
The same goes for computer vision. SVMs using the kernel trick have terrible scaling properties (O(N^3) in the number of datapoints), but until compute increased, they worked better than neural nets. See the last slide here.
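For a rough sense of where that cubic cost comes from, here is a hedged sketch using kernel ridge regression as a stand-in for kernel methods generally (SVM solvers like SMO work differently, but they share the N x N Gram matrix, and a direct solve is cubic in N). All names and parameters are illustrative.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Full N x N Gram matrix: O(N^2) memory before training even starts."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, gamma=1.0, lam=1e-3):
    """Direct solve of the N x N linear system: O(N^3) in the number of points."""
    K = rbf_kernel(X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

X = np.random.rand(500, 10)     # 500 points, 10 features
y = np.random.rand(500)
alpha = kernel_ridge_fit(X, y)  # doubling N costs roughly 8x in the solve
```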
You often hear the complaint that the algorithms we use today were invented 50 years ago, and it's true that many NN techniques fall in and out of fashion.
I think this is all because of the interactions between algorithms and compute/data. The best algorithm for the job changes as a function of compute, so as compute grows, new methods that previously weren’t competitive suddenly start to outperform older methods.
I think this is a general trend in much of CS. Look at matrix multiplication. The naive algorithm has a small constant overhead but O(N^3) scaling. You can use group theory to come up with algorithms that have better asymptotic scaling but a larger constant overhead. As compute grows, the best matmul algorithm changes.
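As a concrete illustration of that trade-off, here is a sketch comparing the naive triple loop against Strassen's algorithm. Strassen isn't one of the group-theoretic constructions, but it's the simplest example of paying a bigger constant (extra additions, recursion bookkeeping) for better asymptotics. The power-of-two size restriction and the cutoff value are simplifications for the example.

```python
import numpy as np

def naive_matmul(A, B):
    """Textbook O(N^3) multiply: tiny overhead, cubic scaling."""
    n, m = A.shape
    m2, p = B.shape
    assert m == m2
    C = np.zeros((n, p))
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i, j] += A[i, k] * B[k, j]
    return C

def strassen(A, B, cutoff=64):
    """Strassen's algorithm: ~O(N^2.81) scaling, larger constant overhead.
    Assumes square matrices whose size is a power of two (a simplification)."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B  # fall back to the low-overhead method on small blocks
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

# Sanity checks: both agree with numpy's built-in matmul.
A_small, B_small = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(naive_matmul(A_small, B_small), A_small @ B_small)
A, B = np.random.rand(128, 128), np.random.rand(128, 128)
assert np.allclose(strassen(A, B), A @ B)
```

Below the cutoff the plain multiply wins because its per-operation overhead is tiny; only once the matrices (the "compute budget") get large enough does the better exponent pay for the extra bookkeeping, which is the same crossover story as with the learning algorithms above.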
matmul?
Matrix Multiplication