You can achieve infinitely (literally) faster-than-AlexNet training time if you just take AlexNet's weights.
You can also achieve much faster training if you rely on weight transfer and/or hyperparameter optimization informed by the behavior of an already-trained AlexNet, or, mind you, of some other image-classification model derived from it.
Once a given task is "solved", it becomes trivial to produce models that can train on that task exponentially faster, since you're already working down from a solution.
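For concreteness, a minimal sketch of what "just take the weights" looks like, assuming PyTorch/torchvision (the framework, the frozen layers and the 10-class task are my own placeholders, not anything specified here):

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from AlexNet with its ImageNet-trained weights instead of training from scratch.
model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)

# Freeze everything that was already learned on ImageNet.
for param in model.parameters():
    param.requires_grad = False

# Swap the final layer for a hypothetical 10-class task; only this layer gets trained.
model.classifier[6] = nn.Linear(4096, 10)
optimizer = torch.optim.SGD(model.classifier[6].parameters(), lr=1e-3, momentum=0.9)
```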
Could you clarify, you mean the primary cause of efficiency increase wasn’t algorithmic or architectural developments, but researchers just fine-tuning weight transferred models?
However, if you want to look for exponential improvement you can always find it, and if you want to look for logarithmic improvement you always will.
Are you saying that the evidence for exponential algorithmic efficiency, not just in image processing, is entirely cherry picked?
As for training text models "x times faster", go look at the "how do we actually benchmark text models" section of the academia/internet flamewar library.
I googled that and there were no results, and I couldn't find an "academia/internet flamewar library" either.
Look, I don't know enough about ML yet to respond intelligently to your points. Could someone else more knowledgeable than me weigh in here, please?
> Could you clarify, you mean the primary cause of efficiency increase wasn't algorithmic or architectural developments, but researchers just fine-tuning weight transferred models?
Algorithm and architecture choices are fundamentally hyperparameters, so when I say "fine-tuning hyperparameters" (i.e. the ones that aren't tuned by the learning process itself), those are included.
Granted, there are jumps like the one from, e.g., LSTMs to attention, which you can't really think of as "hyperparameter" tuning, since it's basically a shift in mentality in many ways.
But in computer vision, at least to my knowledge, most of the improvements boil down to tuning the optimization methods. E.g. here's an analysis of the subject (https://www.fast.ai/2018/07/02/adam-weight-decay/) describing a now-common method, mainly in the context of CV.
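To make "tuning optimization methods" concrete, here is a minimal sketch in plain PyTorch of the kind of tweak that article discusses (decoupled weight decay, i.e. AdamW); the model and the numbers are placeholders:

```python
import torch

model = torch.nn.Linear(512, 10)  # stand-in for any network

# Classic Adam, with the L2 penalty folded into the gradients:
adam = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-2)

# AdamW: the same update, but with weight decay decoupled from the gradients
# and applied directly to the weights at each step.
adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
```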
However, the problem is that the optimization is happening around the exact same datasets AlexNet was built around. Even if you don't transfer the weights, "knowing" a very good solution helps you fine-tune much more quickly on a problem à la ImageNet, or CIFAR, or MNIST, or various other datasets that fall into the category of "classifying things which are obviously distinct to humans from square images of roughly 50 to 255px width/height".
But that domain is fairly niche. If we look at, e.g., almost any time-series prediction dataset, not much progress has been made since the mid 20s. And maybe that's because no more progress can be made, but the problem is that until we know the limits of how "solvable" a problem is, the problem is hard. Once we know how to solve a problem in one way, achieving similar results, but faster, is a question of human ingenuity we've been good at since at least the industrial revolution.
I mean, you could have built an AlexNet-specific circuit, not now, but back when it was invented, and gotten 100x or 1000x performance, but nobody does that, because our focus does not (or, at least, should not) fall on optimizing very specific problems. Rather, the important thing is finding techniques that can generalize.
**Note: Not a hardware engineer, not sure how easily one can come up with auto diff circuits, might be harder than I’d expect for that specific case, just trying to illustrate the general point**
> Are you saying that the evidence for exponential algorithmic efficiency, not just in image processing, is entirely cherry picked?

Ahm, yes. See https://paperswithcode.com/ if you want a simple overview of how speed and accuracy have evolved on a broader range of problems. And even those problems are cherry-picked, in that they are very specific competition/research problems that hundreds of people are working on.
Some examples:
Paper with good arguments that impressive results achieved by transformer architectures are just test data contamination: https://arxiv.org/pdf/1907.07355.pdf
A simpler article: https://hackingsemantics.xyz/2019/leaderboards/ (which makes the same point as the above paper)
> I googled that and there were no results, and I couldn't find an "academia/internet flamewar library" either.

Then there's the problem of how one actually "evaluates" how good an NLP model is.

As in, think of the problem for a second. I ask you:

"How good is this translation from English to French, on a scale from 1 to 10?"

For anything beyond simple phrases that question is very hard to answer, almost impossible. And even if it isn't, i.e. even if we can use the aggregate perceptions of many humans to determine "truth" in that regard, you can't capture that in a simple accuracy function that evaluates the model.
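To make the "simple accuracy function" point concrete, here is a toy sketch using BLEU, the standard automatic metric for machine translation; the sentences and the NLTK usage are my own illustration, not anything from the linked papers:

```python
# BLEU just counts n-gram overlap with a reference translation, so a good
# paraphrase can score worse than a fluent sentence with the wrong meaning.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the meeting was postponed until next week".split()
paraphrase = "they pushed the meeting back to next week".split()   # fine translation, different wording
wrong_copy = "the meeting was postponed until next month".split()  # near-verbatim, wrong meaning

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))  # low: little n-gram overlap
print(sentence_bleu([reference], wrong_copy, smoothing_function=smooth))  # high: large n-gram overlap
```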
Granted, "flamewar" is probably an exaggeration on my part; I mean more the passive-aggressive, snarky questions (with a genuine interest in improving things behind them) posted on forums à la: https://www.reddit.com/r/LanguageTechnology/comments/bcehbv/why_do_all_the_new_nlp_models_preform_poor_on_the/
More on the idea of how NLP models are overfitting on very poor accuracy functions that won’t allow them to progress much further:
https://arxiv.org/pdf/1902.01007.pdf
And a more recent one (2020) with similar ideas that proposes solutions: https://www.aclweb.org/anthology/2020.acl-main.408.pdf
If you want to generalize this idea outside of NLP, see, for example, this: https://arxiv.org/pdf/1803.05252.pdf
And if you want anecdotes from another field I'm more familiar with: the whole "field" of neural architecture search (building algorithms to build algorithms) has arguably overfit on specific problems for the last 5 years, to the point that the state-of-the-art solutions are:

- basically no better than random search, and often worse: https://arxiv.org/pdf/1902.08142.pdf (a sketch of that random-search baseline follows below)
- often unreliable and hard to replicate: https://arxiv.org/pdf/1902.07638.pdf
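For reference, the random-search baseline such comparisons use really is about this simple; the search space and the `train_and_evaluate` function below are hypothetical placeholders:

```python
import random

# Made-up discrete search space over a few architecture choices.
SEARCH_SPACE = {
    "depth": [4, 8, 12, 20],
    "width": [64, 128, 256],
    "kernel_size": [3, 5, 7],
    "activation": ["relu", "gelu", "swish"],
}

def sample_architecture():
    """Pick one value per dimension of the search space, uniformly at random."""
    return {name: random.choice(options) for name, options in SEARCH_SPACE.items()}

def random_search(train_and_evaluate, budget=50):
    """Evaluate `budget` random candidates and keep whichever validates best."""
    best_arch, best_score = None, float("-inf")
    for _ in range(budget):
        arch = sample_architecture()
        score = train_and_evaluate(arch)  # hypothetical: train briefly, return validation accuracy
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score
```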
*****
But honestly, these are probably not the best references, you know why?
Because I don't bookmark negative findings, and neither does anyone else. We laugh at them and then move on with life. The field is 99% "research" that usually means spending months or years optimizing a toy problem and then writing a two-paragraph discussion section about how "this should generalize to other problems"… and then nobody bothers to replicate the original study or to work on the "generalize" part. Because where's the money in an ML researcher saying "actually, guys, the field has a lot of limitations, a lot of research directions are artificial, pun not intended, and they can't be applied to relevant problems outside of generating on-demand furry porn or some other pointless nonsense"?
But, as is the case over and over again, when people try to replicate techniques that "work" in papers under slightly different conditions, they return to baseline. Probably the prime example of this is a paper that made it into **** Nature about predicting earthquake aftershocks with neural networks; then somebody applied a linear regression to the same data instead, and we got this gem:

*One neuron is more informative than a deep neural network for aftershock pattern forecasting*

(In case the pun is not obvious, a one-neuron network is a linear regression.)
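Spelled out, with PyTorch and a made-up feature count, just to show how little model is hiding behind the phrase "one neuron":

```python
import torch
import torch.nn as nn

# One neuron: a single weight per input feature plus a bias. With a sigmoid on the
# output this is exactly logistic regression; drop the sigmoid and it is plain
# linear regression. The feature count and batch below are placeholders.
n_features = 12
one_neuron_net = nn.Sequential(
    nn.Linear(n_features, 1),
    nn.Sigmoid(),
)

x = torch.randn(8, n_features)   # dummy batch of 8 examples
print(one_neuron_net(x).shape)   # torch.Size([8, 1])
```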
And while improvements certainly exist, we have not observed exponential improvements in the real world. On the whole, we don't have much more "AI-powered" technology now than we did in the 80s.
I'm the first to argue that this is in part because of over-regulation; I've written a lot on that subject and I do agree that it's part of the issue. But another part of the issue is that not many of these advances have real-world applications, because at the end of the day all you are seeing in numbers like the ones above is generalization on a few niche problems.
Anyway, I should probably stop ranting about this subject on LW; it's head-against-wall banging.
Thank you for the excellent and extensive write up :)
I hadn't encountered your perspective before. I'll definitely go through all your links to educate myself, and put less weight on algorithmic progress being a driving force, then.
Cheers
At the end of the day, the best thing to do is to actually try and apply the advances to real-world problems.
I work on open-source stuff that anyone can use, and there are plenty of companies willing to pay six figures a year if we can do some custom development that gives them a 1-2% boost in performance. So the market is certainly there and waiting.
Even a minimal increase in accuracy can be worth millions or billions to the right people. In some industries (advertising, trading) you can even go it alone; you don't need customers.
But there are plenty of domain-specific competitions that pay out tens or hundreds of thousands for relatively small improvements. Look past Kaggle at things that are domain-specific (e.g. https://unearthed.solutions/) and you'll find plenty.
That way you'll probably get a better understanding of what happens when you take a technique that's good on paper and try to generalize it. And I don't mean this as a "you will fail"; you might well succeed, but it will probably make you see how minimal an improvement "success" actually is and how hard you must work for that improvement. So I think it's a win-win.
The problem with companies like OpenAI (and even more so with "AI experts" on LW/Alignment) is that they don't have a stake by which to measure success or failure. If waxing lyrical and picking the evidence that suits your narrative is your benchmark for how well you are doing, you can make anything from horoscopes to homeopathy sound ground-breaking.
When you measure your ideas about "what works" against the real world, that's when the story changes. After all, one shouldn't forget that ever since OpenAI was created, it has gotten its funding by optimizing for the "impress Paul Graham and Elon Musk" strategy, rather than the "create an algorithm that can do something better than a human, then sell it to humans who want that thing done better" strategy… which is an Incentives 101 kind of problem, and what makes me wary of many of their claims.
Again, I'm not trying to disparage here; I also get my funding via the "impress Paul Graham" route. I'm just saying that people in AI startups are not the best to listen to on the subject of AI progress; none of them are going to say "actually, it's kinda stagnating". Not because they are dishonest, but because the kind of people that work in and get funding for AI startups genuinely believe in that progress… otherwise they'd be doing something else. However, as has been well pointed out by many here, confirmation bias is often much more insidious and credible than outright lies. Even I fall on the side of "exponential improvement" at the end of the day, but all my incentives are working to bias me in that direction, so, thinking about it rationally, I'm likely wrong.