The research community is very far from being efficient.
One of my own fields of research is Markov chain Monte Carlo methods, and their applications in computations for Bayesian models. Markov chain Monte Carlo (MCMC) was invented in the early 1950s, for use in statistical physics. It was not used by Bayesian statisticians until around 1990. There was no reason that it could not have been used before then—the methods of the 1950s could have been directly applied to many Bayesian inference problems.
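To make concrete what “directly applied” means, here is a minimal sketch (purely for illustration, not code from any of the work mentioned here) of the 1950s random-walk Metropolis algorithm sampling from a Bayesian posterior. The toy model, a normal likelihood with a normal prior on its mean, is a made-up example.

```python
# Minimal random-walk Metropolis sampler for a toy Bayesian posterior.
# Illustrative sketch only: the model (normal likelihood, normal prior on
# the mean) and all settings are made up, not from any paper cited here.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=50)  # simulated observations

def log_posterior(mu):
    """Log posterior density of mu, up to an additive constant."""
    log_prior = -0.5 * (mu / 10.0) ** 2          # N(0, 10^2) prior
    log_lik = -0.5 * np.sum((data - mu) ** 2)    # N(mu, 1) likelihood
    return log_prior + log_lik

def metropolis(log_post, init, n_iter=5000, step=0.3):
    """Propose symmetric random-walk jumps; accept each with probability
    min(1, posterior ratio)."""
    samples = np.empty(n_iter)
    current, current_lp = init, log_post(init)
    for i in range(n_iter):
        proposal = current + step * rng.normal()
        proposal_lp = log_post(proposal)
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples[i] = current
    return samples

draws = metropolis(log_posterior, init=0.0)
print("posterior mean of mu:", draws[1000:].mean())  # discard burn-in
```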
In 1970, a paper generalizing the most common MCMC algorithm (the “Metropolis” method) was published in Biometrika, one of the top statistics journals. This didn’t prompt anyone to start using it for Bayesian inference.
In the early 1980s, MCMC was used by some engineers and computer scientists (eg, by Geoffrey Hinton for maximum likelihood inference for log-linear models with latent variables, also known as “Boltzmann machines”). This also didn’t prompt anyone to start using it for Bayesian inference.
After a form of MCMC started being used by Bayesian statisticians around 1990, it took many years for the literature on MCMC methods used by physicists to actually be used by statisticians. This despite the fact that in 1993 I wrote a review paper describing just about all these methods in terms readily accessible to statisticians.
In 1992, I started using the Hamiltonian Monte Carlo method (aka, hybrid Monte Carlo, or HMC) for Bayesian inference for neural network models. This method was invented by physicists in 1987. (It could have been invented in the 1950s, but just wasn’t.) I demonstrated that HMC was often hundreds or thousands of times faster than simpler methods, gave talks on this at conferences, and wrote my thesis (later book) on Bayesian learning in which this was a major theme. It wasn’t much used by other statisticians until after I wrote another review paper in 2010, which for some reason led to it catching on. It is now widely used in packages such as Stan.
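For readers who haven’t seen it, here is a minimal sketch of one HMC update for the same sort of toy posterior, assuming its gradient is available. This is an illustration only (not code from my thesis or from Stan): the leapfrog step size and trajectory length are fixed by hand, whereas practical implementations tune them. The point is that using gradient information lets the sampler take long, directed moves rather than a random walk.

```python
# Minimal Hamiltonian Monte Carlo update for a toy 1-D posterior.
# Illustrative sketch only; step size and trajectory length fixed by hand.
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=50)

def log_post(mu):
    # N(0, 10^2) prior on mu, N(mu, 1) likelihood
    return -0.5 * (mu / 10.0) ** 2 - 0.5 * np.sum((data - mu) ** 2)

def grad_log_post(mu):
    return -mu / 100.0 + np.sum(data - mu)

def hmc_step(mu, eps=0.01, n_leapfrog=20):
    """One HMC update: draw a momentum, simulate Hamiltonian dynamics with
    the leapfrog integrator, then accept/reject to correct for the
    discretization error."""
    p = rng.normal()
    mu_new, p_new = mu, p
    p_new += 0.5 * eps * grad_log_post(mu_new)       # half momentum step
    for _ in range(n_leapfrog - 1):
        mu_new += eps * p_new                        # full position step
        p_new += eps * grad_log_post(mu_new)         # full momentum step
    mu_new += eps * p_new
    p_new += 0.5 * eps * grad_log_post(mu_new)       # final half momentum step
    # Accept/reject based on the change in total energy (negative log
    # posterior plus kinetic energy of the momentum variable)
    current_H = -log_post(mu) + 0.5 * p ** 2
    proposed_H = -log_post(mu_new) + 0.5 * p_new ** 2
    if np.log(rng.uniform()) < current_H - proposed_H:
        return mu_new
    return mu

mu, draws = 0.0, []
for _ in range(2000):
    mu = hmc_step(mu)
    draws.append(mu)
print("posterior mean of mu:", np.mean(draws[500:]))
```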
Another of my research areas is error-correcting codes. In 1948, Claude Shannon proved his noisy coding theorem, establishing the theoretical (but not practical) limits of error correction. In 1963, Robert Gallager invented Low Density Parity Check (LDPC) codes. For many years after this, standard textbooks stated that the theoretical limit that Shannon proved to be possible was unlikely to ever be closely approached by codes with practical encoding and decoding algorithms. In 1996, David MacKay and I showed that a slight variation on Gallager’s LDPC codes comes very close to achieving the Shannon limit on performance. (A few years before then, “Turbo codes” had achieved similar performance.) These and related codes are now very widely used.
These are examples of good ideas that took far longer to be widely used than one would expect in an efficient research community. There are also many bad ideas that persist for far longer than they should.
I think both problems are at least partially the result of perverse incentives of researchers.
Lots of research is very incremental—what you describe as “...there was instantly an explosion of activity as researchers raced to apply it to all the important NLP problems and be the first to publish”. Sometimes, of course, this explosion of activity is useful. But often it is not—the idea isn’t actually very good, it’s just the sort of idea on which it is easy to write more and more papers, often precisely because it isn’t very good. And sometimes this explosion of activity doesn’t happen when it would have been useful, because the activity required is not the sort that leads to easy papers—eg, the needed activity is to apply the idea to practical problems, but that isn’t the “novel” research that leads to tenure, or the idea requires learning some new tools and that’s too much trouble, or the way forward is by messy empirical work that doesn’t look as impressive as proving theorems (even if the theorems are actually pointless), or extending an idea that someone else came up with doesn’t seem like as good a career move as developing your own ideas (even when your ideas aren’t as good).
The easy rewards from incremental research may mean that researchers don’t spend much, or any, time on thinking about actual original ideas. Getting such ideas may require reading extensively in diverse fields, and getting one’s hands dirty with the low-level work that is necessary to develop real intuition about how things work, and what is important. Academic researchers can’t easily find time for this, and may be forced into either doing incremental research, or becoming research managers rather than actual researchers.
In my case, the best research environment was when I was a PhD student (with Geoffrey Hinton). But I’m not sure things are still as good for PhD students. The level of competition for short-term rewards may be higher than back in the 1990s.
Yeah, I agree that the EMH holds true more for incremental research than for truly groundbreaking ideas. I’m not too familiar with MCMC or Bayesian inference, so correct me if I’m wrong, but I’m guessing these advancements required combining ideas that nobody expected would work? The deep learning revolution could probably have happened sooner (in the sense that all the prerequisite tools existed), but few people before 2010 expected neural networks to work, so the inefficiencies there remained undiscovered.
At the same time, I wouldn’t denigrate research that you might view as “incremental”, because most research is of that nature. By this I mean, for every paper published in the ACL / EMNLP conferences, if the authors hadn’t published it, someone else would almost certainly have published something very similar within 1-2 years. Exceptions to this are few and far between—science advances via an accumulation of many small contributions.
I think the problem with MCMC is that it’s an incredibly dirty thing from the perspective of a mathematician. It’s far too practically useful, as opposed to being about mathematical theorems. MCMC is about finding an efficient way to do calculations, and doing calculations is low status for mathematicians. It has ugly randomness in it.
I was personally taught MCMC when studying bioinformatics, and I was a bit surprised when talking with a friend, a math PhD, who had a problem where MCMC would have worked very well for a subproblem, but it was completely off his radar.
MCMC was something that came out of computer science and not out of the statistics community. Most people in statistics cared about statistical significance. The math community already looks down on the statistics community, and MCMC seems even worse from that perspective.
My statistics prof said that in principle bioinformatics could have been a subfield of statistics, but its way of doing things was initially rejected by the statistics community, so bioinformatics had to become its own field (and it’s the field where MCMC was used a lot, because you actually need it for the problems that bioinformatics cares about).
Certainly some incremental research is very useful. But much of it isn’t. I’m not familiar with the ACL and EMNLP conferences, but for ML and statistics, there are large numbers of papers that don’t really contribute much (and these aren’t failed attempts at breakthroughs). You can see that this must be true from the sheer volume of papers now—there can’t possibly be that many actual advances.
For LDPC codes, it certainly was true that for years people didn’t realize their potential. But there wasn’t any good reason not to investigate—it’s sort of like nobody pointing a telescope at Saturn because Venus turned out to be rather featureless, and why would Saturn be different? There was a bit of tunnel vision, with an unjustified belief that one couldn’t really expect much more than what the codes being investigated delivered—though one could of course publish lots of papers on a new variation in sequential decoding of convolutional codes. (There was good evidence that this would never lead to the Shannon limit—but that of course must surely be unobtainable...)
Regarding MCMC and Bayesian inference, I think there was just nobody making the connection—nobody who actually knew what the methods from physics could do, and also knew what the computational obstacles for Bayesian inference were. I don’t think anyone thought of applying the Metropolis algorithm to Bayesian inference and then said, “but surely that wouldn’t work...”. It’s obviously worth a try.