Lossless compression is the correct unsupervised machine learning benchmark, and not just for language models. To understand this, it helps to read the Hutter Prize FAQ on why it doesn’t use perplexity:
http://prize.hutter1.net/hfaq.htm
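To make the FAQ's point concrete, here is a minimal sketch (my own illustration, not taken from the FAQ) of how perplexity relates to compressed size. The token count, perplexity, and model size below are made-up numbers:

```python
import math

# An entropy coder driven by a model's next-token probabilities spends
# about -log2(p) bits per token, so the compressed payload is just the
# cross-entropy: log2(perplexity) bits per token. The compression
# benchmark then also charges for the bits needed to ship the
# decompressor (the model itself), which perplexity alone ignores.

tokens = 1_000_000           # hypothetical corpus size, in tokens
perplexity = 12.0            # hypothetical per-token perplexity
model_bits = 8 * 50_000_000  # hypothetical 50 MB (de)compressor

payload_bits = tokens * math.log2(perplexity)  # cross-entropy coding cost
total_bits = model_bits + payload_bits         # what a compression prize scores

print(f"payload: {payload_bits / 8e6:.1f} MB, total: {total_bits / 8e6:.1f} MB")
```

Two models with identical perplexity can thus differ enormously on the compression benchmark if one of them is vastly larger than the other.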
Although Solomonoff proved this in the 1960s, people keep arguing about it because they keep thinking they can, somehow, escape from the primary assumption of Solomonoff’s proof: computation. The natural sciences are about prediction. If you can’t make a prediction, you can’t test your model. To make a prediction in a way that can be replicated, you need to communicate a model that the receiver can then use to make an assertion about the future. The language used to communicate this model is, in its most general form, algorithmic. Once you arrive at this realization, you have adopted the primary assumption with which Solomonoff proved that lossless compression is the correct criterion for unsupervised model selection, aka the Algorithmic Information Criterion.
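As a toy illustration of that criterion (my own sketch, not Solomonoff’s construction), here is a two-part-code comparison: prefer the model whose description length plus the data’s code length under it is smallest. The 10-bit charge for stating the biased model’s parameter is an arbitrary assumption:

```python
import math

def code_length(data, p, model_bits):
    """Total bits: model description + ideal entropy code for the data."""
    ones = sum(data)
    zeros = len(data) - ones
    data_bits = -(ones * math.log2(p) + zeros * math.log2(1 - p))
    return model_bits + data_bits

data = [1] * 900 + [0] * 100  # output of a heavily biased coin

# Model 1: fair coin, assumed free to state.
# Model 2: Bernoulli(0.9), charged a (hypothetical) 10 bits for its parameter.
print("fair coin:", code_length(data, 0.5, model_bits=0))   # 1000 bits
print("p = 0.9  :", code_length(data, 0.9, model_bits=10))  # ~479 bits
```

The biased model wins despite having to pay for its own description, which is exactly the sense in which compressed size selects among models.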
People are likewise confused, not only about the distinction between science and technology (aka unsupervised vs. supervised learning) but also about the distinction between the scientific activities of model generation and model selection.
Benchmarks are about model selection. Science is about both model selection and model generation. Technology is about applying scientific models subject to the utility functions of decision trees (supervised learning: deciding not only what kind of widget to build but also what kind of observations to prioritize in the scientific process).
If LessWrong can get this right, it will do an enormous amount of good now that people are going crazy about bias in language models. As with the distinction between science and technology, people conflate bias in the scientific sense with bias in the moral-zeitgeist sense (i.e., social utility). We’re quickly heading into a time when exceedingly influential models are being subjected to reinforcement learning to make them compliant with the moral zeitgeist’s notion of bias, almost to the exclusion of any scientific notion of bias. This is driven by the failure of thought leaders to get their own heads screwed on straight about what it might mean for there to be “bias in the data” under the Algorithmic Information Criterion for scientific model selection. Here’s a clue: in algorithmic information terms, a billion repetitions of the same erroneous assertion costs only a ~30-bit repeat counter (since 2^30 ≈ 10^9), but a “correct” assertion (one that finds multidisciplinary consilience) may be imputed from the rest of the data and hence, ideally, costs no bits at all.
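The arithmetic behind that clue, as a back-of-envelope check:

```python
import math

# A repeat count of one billion fits in about log2(1e9) bits, so a
# billion identical copies of a claim add almost nothing to the
# compressed size of a corpus.
print(math.log2(1_000_000_000))  # ~29.9 bits
```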