Suppose you have a biased coin that comes up heads with p=0.6 and tails with p=0.4, and suppose you flip it 10 times.
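As a rough illustration of how picking the single most likely outcome at every step differs from sampling from the distribution, here is a quick sketch for the coin case (illustrative, not part of the original exchange):

```python
# Always picking the most likely outcome per flip gives the all-heads sequence,
# whose overall probability is small compared to "typical" sequences.
import random
from math import comb

p_heads = 0.6
n_flips = 10

# Probability of the argmax-per-flip sequence: ten heads in a row.
p_all_heads = p_heads ** n_flips                                 # ~0.006

# Probability of the "typical" count of 6 heads (in any order).
p_six_heads = comb(n_flips, 6) * p_heads**6 * (1 - p_heads)**4   # ~0.25

print(f"P(HHHHHHHHHH)      = {p_all_heads:.4f}")
print(f"P(exactly 6 heads) = {p_six_heads:.4f}")

# Sampling from the true distribution almost never yields the all-heads string.
samples = ["".join("H" if random.random() < p_heads else "T" for _ in range(n_flips))
           for _ in range(10_000)]
print("all-heads frequency in 10,000 samples:", samples.count("H" * n_flips) / 10_000)
```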
That’s a different case.
If you have a text, you can calculate for every word the likelihood (L_text) that it follows the preceding words. You can also calculate the likelihood (L_ideal) of the most likely word that could follow the preceding text.
L_ideal − L_text is, in Kahneman’s terms, noise. For a given text you can then average this noise over all of its words.
The average noise produced by GPT-3 is lower than that of the average text on the internet. It would be surprising to randomly encounter texts with so little noise on the internet.
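To make that definition concrete, here is a rough sketch of the calculation; `next_word_probs` is a hypothetical stand-in for whatever language model you use, assumed to return a dict of next-word probabilities given the preceding words:

```python
# Average per-word noise (L_ideal - L_text) over a tokenized text.
def average_noise(words, next_word_probs):
    noise_values = []
    for i in range(1, len(words)):
        probs = next_word_probs(words[:i])     # distribution over possible next words
        l_text = probs.get(words[i], 0.0)      # likelihood of the word the text actually uses
        l_ideal = max(probs.values())          # likelihood of the most likely next word
        noise_values.append(l_ideal - l_text)  # per-word noise, as defined above
    return sum(noise_values) / len(noise_values)
```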
Ah, now I get your point, sorry. Yes, it is true that GPTs are not incentivised to reproduce the full data distribution, but rather, are incentivised to reproduce something more similar to a maximum-likelihood estimate point distribution. This means that they have lower variance (at least in the limit), which may improve performance in some domains, as you point out. But individual samples from the model will still have a high likelihood under the data distribution.
But individual samples from the model will still have a high likelihood under the data distribution.
That’s not true for maximum-likelihood estimates in general. It’s been more than a decade since I dealt with that topic at university while studying bioinformatics, but in the domain of bioinformatics maximum-likelihood estimates can frequently produce results that are impossible in reality, and there are a bunch of tricks to avoid that.
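One well-known pathology of raw maximum-likelihood estimates, and a standard trick to avoid it, looks like this (a generic sketch, not necessarily the specific bioinformatics cases alluded to above):

```python
# The maximum-likelihood estimate of nucleotide frequencies from a small sample
# assigns probability exactly zero to anything unobserved, ruling out events
# that clearly can occur. Pseudocounts (Laplace smoothing) are one common fix.
from collections import Counter

alphabet = "ACGT"
column = "AAAAAAAACA"            # one alignment column: only A and C observed

counts = Counter(column)

# Maximum-likelihood estimate: unseen nucleotides get probability exactly 0.
mle = {x: counts[x] / len(column) for x in alphabet}

# Pseudocount trick: pretend each nucleotide was seen one extra time.
smoothed = {x: (counts[x] + 1) / (len(column) + len(alphabet)) for x in alphabet}

print("MLE:      ", mle)         # {'A': 0.9, 'C': 0.1, 'G': 0.0, 'T': 0.0}
print("smoothed: ", smoothed)    # no zero probabilities
```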
To get back to the actual case of large language models, imagine there’s a complex chain of verbal reasoning. The next correct word in that chain has a higher likelihood than each of 200 different words that would lead to a wrong conclusion, yet the likelihood of the correct word itself might only be 0.01.
A large language model might pick the right word at every step of a 1,000-word reasoning chain. The resulting text would be very unlikely to appear in the real world.
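Running the numbers from that example (the figures are the illustrative ones above, not measurements):

```python
import math

p_correct_word = 0.01     # per-step likelihood of the correct word (from the example)
chain_length = 1000       # length of the reasoning chain (from the example)

# 0.01 ** 1000 underflows to 0.0 in floating point, so work in log space.
log10_p_chain = chain_length * math.log10(p_correct_word)
print(f"P(always picking the correct word) ~= 10^{log10_p_chain:.0f}")   # 10^-2000
```

So even though the correct word is the single most likely choice at every step, the chain as a whole has essentially zero likelihood under the data distribution.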