Ah, now I get your point, sorry. Yes, it is true that GPTs are not incentivised to reproduce the full data distribution, but rather, are incentivised to reproduce something more similar to a maximum-likelihood estimate point distribution. This means that they have lower variance (at least in the limit), which may improve performance in some domains, as you point out. But individual samples from the model will still have a high likelihood under the data distribution.
But individual samples from the model will still have a high likelihood under the data distribution.
That’s not true of maximum-likelihood estimation in general. It’s been more than a decade since I dealt with that topic at university while studying bioinformatics, but in that domain maximum-likelihood estimates can frequently produce results that would never actually appear in reality, and there are a bunch of tricks to avoid that.
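A quick toy example of what I mean (just an illustration with made-up numbers, not the actual bioinformatics case): take 1000 independent coin flips that each come up heads with probability 0.51. The single most likely sequence is all heads, but its probability is roughly 10^-292, while a typical draw has about 510 heads.

```python
import math

# Toy illustration (made-up numbers, not the bioinformatics case):
# 1000 independent coin flips, each heads with probability 0.51.
p_heads = 0.51
n_flips = 1000

# The single most probable sequence is "all heads"...
log10_p_all_heads = n_flips * math.log10(p_heads)
print(f"log10 P(all heads) ~= {log10_p_all_heads:.0f}")  # about -292

# ...but a typical draw has about 510 heads, so the maximum-likelihood
# point is a sequence you would essentially never observe in practice.
print(f"expected heads in a typical draw: {n_flips * p_heads:.0f}")  # 510
```

That is the sense in which the maximum-likelihood point can describe something you would never actually see.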
To get back to the actual case of large language models, imagine there’s a complex chain of verbal reasoning. The next correct word in that chain has a higher likelihood than any of the 200 different words that would lead to a wrong conclusion, yet the likelihood of the correct word itself might only be 0.01.
A large language model might pick the right word at every step of a 1000-word reasoning chain. The resulting text would be very unlikely to appear in the real world.
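Plugging in those numbers as a rough sketch (0.01 per correct word, 1000 words, treating the steps as independent for simplicity):

```python
import math

# Back-of-the-envelope version of the reasoning-chain argument,
# assuming the 0.01 per-word likelihood holds at every step and that
# the steps can be treated as independent (both simplifications).
p_correct = 0.01
n_words = 1000

log10_p_chain = n_words * math.log10(p_correct)
print(f"log10 P(every word correct) = {log10_p_chain:.0f}")  # -2000

# Under the data distribution, a sample would almost surely take a wrong
# turn somewhere in those 1000 words, even though each wrong word is
# individually less likely than the correct one at its step.
```

Each individual choice is the most likely one at its step, but the sample as a whole ends up astronomically atypical of the data distribution.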