But individual samples from the model will still have a high likelihood under the data distribution.
That’s not true for maximum-likelihood distributions in general. It’s been more than a decade since I dealt with the topic at university while studying bioinformatics, but in that domain maximum-likelihood distributions can frequently produce results that could never appear in reality, and there are a bunch of tricks to avoid that.
To get back to the actual case of large language models, imagine a complex chain of verbal reasoning. At each step, the correct next word has a higher likelihood than any one of 200 different words that would lead to a wrong conclusion, yet the likelihood of the correct word might be only 0.01.
A large language model might pick the right word at every step of a 1000-word reasoning chain. The result is one that would be very unlikely to appear in the real world.
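To put rough numbers on that, here is a minimal sketch of the arithmetic. The 0.01, 200 and 1000 come from the example above; everything else is my own simplifying assumption: the leftover probability is split evenly across the 200 wrong words, the steps are treated as independent, and the function names are made up for illustration.

```python
import math

P_CORRECT = 0.01     # likelihood of the correct next word (from the example)
N_WRONG = 200        # competing words that lead to a wrong conclusion (from the example)
CHAIN_LENGTH = 1000  # words in the reasoning chain (from the example)

# Assumption: the remaining probability mass is spread evenly over the wrong words.
P_EACH_WRONG = (1.0 - P_CORRECT) / N_WRONG


def greedy_pick(p_correct: float, p_each_wrong: float) -> str:
    """Always choosing the single most likely word picks the correct one
    whenever it beats every individual wrong word."""
    return "correct" if p_correct > p_each_wrong else "wrong"


def log10_prob_all_correct(p_correct: float, length: int) -> float:
    """log10 of the probability that every word in the chain is correct,
    assuming independent steps. Kept in log space to avoid underflow."""
    return length * math.log10(p_correct)


def expected_correct_when_sampling(p_correct: float, length: int) -> float:
    """Expected number of correct words if each word is sampled from the
    distribution instead of taking the most likely one."""
    return p_correct * length


if __name__ == "__main__":
    print("each wrong word has probability", P_EACH_WRONG)
    print("most-likely-word choice at each step:", greedy_pick(P_CORRECT, P_EACH_WRONG))
    print("log10 P(all words correct):", log10_prob_all_correct(P_CORRECT, CHAIN_LENGTH))
    print("expected correct words when sampling:",
          expected_correct_when_sampling(P_CORRECT, CHAIN_LENGTH))
```

Under these assumptions, the word-by-word most likely choice is always the correct one, even though a chain that is correct at every step has a vanishingly small probability under the distribution and a typical sampled chain gets only a handful of words right.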