Yes, predicting some sequences can be arbitrarily hard. But I have doubts that LLM training will try to predict very hard sequences.
Suppose some sequences are not only difficult but impossible to predict, because they're random? I would expect that, with enough training, the model would overfit and memorize them, since they get visited more than once in the training data. Memorization rather than generalization seems likely to happen for anything particularly difficult?
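To make the memorization point concrete, here's a toy sketch (my own illustration, not a claim about how LLM training actually works): a lookup-table "predictor" that memorizes random training sequences scores perfectly on them, but since there's no pattern to learn, it's at chance on fresh random sequences.

```python
import random

random.seed(0)

# "Sequences" are random bit strings with a random next bit. Distinct
# prefixes, so memorization can be perfect on the training set.
prefixes = set()
while len(prefixes) < 100:
    prefixes.add(tuple(random.choices([0, 1], k=16)))
train = [(p, random.choice([0, 1])) for p in prefixes]

# The "memorizer": store each training continuation verbatim.
memory = dict(train)

def predict(prefix):
    # Return the memorized continuation if seen; otherwise guess 0.
    return memory.get(prefix, 0)

# Perfect on the memorized training data...
train_acc = sum(predict(p) == y for p, y in train) / len(train)

# ...but only chance-level on fresh random sequences, because there
# was never a pattern to generalize from.
test = [(tuple(random.choices([0, 1], k=16)), random.choice([0, 1]))
        for _ in range(1000)]
test_acc = sum(predict(p) == y for p, y in test) / len(test)

print(train_acc)  # 1.0
print(test_acc)   # roughly 0.5
```

Of course, a real network memorizes in weights rather than a dictionary, but the training/test gap is the same signature of memorization without generalization.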
Meanwhile, there is a sea of easier sequences. Wouldn't it be more "evolutionarily profitable" to predict those instead? Pattern recognizers that predict easy sequences seem more likely to survive than pattern recognizers that predict hard sequences. Maybe the recognizers for hard sequences would be so rarely used, and make so little progress, that they'd get repurposed?
Thinking like a compression algorithm, a pattern recognizer needs to be worth its weight, or you might as well leave the data uncompressed.
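The compression version of this is easy to demonstrate with an off-the-shelf compressor (just an illustration of the analogy, nothing LLM-specific): zlib's internal pattern-matching pays for itself on patterned data, while random bytes come out no smaller than they went in, so any "model" of them is dead weight.

```python
import os
import zlib

# Patterned data: a short motif repeated many times. The compressor's
# pattern recognizer earns its keep here.
patterned = b"the cat sat on the mat. " * 1000

# Random data: nothing to recognize, so zlib effectively leaves it
# uncompressed (stored blocks plus a small header).
random_bytes = os.urandom(len(patterned))

print(len(patterned), len(zlib.compress(patterned)))        # big -> tiny
print(len(random_bytes), len(zlib.compress(random_bytes)))  # big -> big
```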
I’m reasoning by analogy here, so these are only possibilities. Someone will need to actually research what LLMs do. Does it work to think of LLM training as pattern-recognizer evolution? What causes pattern recognizers to be kept or dropped?