In what sense is predicting internet text “training [LLMs] on human thoughts”?
Internet text contains the inputs and outputs of human minds in the sense that every story, post, article, essay, book, etc. written by humans first went through our brains, word by word, token by token, tracing a path through our minds.
Training on internet text is literally training on human thoughts, because text written by humans is literally an encoding of human thoughts. That the encoding is incomplete and lossy is mostly irrelevant: given enough data, you can infer across the gaps, just as a small fraction of the pixels in an image can suffice to reconstruct it.
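A minimal sketch of that last intuition: recovering a smooth synthetic image from 10% of its pixels by interpolation. The image, the 10% sampling rate, and the choice of scipy's `griddata` are all illustrative assumptions, not a claim about how LLM training works; true sparse reconstruction would use stronger priors, but even plain interpolation shows how much redundant structure fills the gaps.

```python
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)

# A smooth synthetic "image": a 2D sinusoidal pattern standing in for real data.
h, w = 64, 64
yy, xx = np.mgrid[0:h, 0:w]
image = np.sin(xx / 8.0) + np.cos(yy / 11.0)

# Keep only 10% of pixels, chosen at random; the rest are "gaps".
mask = rng.random((h, w)) < 0.10
known_points = np.column_stack([yy[mask], xx[mask]])
known_values = image[mask]

# Infer every missing pixel from the known ones by cubic interpolation.
all_points = np.column_stack([yy.ravel(), xx.ravel()])
reconstruction = griddata(known_points, known_values, all_points,
                          method="cubic").reshape(h, w)

# Points outside the convex hull of the samples come back NaN; skip them.
valid = ~np.isnan(reconstruction)
err = np.abs(reconstruction[valid] - image[valid]).mean()
print(f"mean abs error from 10% of pixels: {err:.4f}")
```

Because the underlying signal is highly structured, a random tenth of the pixels pins down nearly all of it; the analogy is that text, as a structured trace of thought, lets a model infer far more than any single fragment states explicitly.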