That is not how this works. Let’s walk through it for both the “human as clumps of molecules following physics” picture and the “LLM as next-text-on-internet predictor” picture.
Humans as clumps of molecules following physics
Picture a human attempting to achieve some goal. For concreteness, let’s say the human is trying to pick an apple from a high-up branch on an apple tree. Picture what that human does: maybe they get a ladder, or climb the tree, or whatever. They manage to pluck the apple from the tree and drop it in a basket.
Now, imagine a detailed low-level simulation of the exact same situation: that same human trying to pick that same apple. Modulo quantum noise, what happens in that simulation? What do we see when we look at its outputs? Well, it looks like a human attempting to achieve some goal: the clump of molecules which is a human gets another clump which is a ladder, or climbs the clump which is the tree, or whatever.
LLM as next-text-on-internet predictor
Now imagine finding the text “Notes From a Prompt Factory” on the internet, today (because the LLM is trained on text from ~today). Imagine what text would follow that beginning on the internet today.
The text which follows that beginning on the internet today is not, in fact, notes from a prompt factory. Instead, it’s fiction about a fictional prompt factory. So that’s the sort of thing we should expect a highly capable LLM to output following the prompt “Notes From a Prompt Factory”: fiction. The more capable it is, the more likely it is to correctly realize that that prompt precedes fiction.
It’s not a question of whether the LLM is absorbing the explicit and tacit knowledge of internet authors; I’m perfectly happy to assume that it is. The issue is that the distribution of text on today’s internet which follows the prompt “Notes From a Prompt Factory” is not the distribution of text which would result from actual notes on an actual prompt factory. The highly capable LLM absorbs all that knowledge from internet authors, and then uses that knowledge to correctly predict that what follows the text “Notes From a Prompt Factory” will not be actual notes from an actual prompt factory.
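To make the distinction concrete, here is a toy sketch, not a claim about how any real LLM is trained. It assumes a made-up miniature "internet" (the `corpus` list and the labels in it are hypothetical) and a hypothetical helper `continuation_distribution` which plays the role of an ideal next-text predictor: it just reports the empirical distribution of what follows a given title in the corpus.

```python
from collections import Counter

# Hypothetical miniature "internet": (title, kind of text that follows it).
# The entries are invented for illustration only.
corpus = [
    ("Notes From a Prompt Factory", "fiction"),
    ("Notes From a Prompt Factory", "fiction"),
    ("Notes From a Prompt Factory", "fiction"),
    ("Infinitely Many Primes", "correct proof"),
    ("Infinitely Many Primes", "correct proof"),
    ("Infinitely Many Primes", "mistaken proof"),
]


def continuation_distribution(corpus, title):
    """Empirical distribution over what follows `title` in the corpus.

    An ideal predictor trained on this corpus would output exactly this,
    regardless of what the title literally claims the text is.
    """
    counts = Counter(kind for t, kind in corpus if t == title)
    total = sum(counts.values())
    if total == 0:
        return {}  # title never appears in the corpus
    return {kind: n / total for kind, n in counts.items()}


dist = continuation_distribution(corpus, "Notes From a Prompt Factory")
print(dist)  # {'fiction': 1.0}
```

No amount of extra knowledge changes that output: "actual notes from an actual prompt factory" gets probability zero because it never follows that title in the training distribution, and a better predictor only matches that distribution more closely.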
“Some content on the Internet is fabricated, and therefore we can never trust LMs trained on it”
Is this a fair summary?
No, because we have tons of information about which specific kinds of information on the internet are/aren’t usually fabricated. It’s not like we have no idea at all which internet content is more/less likely to be fabricated.
Information about, say, how to prove that there are infinitely many primes is probably not usually fabricated. It’s standard basic material, there are lots of presentations of it, and it’s not the sort of thing which people usually troll about. Yes, the distribution of internet text about the infinitude of primes contains more-than-zero trolling and mistakes and the like, but that’s not the typical case, so low-temperature sampling from the LLM should usually work fine for that use-case.
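A rough sketch of why low temperature helps in the typical-case-is-fine regime: sampling at temperature T reweights each probability p to p^(1/T) and renormalizes, so T < 1 pushes mass toward whatever is already most likely. The probabilities below are invented for illustration, and `apply_temperature` is a hypothetical helper operating on a distribution over kinds of text rather than on real token logits.

```python
def apply_temperature(probs, temperature):
    """Reweight a distribution as p ** (1 / T), then renormalize.

    T = 1 leaves the distribution unchanged; T < 1 concentrates
    probability on the most likely outcomes.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    weights = {k: p ** (1.0 / temperature) for k, p in probs.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Made-up numbers: most internet text on the infinitude of primes is fine.
primes_text = {"correct proof": 0.90, "mistaken proof": 0.07, "trolling": 0.03}
cooled = apply_temperature(primes_text, temperature=0.5)
print(cooled["correct proof"])  # roughly 0.99, up from 0.90

# Made-up numbers: everything following this kind of title is fabricated.
fabricated_text = {"fiction or wrong": 1.0}
print(apply_temperature(fabricated_text, temperature=0.5))
```

Lowering the temperature only sharpens the distribution which is already there. It rescues the case where the typical continuation is correct, and it does nothing for a case where the typical continuation is itself fabricated.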
On the other end of the spectrum, “fusion power plant blueprints” on the internet today will obviously be fictional and/or wrong, because nobody currently knows how to build a fusion power plant which works. This generalizes to most use-cases in which we try to get an LLM to do something (using only prompting on a base model) which nobody is currently able to do. Insofar as the LLM is able to do such things, that actually reflects suboptimal next-text prediction on its part.
I would add “and the kind of content you want to get from aligned AGI definitely is fabricated on the Internet today”. So a powerful LM trying to predict it will predict what the fabrication would look like.