I have some reasons for being optimistic about ‘white box heuristic reasoning’ (humans understanding models is a special case of this), but models becoming easier to understand as they get bigger isn’t one of them.
That’s not really the correct comparison. The correct comparison is neural outputs of linguistic cortex vs LLM neural outputs, because the LLM isn’t having to learn a few-shot mouse/keyboard minigame like the human is.
Humans do not have direct access to the implicit predictions of their brain’s language centers, any more than the characters simulated by a language model have access to the language model’s token probabilities.
Really, the correct comparison is something like asking the LLM to make a zero-shot prediction of the form:
Consider the following sentence: “I am a very funny _”
What word seems most likely to continue the sentence?
Answer:
I expect LLMs to do much worse when prompted like this, though I haven’t done the experiment myself.
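For concreteness, here is a minimal sketch of the comparison being proposed, assuming GPT-2 via HuggingFace transformers (the model choice and exact wiring are my own illustration, not something tested in this thread). It reads off the model’s actual next-token distribution and then asks the same question “verbally” as a zero-shot prompt, so the two answers can be compared:

```python
# Untested sketch: compare a model's direct next-token distribution with its
# answer when asked the question "verbally" in a zero-shot prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# (a) Direct readout: next-token distribution after "I am a very funny"
ids = tok("I am a very funny", return_tensors="pt").input_ids
with torch.no_grad():
    probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
print([(tok.decode(i), round(p.item(), 3)) for i, p in zip(top.indices, top.values)])

# (b) Verbal readout: ask the model the same question in natural language
prompt = ('Consider the following sentence: "I am a very funny _"\n'
          "What word seems most likely to continue the sentence?\n"
          "Answer:")
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=3, do_sample=False)
print(tok.decode(out[0, ids.shape[1]:]))
```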
Humans do not have direct access to the implicit predictions of their brain’s language centers,
But various other human brain modules do have direct access to the outputs of linguistic cortex, and that is the foundation of most of our linguistic abilities, which surpass those of LLMs in many ways.
1. Human linguistic cortex learns via word/token prediction, just like LLMs.
2. Human linguistic cortical outputs are the foundation for various linguistic abilities, and performance on those abilities follows from performance on 1.
3. Humans generally outperform LLMs on most downstream linguistic tasks.
I’m merely responding to this statement:
Language models are already superhuman at next token prediction
Which is misleading: LLMs outperform humans at the next-token prediction game, but that does not establish that LLMs are superhuman relative to human linguistic cortex (establishing that would require comparing neural readouts).
I don’t think this sort of prompt actually gets at the conscious reasoning gap. It only takes one attention head to copy the exact next-token prediction made at a previous token, and I’d expect that if you used few-shot prompting (especially filling the entire context with few-shot examples), the model would use its induction-like heads to just copy its predictions and perform quite well.
A better example would be to have the model describe its reasoning about predicting the next token, and then pass that to itself in an isolated prompt to predict the next token.
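A minimal sketch of that two-stage setup, assuming an OpenAI-style chat API (the model name, prompt wording, and the instruction not to state a final answer in stage one are all my own illustrative choices):

```python
# Untested sketch of the two-stage protocol: (1) the model writes down its
# reasoning about the next token; (2) that reasoning alone is handed to an
# isolated prompt, which must commit to a prediction without re-seeing the
# original context.
from openai import OpenAI

client = OpenAI()
SENTENCE = "I am a very funny _"

reasoning = client.chat.completions.create(
    model="gpt-4o-mini",  # arbitrary choice
    messages=[{
        "role": "user",
        "content": f'Explain, step by step, what word you think continues "{SENTENCE}" '
                   "and why. Do not state a final answer.",
    }],
).choices[0].message.content

prediction = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Here is some reasoning about a fill-in-the-blank sentence:\n\n"
                   f"{reasoning}\n\n"
                   "Based only on this reasoning, what single word fills the blank?",
    }],
).choices[0].message.content

print(prediction)
```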
This sort of prompt shows up in the corpus, and when it does, it implies a different token distribution for the _ than the typical distribution on the corpus. Ofc, you could make the model quite good at prompts like this via finetuning.
Imo, it is reasonably close to the right comparison for thinking about humans understanding how LLMs work (I make no claims about this being a reasonable comparison for other things). We care about how humans perform using conscious reasoning.
Similarly, I’d claim that trying to do interpretability on your own linguistic cortex is made difficult by the fact that the linguistic cortex (probably) implicitly represents probability distributions over language which are much better than those you can consciously compute.
More generally, it’s worth thinking about the conscious reasoning gap—this gap happens to be smaller in vision for various reasons.
This gap will also ofc exist in language models trying to interpret themselves, but fine-tuning might be very helpful for at least partially removing this gap.
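To make the finetuning suggestion concrete, here is a rough sketch (my own construction, not something from the thread) of how one could generate training pairs that teach a model to report its own implicit next-token predictions when asked verbally, again assuming GPT-2 via HuggingFace transformers:

```python
# Untested sketch: label each verbal "what comes next?" prompt with the model's
# own top next-token prediction, producing pairs for a standard causal-LM
# fine-tuning loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prefixes = ["I am a very funny", "The capital of France is", "She opened the"]

pairs = []
for prefix in prefixes:
    ids = tok(prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        next_id = model(ids).logits[0, -1].argmax()
    target = tok.decode(next_id).strip()
    prompt = (f'Consider the following sentence: "{prefix} _"\n'
              "What word seems most likely to continue the sentence?\n"
              "Answer:")
    pairs.append({"prompt": prompt, "completion": " " + target})

print(pairs[0])
```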
Here’s what GPT-3 output for me:
Its distribution over continuations for the sentence itself is broader:
I’d have expected it to become less confident of its answer when asked verbally.
Isn’t this about generation vs classification, not language vs vision?