One thing I have observed, working with LLMs, is that when they’re predicting the next token in a Python REPL, they make mistakes pretty similar to the ones a human who wasn’t paying that much attention would make. For example, consider the following:
>>> a, b = 3, 5 # input
>>> a + b # input
8 # input
>>> a, b = b, a # input
>>> a * b # input
15 # prediction (text-davinci-003, temperature=0, correct)
>>> a / b # input
1.0 # prediction (text-davinci-003, temperature=0, incorrect but understandable mistake)
>>> a # input
5 # prediction (text-davinci-003, temperature=0, correct)
>>> a / b # input
1.0 # prediction (text-davinci-003, temperature=0, incorrect but understandable mistake)
>>> a # input
5 # prediction (text-davinci-003, temperature=0, correct)
>>> a / b # input
1.0 # prediction (text-davinci-003, temperature=0, incorrect but understandable mistake)
>>> b # input
3 # prediction (text-davinci-003, temperature=0, correct)
>>> a / b # input
1.6666666666666667 # prediction (text-davinci-003, temperature=0, now correct -- looks like a humanish "oh whoops lol")
I expect that most examples in the training data of “this looks like an interaction with the Python REPL” were in fact interactions with the Python REPL. To the extent that GPT-N models do make human-like mistakes when predicting non-human-like data (instead of predicting that non-human-like data correctly, which is what their loss function wants them to do), I think that does serve as nonzero evidence that their cognition is “human-like” in the specific narrow sense of “mirroring our quirks and limitations”.
More generally, I think it’s particularly informative to look at the cases where the thing GPT-n does is different from the thing its loss function wants it to do. The extent to which those particular cases look like human failure modes is informative about how “human-like” the cognition of GPT-n-class models is. (As a note, the SolidGoldMagikarp class of failure modes is extremely non-human-like, and that observation caused me to update more toward the “shoggoth which can exhibit human behavior among many other behaviors” view. But I haven’t actually seen a whole lot of failure modes like that in normal use, and I have seen a bunch of human-not-paying-attention-type failure modes in normal use.)
An interesting example! A couple remarks:
A more human mistake might be guessing 0.6 (i.e., a / b with the pre-swap values 3 and 5) and not 1.0?
After the mistake, it’s not clear what the “correct” answer is, from a text-prediction perspective. If I were trying to predict the output of my Python interpreter, and it output 1.0, I’d predict that future outputs on the same input would also be “wrong”: either I was using some kind of bugged interpreter, or I was looking at some kind of human-guessed transcript of a Python session.
Yeah, that one’s “the best example of the behavior that I was able to demonstrate from scratch with the OpenAI playground in 2 minutes”, not “the best example of the behavior I’ve ever seen”. Mostly the instances I’ve seen were chess-specific results on a model that I specifically fine-tuned on Python REPL transcripts along the lines of the sketch below.
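(The original transcript isn’t reproduced here; what follows is a rough, hypothetical sketch of that kind of session, assuming the python-chess library and a transcript format that writes return values out explicitly, which is where a literal None would show up.)
>>> import chess
>>> board = chess.Board()
>>> board.push_san("e4")
Move.from_uci('e2e4')
>>> board.push_san("e5")
Move.from_uci('e7e5')
>>> print(board.push(chess.Move.from_uci("g1f3")))
None
>>> board.piece_at(chess.F3)
Piece.from_symbol('N')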
And it would print N instead of None (except that in the actual examples it was mostly a much longer transcript, and it was more like the model would forget where the pieces were if the transcript contained an unusual move or just too many moves).
For context, I was trying to see if a small language model could be fine-tuned to play chess, and was working under the hypothesis that “a Python REPL will make the model behave as if statefulness holds”.
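(As a rough, hypothetical sketch of how fine-tuning data like that could be generated, again assuming python-chess and a transcript style that spells out the None return values; the repl_transcript helper below is purely illustrative, not the setup actually used.)
# Hypothetical sketch only; not the actual fine-tuning pipeline described above.
import random
import chess

def repl_transcript(num_moves=10, seed=0):
    """Render a game of random legal moves as a fake Python REPL session."""
    rng = random.Random(seed)
    board = chess.Board()
    lines = [">>> import chess", ">>> board = chess.Board()"]
    for _ in range(num_moves):
        if board.is_game_over():
            break
        move = rng.choice(list(board.legal_moves))
        lines.append(f'>>> board.push(chess.Move.from_uci("{move.uci()}"))')
        board.push(move)
        # push() returns None; a real REPL echoes nothing for None, but writing
        # it out gives the model an explicit line to predict after each move.
        lines.append("None")
        lines.append(">>> board.fen()")
        lines.append(repr(board.fen()))
    return "\n".join(lines)

print(repl_transcript(num_moves=6))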
And then, of course, the Othello paper came out, and Bing Chat came out and could just flat-out play chess without having been explicitly trained on it, and the question of “can a language model play chess” became rather less compelling, because the answer was just “yes”.
But that project is where a lot of my “the mistakes tend to look like things a careless human does, not weird alien mistakes” intuitions ultimately come from.
An alternative explanation of the mistakes is that making mistakes and then correcting them was rewarded during additional post-training refinement stages. I work with GPT-4 daily, and sometimes it feels like it makes mistakes on purpose just to be able to say that it is sorry for the confusion and then correct them. It also feels like it makes fewer mistakes when you ask politely (using please, thank you, etc.), which is rather strange.
Nevertheless, distillation also seems like a very plausible part of what is going on here.
It does not distill the whole of a human mind, though. There are areas that are intuitive for the average human, even a small child, but not for GPT-4. For example, it has problems with concepts of 3D geometry and with visualizing things in 3D. It may have similar gaps in other areas, including more important ones (like moral intuitions).