This is hilarious.
How would you describe this failure mode in human terms? Do you think humans would make this category of mistake?
One argument we often hear is that GPT-3/AI is not really intelligent because it is not able to maintain logical consistency in longer texts. This has always seemed a dubious argument to me, because humans are in fact not very good at either logic or consistency: anyone who has graded an undergrad maths exam necessarily has a dim view of human logic, and even top authors routinely include inconsistencies in their novels.
However, this Morse error looks really alien to me; in fact I have trouble imagining any human making this kind of mistake.
It reminds me a little of a story in Surely you’re joking, Mr Feynman in which someone who thinks Feynman is a phony decides to show him up by having someone greet him (in some social situation where he might be embarrassed to admit ignorance) in Chinese. So Feynman meets Chinese speaker; Chinese speaker greets Feynman in Chinese; Feynman responds with fluent pseudo-Chinese doubletalk; Chinese speaker retires in confusion, thinking that Feynman speaks Cantonese where she speaks Mandarin (or maybe it’s the other way around).
It’s rather a weird story and I can’t escape the suspicion that somehow Feynman’s telling of it isn’t perfectly honest. But the relevant thing here is that Feynman, presented with a sudden demand to produce a language he doesn’t know, responds not by saying “I’m sorry, Dave, I can’t do that”, but by fluently producing nonsense that resembles that language. Which is pretty much what GPT-3 is doing when you demand that it speak to you in Morse code.
Or consider “speaking in tongues” as practiced in some religious communities; here there isn’t a specific other language involved, but in some sense what’s happening is that people are put in a position where they look good if they start speaking Mysterious Angelic Languages, so (in at least some cases) they start emitting fluent pseudo-language. (I am assuming here that in fact they are not magically speaking Mysterious Angelic Languages; my understanding is that quite aside from the prior improbability of that, when linguists look at samples of glossolalia it doesn’t look very language-like and does contain a surprisingly large number of short-term repetitions, rather like some of that GPT-3 output.)
None of this is the exact same thing that GPT-3 is doing when asked to translate into Morse code, but it’s similar enough that I don’t find GPT-3’s output so completely alien. I think that when a person is put in a situation where fluent nonsense might make them look better than an honest admission of ignorance, fluent nonsense isn’t so unusual, and arguably GPT-3 is always in a position where fluency is required.
That doesn’t seem too far off from the Chinese Room thought experiment: https://en.wikipedia.org/wiki/Chinese_room
Coincidentally also involving Chinese—or is that irony? I can’t ever understand irony, technically; can GPT-3? lol.
Mm. If you buy the idea that it has memorized a single response and is falling back on it as the maximum-likelihood answer, overgeneralized no matter how tiny its posterior might be, then it does have human analogues: small children are particularly infamous for pointing at things and going “gavagai? gavagai?”
The first time my kid called me “Daddy” my heart melted. That feeling was tempered somewhat the first time he called a stranger on the street “Daddy.” Today he called the roomba “Daddy” and I’m starting to feel a little insulted. My wife has it worse, though. He calls her “milk.”
(If I remember being a little kid and my siblings correctly, this would not be remotely the strangest sort of linguistic behavior one would’ve observed.)
Seeing your comment, I now remember that one of my sisters would answer the question “What colour is this?” <pointing at something> with “blue” no matter what the colour actually was.
She has good taste in colors! If you have to have only one...
I think the tokenisation really works against GPT here, even more so than I originally realised, to the point that I think GPT is doing a meaningfully different (and much harder) task than what humans encoding Morse are doing.
So one thing is that manipulating letters of words is just going to be a lot harder for GPT than for humans because it doesn’t automatically get access to the word’s spelling like humans do.
Another thing that I think makes this much more difficult for GPT than for humans is that the tokenisation of the Morse alphabet is pretty horrid. Whereas for humans Morse is made of four base characters ('-', '.', <space>, '/'), tokenised Morse uses eighteen unique tokens to encode the 26 letters plus 2 separation characters. This is because of the way spaces are tokenised.
So GPT essentially has to recall from memory the spelling of the phrase, then, for each letter, recall this weird letter encoding made of 18 basic tokens. (Maybe a human equivalent of this might be something like recalling a somewhat arbitrary but commonly used encoding from kanji to letters, and then also recalling this weird letter-to-18-symbol code?)
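(To see the problem concretely: here is a quick sketch, assuming the tiktoken library and its “gpt2” byte-pair encoding, which is essentially the vocabulary GPT-3 uses; the exact splits and counts may vary with tokenizer version, and the variable names are mine.)

import tiktoken

enc = tiktoken.get_encoding("gpt2")

morse = ".... . .-.. .-.. --- / .-- --- .-. .-.. -.."  # "HELLO WORLD"
ids = enc.encode(morse)

# Show how the dots, dashes and spaces get chopped into subword tokens,
# rather than the four base characters a human sees.
print([enc.decode([i]) for i in ids])
print(f"{len(morse)} characters -> {len(ids)} tokens, {len(set(ids))} distinct token types")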
When the task is translated into a form which avoids these tokenisation issues a bit more, GPT does a somewhat better job.
This doesn’t deal with word separation though. I tried very briefly to get Python programs which can handle whole sentences, but it doesn’t seem to get that spaces in the original text should be encoded as “/” in Morse (even if it sometimes includes “/” in its dictionary).
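(For reference, a correct program of the kind in question is short; the sketch below, with its MORSE table and to_morse function, is an illustration written by hand, not GPT’s actual output.)

MORSE = {
    'a': '.-',   'b': '-...', 'c': '-.-.', 'd': '-..',  'e': '.',
    'f': '..-.', 'g': '--.',  'h': '....', 'i': '..',   'j': '.---',
    'k': '-.-',  'l': '.-..', 'm': '--',   'n': '-.',   'o': '---',
    'p': '.--.', 'q': '--.-', 'r': '.-.',  's': '...',  't': '-',
    'u': '..-',  'v': '...-', 'w': '.--',  'x': '-..-', 'y': '-.--',
    'z': '--..',
}

def to_morse(text):
    # Letters are separated by spaces, words by " / ".
    words = text.lower().split()
    return " / ".join(" ".join(MORSE[c] for c in word if c in MORSE)
                      for word in words)

print(to_morse("I am a robot"))  # .. / .- -- / .- / .-. --- -... --- -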
You mean it can output a correct program that does the translation, but it can’t do the translation itself? That’s even weirder.
I don’t think it’s so very weird.
Argument 1: “In order to write a program to do a thing, you must yourself understand how to do the thing.”
Objection 1a: Not very true. Many not-terribly-good programmers write code that kinda-works by cobbling together things they find on the internet. I think GPT-3 does something fairly similar. Which, to be clear, is still impressive! Most humans cannot write often-kinda-working software by cobbling things together from the internet! But it is certainly possible to write working code to do something without yourself understanding how to do it.
Objection 1b: I can write a program that calculates pi to 100 decimal places in a reasonable amount of time, but I cannot myself calculate pi to 100 decimal places without (with high probability) making mistakes along the way. (Well, as it happens I know pi to 100 decimal places, or at least have done in the past, so if that counts as “calculating” then I guess I can, but it shouldn’t.)
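(For concreteness, here is a sketch of what such a program might look like, using Python’s decimal module and Machin’s formula; the function names are illustrative, not anyone’s actual code.)

# Sketch: pi to 100 decimal places via Machin's formula
#   pi = 16*arctan(1/5) - 4*arctan(1/239)
from decimal import Decimal, getcontext

def arctan_inv(x, digits):
    # arctan(1/x) by its Taylor series, accurate to roughly `digits` places.
    eps = Decimal(10) ** -(digits + 5)
    power = Decimal(1) / x          # 1 / x**(2k+1), starting at k = 0
    total = Decimal(0)
    n, sign, x2 = 1, 1, x * x
    while power > eps:
        total += sign * power / n
        power /= x2
        n += 2
        sign = -sign
    return total

def compute_pi(digits=100):
    getcontext().prec = digits + 10                      # working precision with guard digits
    pi = 16 * arctan_inv(5, digits) - 4 * arctan_inv(239, digits)
    getcontext().prec = digits + 1                       # "3" plus 100 decimals
    return +pi                                           # unary + rounds to the new precision

print(compute_pi(100))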
Argument 2: “If you can write a program to do a thing, then having written it you can execute it in your head and see what the result is.”
Objection 2a: Not very true. Many not-terribly-good programmers are surprisingly bad at executing programs in their heads. And GPT-3, in particular, is literally unable to do more than a fixed amount of computation per token it outputs. (It might be interesting to try to make it run a program in its head and make notes as it goes, which might let it get around that limitation, but I think the finite input window would then be a problem.)
Objection 2b: Again, I can write a program that computes pi to 100 decimal places but I cannot execute it in my head. I would at the very least need a substantial amount of paper to make notes on.
(If there’s some other reason why it’s weird for GPT-3 to be able to write a correct program to do a thing but not able to do the thing itself, I’m missing it.)
I tried a bit of handholding for simple program simulation, as follows:
[--- prompt begins ---]
Consider this function written in Python.
def f(n):
    if n <= 1: return n
    else: return f(n-1) + f(n-2)
What is the value of f(5)?
Answer:
Since 5 <= 1 is false, f(5) equals f(4) + f(3), so we need to know the values of those.
Since 4 <= 1 is false, f(4) equals f(3) + f(2), so we also need to know f(2).
Since 3 <= 1 is false, f(3) equals f(2) + f(1) = f(2) + 1.
Since 2 <= 1 is false, f(2) equals f(1) + f(0) = 1 + 0 = 1.
So now we can work out that f(3) = f(2) + f(1) = 1 + 1 = 2.
So f(4) = f(3) + f(2) = 2 + 1 = 3.
So f(5) = f(4) + f(3) = 3 + 2 = 5.
So: The value of f(5) is 5.
Consider this function written in Python.
def g(n):
    if n <= 1: return n
    else: return n * g(n-1)
What is the value of g(6)?
Answer:
[--- prompt ends, GPT-3 output begins ---]
Since 6 <= 1 is false, g(6) equals 6 * g(5), so we need to know the value of g(5).
Since 5 <= 1 is false, g(5) equals 5 * g(4), so we also need to know g(4).
Since 4 <= 1 is false, g(4) equals 4 * g(3), so we also need to know g(3).
Since 3 <= 1 is false, g(3) equals 3 * g(2), so we also need to know g(2).
Since 2 <= 1 is false, g(2) equals 2 * g(1), so we also need to know g(1).
Since 1 <= 1 is true, g(1) equals 1.
So: The value of g(6) is 6 * 5 * 4 * 3 * 2 * 1 = 720.
[--- GPT-3 output ends ---]
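(For reference, running the two functions from the prompt directly confirms both values; a quick check, nothing more.)

def f(n):
    if n <= 1: return n
    else: return f(n-1) + f(n-2)

def g(n):
    if n <= 1: return n
    else: return n * g(n-1)

assert f(5) == 5      # the worked example in the prompt
assert g(6) == 720    # the value GPT-3 arrived at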
On another run, it got as far as working out g(1) and started working upwards again (after the fashion of the Fibonacci example I gave it) but sadly it stopped emitting tokens immediately before actually saying what the value of g(6) is. I bet it would have got it right :-).
… But. I next tried replacing g with a function (iterative rather than recursive) that reverses a string one character at a time and asked it for g(“robot”) … and it just spat out ‘The value of g(“robot”) is “tobor”’ without showing any of its working. That was pretty surprising, and maybe argues against the idea that it has little ability to execute code “in its head”. If I ask it for g(“I am a robot”) it gives “tobora ma I”, missing one space, which seems like a fairly characteristic error given how GPT-3 tokenization works, but also the kind of error a human might make. For “I am a very stupid robot” it gave (still with no “tracing”) the answer “tobot ypsut evir a ma I”, which is starting to look much less like something a human might do.
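(The exact reversal function isn’t quoted here; a minimal iterative version of the kind described might look like the following, which is a reconstruction rather than the original.)

def g(s):
    # Reverse the string one character at a time, iteratively.
    result = ""
    for ch in s:
        result = ch + result
    return result

print(g("robot"))         # tobor
print(g("I am a robot"))  # tobor a ma I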
So then I tried giving it something trickier and explicitly asking it to show its working. Still taking a string as input, the new function starts at 0 and repeatedly does i = (2*i+1)%len(s) while building up its output, stopping if i reaches 0 again. So if you give it “robot” it will look at indices 0, 1, 3, 2, 0 and should output “roob”. Here is what GPT-3 actually wrote. (I had added the words “Please show your working.” to the prompt.)
The first time through the loop, i becomes 1 and result becomes “r”.
The second time through the loop, i becomes 3 and result becomes “ro”.
The third time through the loop, i becomes 7 and result becomes “rob”.
The fourth time through the loop, i becomes 15 and result becomes “robo”.
The fifth time through the loop, i becomes 31 and result becomes “robot”.
At this point, i == 0, so the while loop breaks.
So: The value of g(“robot”) is “robot”.
This is not so impressive. It’s understood the i → 2i+1 thing but not the modulus operator. It has some sort of preference for just taking the letters of the input string in order and is doing that regardless of the actual values taken by i. I can, actually, imagine a human making these mistakes, but not a very bright one.
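(Again, the exact code isn’t quoted above; a reconstruction consistent with the description would be something like the function below. Run for real, it visits indices 0, 1, 3, 2 and returns “roob”, which is what the trace should have produced.)

def g(s):
    # Visit indices starting at 0, appending s[i] each time,
    # updating i = (2*i + 1) % len(s), and stopping when i wraps back to 0.
    i = 0
    result = ""
    while True:
        result += s[i]
        i = (2 * i + 1) % len(s)
        if i == 0:
            break
    return result

print(g("robot"))  # roob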
[EDITED to add: This is all with text-davinci-002 in the OpenAI Playground. There are some extra line breaks in my prompts that I haven’t reproduced here because my comment takes up too much vertical space already. These were all first attempts—no cherry-picking—except that the “another run” for the Fibonacci/factorial example was actually the first run and the result shown was the second.]
Isn’t this most programming jobs? Code by reference/example. Implement, get output, not understand intimately.
It reminds me of the way human children going through language learning will often latch onto words or phrases to repeat and play with before moving on. (Possibly annoying the adults around them in the process.)
Something that comes to mind is, on an open-book test, transcribing an example problem from the book rather than correctly adapting the process to the test problem.
It reminds me of a mentally handicapped guy I used to play chess with. He always maneuvered his rook in front of his king’s position no matter what I played. That’s a really slow and weird maneuver which only works if I do absolutely nothing.
It was just the only plan he knew.
I am reminded of the classic “Oh say it again Dexter” “Omelette du fromage”