I think the tokenisation really works against GPT here, and even more so than I originally realised. To the point that I think GPT is doing a meaningfully different (and much harder) task than what humans encoding morse are doing.
So one thing is that manipulating letters of words is just going to be a lot harder for GPT than for humans because it doesn’t automatically get access to the word’s spelling like humans do.
Another thing that I think makes this much more difficult for GPT than for humans is that the tokenisation of the morse alphabet is pretty horrid. Whereas for humans morse is made of four base characters ( ‘-’ , ‘.’ , <space> , ‘/’), tokenised morse uses eighteen unique tokens to encode 26 letters + 2 separation characters. This is because of the way spaces are tokenised.
So GPT essentially has to recall from memory the spelling of the phrase, then, for each letter, recall this weird encoding built out of 18 basic tokens. (Maybe a human equivalent would be something like recalling a somewhat arbitrary but commonly used kanji-to-letters encoding, and then also recalling this weird letter-to-18-symbol code?)
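If you want to poke at the tokenisation yourself, a sketch like the following should do it. It assumes the tiktoken library and that the “p50k_base” encoding is the one text-davinci-002 uses, which I think is right but haven't verified, so the exact counts might differ.

# Sketch: see how a GPT-style BPE tokenizer splits up a bit of morse.
# Assumes tiktoken is installed and that "p50k_base" matches the model's tokenizer.
import tiktoken

enc = tiktoken.get_encoding("p50k_base")
sample = ".... . .-.. .-.. --- / .-- --- .-. .-.. -.."  # "hello world" in morse
tokens = enc.encode(sample)
print(len(tokens), "tokens,", len(set(tokens)), "distinct")
for t in tokens:
    print(t, repr(enc.decode([t])))  # each token id and the text it covers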
When the task is translated into something which avoids these tokenisation issues a bit more, GPT does a somewhat better job.
This doesn’t deal with word separation though. I tried very briefly to get it to write Python programs which can handle sentences, but it doesn’t seem to get that spaces in the original text should be encoded as “/” in morse (even if it sometimes includes “/” in its dictionary).
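For reference, the kind of program I was hoping for looks something like the sketch below (my own illustration, not GPT's output); the key detail is that letters within a word are joined by spaces and words are joined by “ / ”.

# My own sketch of the sort of sentence-level encoder I was asking GPT for.
MORSE = {
    'a': '.-',   'b': '-...', 'c': '-.-.', 'd': '-..',  'e': '.',
    'f': '..-.', 'g': '--.',  'h': '....', 'i': '..',   'j': '.---',
    'k': '-.-',  'l': '.-..', 'm': '--',   'n': '-.',   'o': '---',
    'p': '.--.', 'q': '--.-', 'r': '.-.',  's': '...',  't': '-',
    'u': '..-',  'v': '...-', 'w': '.--',  'x': '-..-', 'y': '-.--',
    'z': '--..',
}

def to_morse(sentence):
    # Letters within a word are separated by spaces; words by " / ".
    words = sentence.lower().split()
    return " / ".join(" ".join(MORSE[c] for c in word if c in MORSE)
                      for word in words)

print(to_morse("I am a robot"))  # .. / .- -- / .- / .-. --- -... --- -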
You mean it can output a correct program that does the translation, but not translate itself? That’s even weirder.
I don’t think it’s so very weird.
Argument 1: “In order to write a program to do a thing, you must yourself understand how to do the thing.”
Objection 1a: Not very true. Many not-terribly-good programmers write code that kinda-works by cobbling together things they find on the internet. I think GPT-3 does something fairly similar. Which, to be clear, is still impressive! Most humans cannot write often-kinda-working software by cobbling things together from the internet! But it is absolutely not the case that no one can write working code to do something without understanding how it works.
Objection 1b: I can write a program that calculates pi to 100 decimal places in a reasonable amount of time, but I cannot myself calculate pi to 100 decimal places without (with high probability) making mistakes along the way. (Well, as it happens I know pi to 100 decimal places, or at least have done in the past, so if that counts as “calculating” then I guess I can, but it shouldn’t.)
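For concreteness, here is one way to do it: a sketch using Machin's formula and plain integer arithmetic, not claimed to be the best or only way.

# Sketch: pi to ~100 decimal places via Machin's formula,
# pi = 16*arctan(1/5) - 4*arctan(1/239), using scaled integer arithmetic.
def arctan_recip(x, digits):
    # arctan(1/x) * 10**digits, summing the Taylor series with integers.
    scale = 10 ** digits
    power = scale // x          # (1/x), scaled
    total = power
    n, sign = 1, -1
    while power:
        power //= x * x         # next odd power of 1/x, scaled
        n += 2
        total += sign * (power // n)
        sign = -sign
    return total

def pi_digits(digits, guard=10):
    # Extra guard digits absorb truncation error; the last digit could still
    # in principle be off by one, so add more guard digits to be safe.
    d = digits + guard
    pi_scaled = 4 * (4 * arctan_recip(5, d) - arctan_recip(239, d))
    return pi_scaled // 10 ** guard

print(pi_digits(100))  # 3 followed by 100 decimal digits, as one big integer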
Argument 2: “If you can write a program to do a thing, then having written it you can execute it in your head and see what the result is.”
Objection 2a: Not very true. Many not-terribly-good programmers are surprisingly bad at executing programs in their heads. And GPT-3, in particular, is literally unable to do more than a fixed amount of computation per token it outputs. (It might be interesting to try to make it run a program in its head and make notes as it goes, which might let it get around that limitation, but I think the finite input window would then be a problem.)
Objection 2b: Again, I can write a program that computes pi to 100 decimal places but I cannot execute it in my head. I would at the very least need a substantial amount of paper to make notes on.
(If there’s some other reason why it’s weird for GPT-3 to be able to write a correct program to do a thing but not able to do the thing itself, I’m missing it.)
I tried a bit of handholding for simple program simulation, as follows:
[--- prompt begins ---]
Consider this function written in Python.
def f(n):
    if n <= 1: return n
    else: return f(n-1) + f(n-2)
What is the value of f(5)?
Answer:
Since 5 <= 1 is false, f(5) equals f(4) + f(3), so we need to know the values of those.
Since 4 <= 1 is false, f(4) equals f(3) + f(2), so we also need to know f(2).
Since 3 <= 1 is false, f(3) equals f(2) + f(1) = f(2) + 1.
Since 2 <= 1 is false, f(2) equals f(1) + f(0) = 1 + 0 = 1.
So now we can work out that f(3) = f(2) + f(1) = 1 + 1 = 2.
So f(4) = f(3) + f(2) = 2 + 1 = 3.
So f(5) = f(4) + f(3) = 3 + 2 = 5.
So: The value of f(5) is 5.
Consider this function written in Python.
def g(n):
    if n <= 1: return n
    else: return n * g(n-1)
What is the value of g(6)?
Answer:
[--- prompt ends, GPT-3 output begins ---]
Since 6 <= 1 is false, g(6) equals 6 * g(5), so we need to know the value of g(5).
Since 5 <= 1 is false, g(5) equals 5 * g(4), so we also need to know g(4).
Since 4 <= 1 is false, g(4) equals 4 * g(3), so we also need to know g(3).
Since 3 <= 1 is false, g(3) equals 3 * g(2), so we also need to know g(2).
Since 2 <= 1 is false, g(2) equals 2 * g(1), so we also need to know g(1).
Since 1 <= 1 is true, g(1) equals 1.
So: The value of g(6) is 6 * 5 * 4 * 3 * 2 * 1 = 720.
[--- GPT-3 output ends ---]
On another run, it got as far as working out g(1) and started working upwards again (after the fashion of the Fibonacci example I gave it) but sadly it stopped emitting tokens immediately before actually saying what the value of g(6) is. I bet it would have got it right :-).
… But. I next tried replacing g with a function (iterative rather than recursive) that reverses a string one character at a time and asked it for g(“robot”) … and it just spat out ‘The value of g(“robot”) is “tobor”’ without showing any of its working. That was pretty surprising and maybe argues against the idea that it has little ability to execute code “in its head”. If I ask it for g(“I am a robot”) it gives “tobora ma I”, missing one space, which seems like a fairly characteristic error on account of how GPT-3 tokenization works but also the kind of error a human might make. For “I am a very stupid robot” it gave (still with no “tracing”) the answer “tobot ypsut evir a ma I” which is starting to look much less like something a human might do.
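For concreteness, the reversal function was something along these lines; this is a sketch of the idea rather than the exact code from the prompt.

# Sketch of the iterative string-reversal function (not the exact prompt text).
def g(s):
    result = ""
    for ch in s:
        result = ch + result   # prepend each character in turn
    return result

print(g("robot"))         # tobor
print(g("I am a robot"))  # tobor a ma I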
So then I tried giving it something trickier and explicitly asking it to show its working. Still taking a string s as input, the new function starts with i = 0 and repeatedly appends s[i] to its output and then does i = (2*i+1) % len(s), stopping when i reaches 0 again. So if you give it “robot” it will look at indices 0, 1, 3, 2, 0 and should output “roob”. Here is what GPT-3 actually wrote. (I had added the words “Please show your working.” to the prompt.)
The first time through the loop, i becomes 1 and result becomes “r”.
The second time through the loop, i becomes 3 and result becomes “ro”.
The third time through the loop, i becomes 7 and result becomes “rob”.
The fourth time through the loop, i becomes 15 and result becomes “robo”.
The fifth time through the loop, i becomes 31 and result becomes “robot”.
At this point, i == 0, so the while loop breaks.
So: The value of g(“robot”) is “robot”.
This is not so impressive. It’s understood the i → 2i+1 thing but not the modulus operator. It has some sort of preference for just taking the letters of the input string in order and is doing that regardless of the actual values taken by i. I can, actually, imagine a human making these mistakes, but not a very bright one.
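For reference, the function I described is essentially the following sketch (again a reconstruction rather than the exact prompt text), and a correct run on “robot” gives “roob”.

# Reconstruction of the function described above, plus the answer GPT-3
# should have produced.
def g(s):
    i = 0
    result = ""
    while True:
        result += s[i]             # take the character at the current index
        i = (2 * i + 1) % len(s)   # 0 -> 1 -> 3 -> 2 -> 0 for a length-5 string
        if i == 0:                 # stop once the index wraps back to 0
            break
    return result

print(g("robot"))  # roob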
[EDITED to add: This is all with text-davinci-002 in the OpenAI Playground. There are some extra line breaks in my prompts that I haven’t reproduced here because my comment takes up too much vertical space already. These were all first attempts—no cherry-picking—except that the “another run” for the Fibonacci/factorial example was actually the first run and the result shown was the second.]
Isn’t this most programming jobs? Coding by reference/example: implement, get output, without understanding it intimately.