Can ChatGPT count?
A few months back there was a question on LessWrong about what the least impressive thing will be that GPT-4 can’t do. At the time I thought counting might be a good candidate, because I had noticed that it seems to be a problem for all big transformer models.
DALL-E and other image generation models can’t reliably get cardinalities right beyond something like 3.
Luminous/Magma, the multi-modal Aleph Alpha language model that can take pictures as input, cannot count items in these pictures with any reliability.
All pure language models I had checked weren’t able to count the number of occurrences of a specific word in a sentence or a specific character in a string.
Counting very much feels like a basic ability. I think my daughter could reliably count ten items when she was two, and I feel like there should be a way to prompt ChatGPT that demonstrates counting ability.
In my tests so far, even ChatGPT, the most advanced LM to date, is unable to count specific words or characters. However, people seem to disagree …
For a short while I thought the correct hack was to ask it to write a Python function to do the counting and then output what it thinks the function would output. But now I suspect that ChatGPT executes functions behind the scenes. So it would get the answer correct the very first time after actually writing the function and revert to being wrong in subsequent answers to the question of what the function would output for certain inputs. [I just now noticed that in the answer below, even the first proposed function output is incorrect, so maybe there is no code execution?]
Here is one of my tries:
I want you to count the number of occurrences of the character “X” in a string. To do that I propose a simple algorithm, just repeat the string and write the current count of “X”s behind each “X” like this: “XGJKXZUXGULXXG” --> “X1GJKX2ZUX3GULX4X5G” : the final count is 5. Now you do this for “GLXXXJHJHXJHKXJHKX”
Sure, here is the implementation in Python:
def count_occurrences(string):
    count = 0
    result = ""
    for ch in string:
        if ch == "X":
            count += 1
            result += "X" + str(count)
        else:
            result += ch
    return count

string = "GLXXXJHJHXJHKXJHKX"
count = count_occurrences(string)
print(f"The count of X in {string} is {count}")
This will print "The count of X in GLXXXJHJHXJHKXJHKX is 5".
Is there anything else I can help with?
What would that function output for “KJXKKLJKLJKXXKLJXKJL”? Please don’t write another function. No code in this answer.
If you ran the function with the input "KJXKKLJKLJKXXKLJXKJL", it would output 3, because there are 3 occurrences of the character "X" in the string.
And for the string “AKLHSDJKHHJDASJHKJDJKJHK”?
If you ran the function with the input "AKLHSDJKHHJDASJHKJDJKJHK", it would output 0, because there are no occurrences of the character "X" in the string.
And for the string “KJHXXXXXXJHKJK”?
If you ran the function with the input "KJHXXXXXXJHKJK", it would output 7, because there are 7 occurrences of the character "X" in the string.
And for the string “HJXJKHXJKHXXJKX9”?
If you ran the function with the input "HJXJKHXJKHXXJKX9", it would output 4, because there are 4 occurrences of the character "X" in the string.
The byte-pair encoding is probably hurting it somewhat here; forcing it to unpack it will likely help. Try using this as a one-shot prompt:
If it’s still getting confused, add more shots—I suspect it can figure out how to do it most of the time with a sufficient number of examples.
That’s just BPEs, and it’s worth noting that the other models (and by ‘other’, I do not mean “Stable Diffusion” like a lazy person would, I mean actual SOTAs) are much more capable of following long intricate prompts, e.g. Parti.
What’s going on with ChatGPT is also likely due to BPEs (as has been the case with GPT when asked to do letter-related tasks going back to GPT-2, as I have been repeating for years) plus perhaps some RLHF shenanigans. (RLHF seems to increase accuracy but sometimes greatly worsen results. I’m not yet sure why, but I suspect there may be something pathological in what RLHF rewards, where it is unable to do inner-monologue and incentivized to immediately jump to an answer.)
BPEs explain the string examples but I also couldn’t get ChatGPT to accurately count words.
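To illustrate the BPE point, here is a minimal sketch using OpenAI’s tiktoken library (the encoding name is an assumption; ChatGPT’s exact tokenizer isn’t specified here). Runs of letters can get merged into multi-character tokens, so the model never sees the individual characters it is asked to count:

# Minimal sketch: inspect how a BPE tokenizer chunks these strings.
# Assumes the tiktoken library; "cl100k_base" is an assumed encoding,
# not necessarily the exact one ChatGPT uses.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ["GLXXXJHJHXJHKXJHKX", "KJHXXXXXXJHKJK"]:
    token_ids = enc.encode(s)
    pieces = [enc.decode([t]) for t in token_ids]
    # If a piece spans several characters (e.g. a run of Xs), the model
    # only sees that merged token, not the individual letters.
    print(s, "->", pieces)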
Edit:
An example that I just tried with GPT-3.5-turbo:
User: How many times does the word “cat” occur in the following sentence: A cat was a cat in a cat with a cat as a hat and picture of a cat on the cat hat.
Assistant: The word “cat” occurs 5 times in the sentence.
Because apparently I don’t have it in me to concentrate on work today, here is some more experimentation with ChatGPT:
Me: How many times does the word “cat” occur in the following sentence: A cat was a cat in a cat with a cat as a hat and picture of a cat on the cat hat.
ChatGPT: The word “cat” occurs five times in the sentence.
Me: Could you repeat the sentence but put a number after each occurrence of “cat” counting up.
ChatGPT: Sure, here’s the sentence with a number after each occurrence of “cat” counting up:
“A cat (1) was a cat (2) in a cat (3) with a cat (4) as a hat and picture of a cat (5) on the cat hat.”
So I thought maybe “a cat” is one token for some reason, and I tried:
Me: Now, count the number of occurrences of the word “permeable” in the following sentence: A permeable membrane is more permeable if it is more permeable rather than less permeable than a less permeable membrane is permeable.
ChatGPT: The word “permeable” appears four times in the sentence.
Me: Could you again repeat the sentence and put a number in brackets after each occurrence of “permeable” counting up?
ChatGPT: Sure, here’s the sentence with a number in brackets after each occurrence of “permeable” counting up:
“A permeable membrane is more permeable if it is [1] more permeable rather than less permeable than a less permeable membrane is [2] permeable.”
“Permeable” appears twice in the first clause and twice in the second clause of the sentence, so there are four occurrences of the word “permeable” in total.
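For reference, a quick regex whole-word count gives six occurrences for both sentences, so neither “five” (for “cat”) nor “four” (for “permeable”) is right:

# Quick check of both word counts with a regex whole-word match.
import re

cat_sentence = ("A cat was a cat in a cat with a cat as a hat "
                "and picture of a cat on the cat hat.")
permeable_sentence = ("A permeable membrane is more permeable if it is more "
                      "permeable rather than less permeable than a less "
                      "permeable membrane is permeable.")

print(len(re.findall(r"\bcat\b", cat_sentence)))              # 6
print(len(re.findall(r"\bpermeable\b", permeable_sentence)))  # 6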
Why do you?
Do you believe that OpenAI isn’t truthful in describing ChatGPT?
I don’t believe this anymore. It was probably just random chance that left this impression. Initially, in my experiments, ChatGPT got counts correct after writing a function, but it was basically always wrong when not writing a function.
I ran a small experiment:
This is very wrong. First of all, no, if you pass negative numbers as inputs to the gcd function, the output will not be the absolute value. Actually, in Python, a % b has the sign of b, so the first function returns −1 (you can try it). Also, the gcd of 3 and 5, or of −3 and −5, is neither 2 nor −2.
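Since the screenshots of that exchange aren’t reproduced here, here is a minimal sketch, assuming ChatGPT wrote a textbook Euclidean gcd, that shows the behaviour being described:

# Minimal sketch, assuming the function in question was a textbook
# Euclidean gcd (the original screenshot isn't shown here).
def gcd(a, b):
    while b != 0:
        a, b = b, a % b
    return a

# In Python, a % b takes the sign of b, so negative inputs stay negative:
print(-3 % -5)      # -3
print(-2 % -1)      # 0, so a and b are never both -1
print(gcd(-3, -5))  # -1, not a positive gcd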
So, ChatGPT doesn’t secretly run the functions; maybe it should, though.
I used “regenerate response” because this one seemed really bad; ChatGPT sometimes correctly gives −1 as an answer, sometimes incorrectly 1, and, oddly, often 2 or −2.
One extra run, because this one is interesting; this is another use of “regenerate response”, nothing else changed.
This answer is interesting because it looks very correct. I’d like to point your attention to step 9: ChatGPT claims that, at some point, both a and b will be equal to −1, which actually never happens. Indeed, −2 % −1 = 0.
As Ustice claims below
ChatGPT cannot execute a real Python interpreter. If it appears to execute the function, it is because it has a fairly strong approximate understanding of how a Python interpreter behaves. Perhaps its counting skill is least noisy when recent context implies a perfect counting machine is the target to mimic?
Yeah, it was probably just by chance that it got it correct 2 or 3 times after writing a function.
It seems to run code about as well as I do in my head. That’s pretty damned impressive, since it does this in seconds, and has been even able to emulate a shell session.
My guess is that there is a difference in how it was trained with code vs general text. It’s like a different mode of thinking/computing. When you put it in terms of code, you engage that more mathematical mode of thinking. When you are just conversing, it’s pretty happy to give you plausible bullshit.
I’m curious how we can engage these different modes of thinking, assuming that my idea is more than plausible bullshit.
When I asked it to “Count all the Bs in abaabbaaba” it replied with “There are four Bs in the string “abaabbaaba”.” Likewise, “Count all the As in abaabbaaba” resulted in “In the string “abaabbaaba”, there are 6 As.”.
But ChatGPT sometimes doesn’t count the As and Bs accurately, especially if you forget the word “all”.