GPT gives the token “216” in the string “63 = 216” a very low probability, just as low as “215” or “217”.
Replacing “63” with “62” in the prompt still gives “216” as an output with ~10% probability.
Would the tokenizer behave differently given “216” and “2^16”, e.g. giving respectively the single token “216” and some tokens like “2” and “16”? That would explain the result: GPT of course knows that 216 isn’t 63, but it has been forced to predict a relationship like “2” + “16” = “63”.
The Codex tokenizer used by the GPT-3.5 models tokenizes them differently: “216” is 1 token, “2^16” is 3 (“2”, “^”, “16”). Note that “ 216” (with a leading space) is a different token, and it’s the one text-davinci-003 actually wants to predict (you’ll often see 100× probability ratios between these two tokens).
Here are the log probs of the two sequences using Adam’s prompt above, with the trailing space removed (which is what he did in his actual setup; otherwise you get different probabilities):

“ 216” → −15.91
“2^16” → −1.34
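For anyone who wants to reproduce this kind of measurement: with the legacy Completions API you can score a fixed continuation by echoing it back with zero new tokens and summing the returned per-token logprobs. A minimal sketch, assuming the pre-1.0 openai Python client; the prompt and continuations in the usage note are placeholders, not Adam’s actual prompt:

```python
# Sketch of the "echo" scoring trick on the legacy OpenAI Completions API
# (openai<1.0): the model generates nothing, we just read the logprobs it
# assigns to the text we sent in.

def continuation_logprob(logprobs_obj, prompt_len):
    """Sum the logprobs of the tokens starting at or after `prompt_len` chars.

    `logprobs_obj` is choices[0]["logprobs"]; the first entry of
    `token_logprobs` is None, since the very first token of the text has
    no conditional probability, so we skip None values.
    """
    total = 0.0
    for offset, lp in zip(logprobs_obj["text_offset"],
                          logprobs_obj["token_logprobs"]):
        if offset >= prompt_len and lp is not None:
            total += lp
    return total

def score(prompt, continuation):
    """Total logprob of `continuation` given `prompt`.

    Requires the legacy client (pip install "openai<1.0") and an API key;
    text-davinci-003 itself has since been retired.
    """
    import openai
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt + continuation,
        max_tokens=0,   # generate nothing...
        echo=True,      # ...but return the input annotated with logprobs
        logprobs=0,
    )
    return continuation_logprob(resp["choices"][0]["logprobs"], len(prompt))
```

Usage would look like `score("63 =", " 216")` vs `score("63 =", " 2^16")` (hypothetical prompt; note the trailing space is left off the prompt and carried by the continuation, matching the setup described above).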