The Codex tokenizer used by the GPT-3.5 models tokenizes them differently: “216” is 1 token, while “2^16” is 3 (“2”, “^”, “16”). Note that “ 216” (with a leading space) is a different token, and it’s what text-davinci-003 actually wants to predict (you’ll often see 100x probability ratios between these two tokens).
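You can check these token counts yourself with the tiktoken library, which maps text-davinci-003 to its p50k_base (Codex) encoding; a minimal sketch:

```python
import tiktoken

# text-davinci-003 uses the p50k_base encoding (the Codex tokenizer)
enc = tiktoken.encoding_for_model("text-davinci-003")

for s in ["216", " 216", "2^16"]:
    ids = enc.encode(s)
    # Print the token count and each token's text, so you can see that
    # "216" and " 216" are single (distinct) tokens while "2^16" splits in 3
    print(f"{s!r:8} -> {len(ids)} token(s): {[enc.decode([i]) for i in ids]}")
```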
Here are the log probs of the two sequences using Adam’s prompt above, with the trailing space removed (which is what he did in his actual setup; otherwise you get different probabilities):

2 16 → −15.91
2^16 → −1.34
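For reference, here’s a sketch of how sequence log probs like these can be computed with the legacy (pre-1.0) openai Python client, by scoring the whole text with echo=True and max_tokens=0 and summing the completion tokens’ logprobs (text-davinci-003 has since been retired, and adams_prompt is a hypothetical stand-in for the prompt referenced above):

```python
import openai  # legacy (pre-1.0) client interface

def sequence_logprob(prompt: str, completion: str) -> float:
    # Ask the model to score prompt+completion without generating anything
    # new: echo=True returns logprobs for the input tokens themselves.
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt + completion,
        max_tokens=0,
        echo=True,
        logprobs=1,
    )
    lp = resp["choices"][0]["logprobs"]
    # text_offset gives each token's character position; tokens starting
    # before len(prompt) belong to the prompt, the rest to the completion.
    n_prompt_tokens = sum(1 for off in lp["text_offset"] if off < len(prompt))
    # Sum logprobs over the completion tokens only. (The very first token's
    # logprob is None, but it always falls inside the prompt region here.)
    return sum(lp["token_logprobs"][n_prompt_tokens:])

# Hypothetical usage, assuming adams_prompt holds Adam's actual prompt:
# print(sequence_logprob(adams_prompt, "2 16"))   # ≈ −15.91 per the above
# print(sequence_logprob(adams_prompt, "2^16"))   # ≈ −1.34
```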