To test out Cursor for fun I asked models whether various words of different lengths were “long” and measured the relative probability of “Yes” vs. “No” answers to get a P(long) out of them. But when I used scrambled words with the same length and letter distribution, GPT-3.5 didn’t think any of them were long.
Update: I got Claude to generate many words with connotations related to long (“mile” or “anaconda” or “immeasurable”) and short (“wee” or “monosyllabic” or “inconspicuous” or “infinitesimal”). It looks like the models have a slight bias toward the connotation of the word.
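For concreteness, here is a minimal sketch of how such a P(long) measurement could be implemented. This is not the code in the linked repo; it assumes the `openai` Python client, a chat model that returns top logprobs for its first output token, and an illustrative prompt and helper name.

```python
# Rough sketch of estimating P(long) from the relative probability of
# "Yes" vs. "No" first tokens; not the repo's actual code.
import math
from openai import OpenAI

client = OpenAI()

def p_long(word: str, model: str = "gpt-3.5-turbo") -> float:
    """Estimate P(long) as P(Yes) / (P(Yes) + P(No)) over the first answer token."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f'Is "{word}" a long word? Answer Yes or No.'}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=10,
    )
    # Sum probability mass over top candidate tokens that read as "Yes" or "No".
    mass = {"Yes": 0.0, "No": 0.0}
    for entry in resp.choices[0].logprobs.content[0].top_logprobs:
        key = entry.token.strip()
        if key in mass:
            mass[key] += math.exp(entry.logprob)
    total = mass["Yes"] + mass["No"]
    return mass["Yes"] / total if total > 0 else float("nan")

print(p_long("immeasurable"))
```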
Just flagging that for humans, a “long” word might mean a word that’s long to pronounce rather than long to write (i.e., ~number of syllables instead of number of letters).
It’s interesting how Llama 2 is the most linear: it’s keeping track of a wider range of lengths. Whereas GPT-4 transitions sharply from “short” to “long” around 5–8 characters, presumably because humans will consider just about any word above ~8 characters “long.”
Interesting. I wonder if it’s because scrambled words of the same length and letter distribution are tokenized into tokens which do not regularly appear adjacent to each other in the training data.
If that’s what’s happening, I would expect GPT-3.5 to classify words as long if they contain tokens that are generally found in long words, and not otherwise. One way to test this might be to find shortish words that have multiple tokens, reorder the tokens, and see what it thinks of your frankenword (e.g. “anodized” → [an/od/ized] → [od/an/ized] → “odanized” → “is odanized a long word?”).
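A hypothetical sketch of both checks, assuming the tiktoken library and the cl100k_base encoding; the exact token splits depend on the tokenizer, so “anodized” may not split exactly as [an/od/ized]:

```python
# Hypothetical sketch (assumes tiktoken and the cl100k_base encoding used by
# GPT-3.5/GPT-4). Token splits in comments are illustrative only.
import random
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Scrambled-word hypothesis: a scramble of the same letters will often split
# into more (and rarer) tokens than the original word does.
word = "immeasurable"
scramble = "".join(random.sample(word, len(word)))
print(word, "->", len(enc.encode(word)), "tokens")
print(scramble, "->", len(enc.encode(scramble)), "tokens")

def frankenword(w: str) -> str | None:
    """Swap the first two tokens of a multi-token word and decode the result."""
    token_ids = enc.encode(w)
    if len(token_ids) < 2:
        return None  # single-token words can't be reordered
    reordered = [token_ids[1], token_ids[0]] + token_ids[2:]
    return enc.decode(reordered)

# e.g. if "anodized" splits as [an/od/ized], this gives "odanized", which can
# then be fed into the "is X a long word?" question above.
print(frankenword("anodized"))
```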
What did you actually ask the models? Could it be that it says “diuhgikthiusgsrbxtb” is not a long word because it is not a word?
Code: https://github.com/ArjunPanickssery/long_short