To test out Cursor for fun I asked models whether various words of different lengths were “long” and measured the relative probability of “Yes” vs. “No” answers to get a P(long) out of them. But when I used scrambled words with the same length and letter distribution, GPT-3.5 didn’t think any of them were long.
Update: I got Claude to generate many words with connotations related to long (“mile” or “anaconda” or “immeasurable”) and short (“wee” or “monosyllabic” or “inconspicuous” or “infinitesimal”). It looks like the models have a slight bias toward the connotation of the word.
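For concreteness, here is a minimal sketch of how such a P(long) measurement could be implemented. This is not the code in the linked repo; it assumes the `openai` Python client, a chat model that returns top logprobs for its first output token, and an illustrative prompt and helper name.

```python
# Rough sketch of estimating P(long) from the relative probability of
# "Yes" vs. "No" first tokens; not the repo's actual code.
import math
from openai import OpenAI

client = OpenAI()

def p_long(word: str, model: str = "gpt-3.5-turbo") -> float:
    """Estimate P(long) as P(Yes) / (P(Yes) + P(No)) over the first answer token."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f'Is "{word}" a long word? Answer Yes or No.'}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=10,
    )
    # Sum probability mass over top candidate tokens that read as "Yes" or "No".
    mass = {"Yes": 0.0, "No": 0.0}
    for entry in resp.choices[0].logprobs.content[0].top_logprobs:
        key = entry.token.strip()
        if key in mass:
            mass[key] += math.exp(entry.logprob)
    total = mass["Yes"] + mass["No"]
    return mass["Yes"] / total if total > 0 else float("nan")

print(p_long("immeasurable"))
```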
Just flagging that for humans, a “long” word might mean a word that’s long to pronounce rather than long to write (i.e., ~number of syllables instead of number of letters).
It’s interesting how Llama 2 is the most linear: it’s keeping track of a wider range of lengths. Whereas GPT-4 transitions sharply from “short” to “long” around 5–8 characters, presumably because humans will consider just about any word above ~8 characters “long.”
Interesting. I wonder if it’s because scrambled words of the same length and letter distribution are tokenized into tokens which do not regularly appear adjacent to each other in the training data.
If that’s what’s happening, I would expect GPT-3.5 to classify words as long if they contain tokens that are generally found in long words, and not otherwise. One way to test this might be to find shortish words that have multiple tokens, reorder the tokens, and see what it thinks of your frankenword (e.g. “anodized” → [an/od/ized] → [od/an/ized] → “odanized” → “is odanized a long word?”).
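A hypothetical sketch of both checks, assuming the tiktoken library and the cl100k_base encoding; the exact token splits depend on the tokenizer, so “anodized” may not split exactly as [an/od/ized]:

```python
# Hypothetical sketch (assumes tiktoken and the cl100k_base encoding used by
# GPT-3.5/GPT-4). Token splits in comments are illustrative only.
import random
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Scrambled-word hypothesis: a scramble of the same letters will often split
# into more (and rarer) tokens than the original word does.
word = "immeasurable"
scramble = "".join(random.sample(word, len(word)))
print(word, "->", len(enc.encode(word)), "tokens")
print(scramble, "->", len(enc.encode(scramble)), "tokens")

def frankenword(w: str) -> str | None:
    """Swap the first two tokens of a multi-token word and decode the result."""
    token_ids = enc.encode(w)
    if len(token_ids) < 2:
        return None  # single-token words can't be reordered
    reordered = [token_ids[1], token_ids[0]] + token_ids[2:]
    return enc.decode(reordered)

# e.g. if "anodized" splits as [an/od/ized], this gives "odanized", which can
# then be fed into the "is X a long word?" question above.
print(frankenword("anodized"))
```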
What did you actually ask the models? Could it be that it says “diuhgikthiusgsrbxtb” is not a long word because it is not a word?
Code: https://github.com/ArjunPanickssery/long_short