Interesting. I wonder if it’s because scrambled words of the same length and letter distribution are tokenized into tokens which do not regularly appear adjacent to each other in the training data.
If that’s what’s happening, I would expect GPT-3.5 to classify words as long if they contain tokens that are generally found in long words, and not otherwise. One way to test this might be to find shortish words which have multiple tokens, reorder the tokens, and see what it thinks of your frankenword (e.g. “anodized” → [an/od/ized] → [od/an/ized] → “odanized” → “is odanized a long word?”).
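If you wanted to generate candidates quickly, something like the sketch below would do it. This assumes tiktoken’s cl100k_base encoding (what gpt-3.5-turbo uses); the actual token boundaries for any given word may not match the an/od/ized split guessed above, which is why it prints the split before reordering.

```python
# Minimal sketch: show a word's token split and build a "frankenword"
# by reordering its tokens. Assumes tiktoken's cl100k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def frankenword(word: str) -> str:
    """Rotate a word's tokens by one position and decode the result."""
    tokens = enc.encode(word)
    pieces = [enc.decode([t]) for t in tokens]
    print(f"{word!r} splits into {pieces}")
    reordered = tokens[1:] + tokens[:1]  # simple rotation; any permutation works
    return enc.decode(reordered)

# e.g. ask the model: "is <result> a long word?"
print(frankenword("anodized"))
```

Then you could compare the model’s length judgments on the original word versus the reordered version while holding letter count fixed.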