The idea that tokens found closest to the centroid are those that have moved the least from their initialisations during their training (because whatever it was that caused them to be tokens was curated out of their training corpus) was originally suggested to us by Stuart Armstrong. He suggested we might be seeing something analogous to “divide-by-zero” errors with these glitches.
However, we’ve ruled that out.
Although there’s a big cluster of glitch tokens in the list of closest-tokens-to-centroid, they appear at all distances. Some extremely common tokens, like “advertisement”, sit at the same kind of distance. And in the gpt2-xl model there’s a tendency for glitch tokens to be found as far as possible from the centroid, as you can see in these histograms:
They show the distribution of distances-from-centroid across token sets in the three models we studied: the upper histograms represent only the 133 anomalous tokens, while the lower histograms cover the full set of 50,257 tokens. The spikes above can just be seen as little bumps below, which gives a sense of scale.
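If you want to reproduce this kind of plot yourself, here’s a minimal sketch for gpt2-small using the Hugging Face transformers library. The ANOMALOUS_IDS list is a placeholder, not the actual set of 133 anomalous token indices, and the exact shapes will depend on the model weights and distance metric used:

```python
# Minimal sketch: distance-from-centroid histograms for gpt2-small.
# ANOMALOUS_IDS is a placeholder list of token indices, not the real
# set of 133 anomalous tokens discussed here.
import torch
import matplotlib.pyplot as plt
from transformers import GPT2Model

ANOMALOUS_IDS = [100, 200, 300]  # hypothetical example indices

model = GPT2Model.from_pretrained("gpt2")            # gpt2-small
emb = model.get_input_embeddings().weight.detach()   # shape (50257, 768)

centroid = emb.mean(dim=0)
dists = torch.norm(emb - centroid, dim=1)            # Euclidean distance per token

fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True)
ax_top.hist(dists[ANOMALOUS_IDS].numpy(), bins=50)
ax_top.set_title("anomalous tokens")
ax_bottom.hist(dists.numpy(), bins=200)
ax_bottom.set_title("all 50,257 tokens")
ax_bottom.set_xlabel("distance from embedding centroid")
plt.tight_layout()
plt.show()
```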
The ' gmaxwell' token is very close to the median distance from centroid in the gpt2-small model: its distance is 3.2602, in a range running from 1.5366 to 4.826. It’s only moderately closer to the centroid in the gpt2-xl and gpt2-j models. The ' petertodd' token is closer to the centroid in gpt2-j (no. 74 in the closest tokens list), but pretty average-distanced in the other two models.
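Checking a single token’s numbers works along the same lines; here’s a sketch for gpt2-small (the “Ġ” prefix is the BPE tokeniser’s marker for a leading space, and the exact figures won’t match those above unless the same weights and distance metric are used):

```python
# Sketch: distance-from-centroid and rank for one token in gpt2-small.
# "Ġgmaxwell" is the vocabulary entry for the ' gmaxwell' token.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
emb = model.get_input_embeddings().weight.detach()

centroid = emb.mean(dim=0)
dists = torch.norm(emb - centroid, dim=1)

token_id = tok.convert_tokens_to_ids("Ġgmaxwell")
rank = int((dists < dists[token_id]).sum())   # 0 = closest token to centroid

print(f"distance: {dists[token_id].item():.4f}")
print(f"rank:     {rank} of {len(dists)} (closest first)")
print(f"range:    {dists.min().item():.4f} to {dists.max().item():.4f}")
print(f"median:   {dists.median().item():.4f}")
```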
Could the fact that ' petertodd' is one of the closest tokens to the embedding centroid for at least one model, while ' gmaxwell' isn’t, tell us something about why ' petertodd' produces such intensely weird outputs while ' gmaxwell' glitches in a much less remarkable way?
We can’t know yet, because ultimately this positional information in the GPT-2 and GPT-J embedding spaces tells us nothing about why ' gmaxwell' glitches out GPT-3 models. We don’t have access to the GPT-3 embeddings data. Only someone at OpenAI with access to it could clarify the extent to which the glitchiness of glitch tokens (a more variable phenomenon than we originally thought) correlates with distance-from-centroid in the embedding space of the model they’re glitching.