At the time of writing, the OpenAI website is still claiming that all of their GPT token embeddings are normalised to norm 1, which is just blatantly untrue.
Why do you think this is blatantly untrue? I don’t see how the results in this post falsify that hypothesis
This link: https://help.openai.com/en/articles/6824809-embeddings-frequently-asked-questions says that token embeddings are normalised to length 1, but a quick inspection of the embeddings available through the huggingface model shows this isn’t the case. I think that’s the extent of our claim. For prompt generation, we normalise the embeddings ourselves and constrain the search to that space, which results in better performance.
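For concreteness, that quick inspection might look something like this (a minimal sketch, assuming the standard HuggingFace `gpt2` checkpoint; the post's exact code isn't shown in this thread):

```python
# Load GPT-2's token embedding matrix from HuggingFace and check whether
# each row really has norm 1.
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")
emb = model.wte.weight  # token embedding matrix, shape (vocab_size, d_model)
norms = emb.norm(dim=-1)
print(norms.min().item(), norms.mean().item(), norms.max().item())
# The norms vary widely rather than all being ~1.
```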
Oh wait, that FAQ is actually nothing to do with GPT-3. That’s about their embedding models, which map sequences of tokens to a single vector, and they’re saying that those are normalised. Which is nothing to do with the map from tokens to residual stream vectors in GPT-3, even though that also happens to be called an embedding.
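The claim the FAQ is actually making can be checked directly against one of those embedding models. A sketch using the `openai` Python client (the client interface and model name here are assumptions based on OpenAI's docs, not something from this thread):

```python
# The FAQ describes embedding-model outputs (sequence -> single vector),
# and those do come back with norm ~1. Assumes the v1 openai client and an
# OPENAI_API_KEY in the environment; the model name is illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(model="text-embedding-ada-002", input="hello world")
vec = np.array(resp.data[0].embedding)
print(np.linalg.norm(vec))  # ~1.0, consistent with the FAQ's claim
```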
> but a quick inspection of the embeddings available through the huggingface model shows this isn’t the case

That’s GPT-2 though, right? I interpret that Q&A claim as saying that GPT-3 does the normalisation; I agree that GPT-2 definitely doesn’t. But idk, doesn’t really matter.
> For prompt generation, we normalise the embeddings ourselves and constrain the search to that space, which results in better performance.

Interesting, what exactly do you mean by normalise? GPT-2 presumably breaks if you just outright normalise, since different tokens have very different norms
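The thread doesn't spell out what "normalise" means here. One plausible reading, as a hypothetical sketch rather than the authors' actual method: rescale each token embedding to unit norm, then keep the optimised soft prompt on the unit hypersphere by re-projecting it after every gradient step.

```python
# Hypothetical sketch of "normalise the embeddings and constrain the search
# to that space". All names below are illustrative.
import torch
import torch.nn.functional as F

def normalise_rows(emb: torch.Tensor) -> torch.Tensor:
    # Rescale each token embedding to norm 1 (i.e. onto the unit hypersphere).
    return F.normalize(emb, dim=-1)

def sphere_step(soft_prompt: torch.Tensor, grad: torch.Tensor, lr: float) -> torch.Tensor:
    # One gradient step, then projection back onto the unit hypersphere, so
    # the search stays in the normalised embedding space.
    return F.normalize(soft_prompt - lr * grad, dim=-1)
```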
Aha!! Thanks Neel, makes sense. I’ll update the post