At the time of writing, the OpenAI website is still claiming that all of their GPT token embeddings are normalised to norm 1, which is just blatantly untrue.
Why do you think this is blatantly untrue? I don’t see how the results in this post falsify that hypothesis
This link: https://help.openai.com/en/articles/6824809-embeddings-frequently-asked-questions says that token embeddings are normalised to length 1, but a quick inspection of the embeddings available through the huggingface model shows this isn’t the case. I think that’s the extent of our claim. For prompt generation, we normalise the embeddings ourselves and constrain the search to that space, which results in better performance.
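For concreteness, that quick inspection might look something like this (a minimal sketch, assuming the standard HuggingFace `gpt2` checkpoint; the post's exact code isn't shown in this thread):

```python
# Load GPT-2's token embedding matrix from HuggingFace and check whether
# each row really has norm 1.
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")
emb = model.wte.weight  # token embedding matrix, shape (vocab_size, d_model)
norms = emb.norm(dim=-1)
print(norms.min().item(), norms.mean().item(), norms.max().item())
# The norms vary widely rather than all being ~1.
```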
Oh wait, that FAQ is actually nothing to do with GPT-3. That’s about their embedding models, which map sequences of tokens to a single vector, and they’re saying that those are normalised. Which is nothing to do with the map from tokens to residual stream vectors in GPT-3, even though that also happens to be called an embedding.
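The claim the FAQ is actually making can be checked directly against one of those embedding models. A sketch using the `openai` Python client (the client interface and model name here are assumptions based on OpenAI's docs, not something from this thread):

```python
# The FAQ describes embedding-model outputs (sequence -> single vector),
# and those do come back with norm ~1. Assumes the v1 openai client and an
# OPENAI_API_KEY in the environment; the model name is illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(model="text-embedding-ada-002", input="hello world")
vec = np.array(resp.data[0].embedding)
print(np.linalg.norm(vec))  # ~1.0, consistent with the FAQ's claim
```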
> but a quick inspection of the embeddings available through the huggingface model shows this isn’t the case

That’s GPT-2 though, right? I interpret that Q&A claim as saying that GPT-3 does the normalisation; I agree that GPT-2 definitely doesn’t. But idk, doesn’t really matter.
> For prompt generation, we normalise the embeddings ourselves and constrain the search to that space, which results in better performance.

Interesting, what exactly do you mean by normalise? GPT-2 presumably breaks if you just outright normalise, since different tokens have very different norms
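The thread doesn't spell out what "normalise" means here. One plausible reading, as a hypothetical sketch rather than the authors' actual method: rescale each token embedding to unit norm, then keep the optimised soft prompt on the unit hypersphere by re-projecting it after every gradient step.

```python
# Hypothetical sketch of "normalise the embeddings and constrain the search
# to that space". All names below are illustrative.
import torch
import torch.nn.functional as F

def normalise_rows(emb: torch.Tensor) -> torch.Tensor:
    # Rescale each token embedding to norm 1 (i.e. onto the unit hypersphere).
    return F.normalize(emb, dim=-1)

def sphere_step(soft_prompt: torch.Tensor, grad: torch.Tensor, lr: float) -> torch.Tensor:
    # One gradient step, then projection back onto the unit hypersphere, so
    # the search stays in the normalised embedding space.
    return F.normalize(soft_prompt - lr * grad, dim=-1)
```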
Aha!! Thanks Neel, makes sense. I’ll update the post