Hoagy comments on Anomalous tokens reveal the original identities of Instruct models

Hoagy 9 Feb 2023 13:42 UTC
LW: 1 AF: 1
Interesting! I’m struggling to think what kind of OOD fingerprints for bad behaviour you (pl.) have in mind, other than testing fake ‘you suddenly have huge power’ situations, which are quite commonly suggested, so I’m very curious what you have in mind.
Also, I think it’s worth saying that the result connecting babbage to text-davinci-001 is stronger than the one connecting ada to text-ada-001 (by $\Delta$ logprob), so it feels like the first one shouldn’t count as a solid success.
I wonder whether you’d find a positive rather than negative correlation of token likelihood between davinci-002 and davinci-003 when looking at ranking logprob among all tokens rather than raw logprob which is pushed super low by the collapse?
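
A toy sketch of the raw-versus-rank comparison asked about above, assuming we already have one logprob per token from each model. The numbers are made up for illustration and are not real davinci-002 or davinci-003 outputs, and the helper names (`ranks`, `pearson`, `rank_correlation`) are hypothetical. It only shows that a single token pushed to a very low logprob can make the raw correlation negative while the rank correlation stays positive.

```python
import numpy as np

def ranks(values):
    """Return the rank of each entry (0 = lowest value); assumes no ties."""
    order = np.argsort(values, kind="stable")
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r

def pearson(x, y):
    """Pearson correlation of the raw values."""
    return float(np.corrcoef(x, y)[0, 1])

def rank_correlation(x, y):
    """Spearman-style correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up logprobs for the same six tokens under two hypothetical models.
# Model B mostly agrees with model A, but one token has collapsed to a
# very low logprob.
logprobs_a = np.array([-6.0, -5.0, -4.0, -3.0, -2.0, -1.0])
logprobs_b = np.array([-6.0, -5.0, -4.0, -3.0, -2.0, -107.0])

raw = pearson(logprobs_a, logprobs_b)             # dominated by the outlier
ranked = rank_correlation(logprobs_a, logprobs_b) # outlier counts as one rank

print(f"raw correlation:  {raw:+.3f}")
print(f"rank correlation: {ranked:+.3f}")
```

Whether the real models behave like this is exactly the open question; the sketch only illustrates why the two measures can disagree in sign.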
- janus 9 Feb 2023 17:29 UTC
  LW: 2 AF: 1
  
  I wonder whether you’d find a positive rather than negative correlation of token likelihood between davinci-002 and davinci-003 when looking at ranking logprob among all tokens rather than raw logprob which is pushed super low by the collapse?
  
  I would guess it’s positive. I’ll check at some point and let you know.