Interesting! I’m struggling to think what kind of OOD fingerprints for bad behaviour you (pl.) have in mind, other than testing fake ‘you suddenly have huge power’ situations, which are a fairly common suggestion. Very curious what you have in mind.
Also, I think it’s worth saying that the result connecting babbage to text-davinci-001 is stronger than the one connecting ada to text-ada-001 (by Δlogprob), so it feels like one shouldn’t count that as a solid success.
I wonder whether you’d find a positive rather than negative correlation of token likelihood between davinci-002 and davinci-003 if you looked at each token’s logprob rank among all tokens rather than the raw logprob, which is pushed super low by the collapse?
I would guess it’s positive. I’ll check at some point and let you know.