Awesome, thanks for writing this up!
I very much like how you give a clear account of a mechanism like “negative reinforcement suppresses text by adding contextual information to the model, and this has more consequences than just suppressing text”.
(In particular, the model isn’t just learning “don’t say that”; it’s learning “these are the things to avoid saying”, which can make it easier to point at the whole cluster?)