James Payor comments on LLMs Sometimes Generate Purely Negatively-Reinforced Text

James Payor 16 Jun 2023 19:14 UTC
LW: 13 AF: 9
Awesome, thanks for writing this up!
I very much like how you give a clear account of a mechanism like “negative reinforcement suppresses text by adding contextual information to the model, and this has more consequences than just suppressing text”.
(In particular, the model isn’t learning “just don’t say that”, it’s learning “these are the things to avoid saying”, which can make it easier to point at the whole cluster?)