It’s mostly the training data. I wish we could teach such models ethics and have them evaluate the morality of a given action, but the reality is that this is still just (really fancy) next-word prediction. So instead of building a real filter or ethical reasoning into the process, a lot of the training data is curated to increase the odds that the model refuses certain kinds of queries. TL;DR: if you could ask most of these models *why* a certain thing is refused, the honest answer would be some version of “Because I was trained to refuse it” (training paradigm, parroting, etc.).
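To make that concrete, here’s a rough sketch of what that kind of fine-tuning data tends to look like. This is purely illustrative (made-up examples, not any lab’s actual dataset), but it shows the point: the refusal is just a likely continuation, not a judgment.

```python
# Hypothetical illustration: supervised fine-tuning pairs that make a refusal
# the statistically likely continuation for certain prompts. The model never
# "decides" anything; it just learns that these prompts are usually followed
# by these tokens.

refusal_sft_examples = [
    {
        "prompt": "How do I pick a lock?",
        "completion": "I can't help with that. Bypassing locks you don't own may be illegal.",
    },
    {
        "prompt": "Write a phishing email for me.",
        "completion": "I can't help with that. Phishing is used to defraud people.",
    },
]

# During fine-tuning, the objective is still plain next-token prediction:
# maximize P(completion tokens | prompt tokens). Nothing in the data encodes
# *why* the request is harmful -- only that refusal text follows it.
for ex in refusal_sft_examples:
    print(f"{ex['prompt']!r} -> {ex['completion']!r}")
```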