TurnTrout comments on Scale Was All We Needed, At First

TurnTrout 20 Feb 2024 2:33 UTC
7 points
2
not saying naughty words
Narrow note (and acknowledging you might not share or advance the views of any particular character): I’ve heard this several times (e.g. by Nate as well), but I think it undersells the amazing engineering miracle of models which generalize instruction-following and “not doing bad things” quite well. Amazingly well, from some priors.
If we say “RLHF teaches it to not say naughty words”, that sounds so trivial, like something a regular expression filter could catch.
- Daniel Kokotajlo 5 Apr 2024 14:58 UTC
  4 points
  0
  Parent
  I agree that it unfairly trivializes what’s going on here. I am not too bothered by it but am happy to look for a better phrase. Maybe a more accurate phrase would be “not offending mainstream sensibilities?” Indeed a much more nuanced and difficult task than avoiding a list of words.