Narrow note (and acknowledging you might not share or advance the views of any particular character): I’ve heard this several times (e.g. from Nate), but I think it undersells the amazing engineering miracle of models which generalize instruction-following and “not doing bad things” quite well. Amazingly well, from some priors.
If we say “RLHF teaches it to not say naughty words”, that sounds so trivial, like something a regular expression filter could catch.
I agree that it unfairly trivializes what’s going on here. I am not too bothered by it but am happy to look for a better phrase. Maybe a more accurate phrase would be “not offending mainstream sensibilities”? That is indeed a much more nuanced and difficult task than avoiding a list of words.