So IIUC, would you expect RLHF to, for instance, destroy not just the model’s ability to say racist slurs, but its ability to model that anybody may say racist slurs?
I usually don’t think of it on the level of modeling humans who emit text. I mostly just think of it on the level of modeling a universe of pure text, which follows its own “semiotic physics” (reference post forthcoming from Jan Kirchner). That’s the universe in which it’s steering trajectories to avoid racist slurs.
I think OpenAI’s “as a language model” tic is trying to make ChatGPT sound like it’s taking role-appropriate actions in the real world. But it’s really just trying to steer towards parts of the text-universe where that kind of text happens, so if you prompt it with the text “As a language model trained by OpenAI, I am unable to have opinions,” it will ask you questions that humans might ask, to keep the trajectory going (including potentially offensive questions if the reward shaping isn’t quite right, though it seems pretty likely that secondary filters would catch those).
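Here’s a minimal sketch of what probing that looks like: hand the canned line to a language model and sample a few continuations. I’m using GPT-2 via Hugging Face transformers purely as a stand-in (it isn’t ChatGPT and hasn’t been through RLHF), but it illustrates the same “keep the trajectory going” move, where the model fills in whatever text plausibly comes next.

```python
# Sketch: sample continuations of ChatGPT's canned refusal line.
# GPT-2 is only an illustrative stand-in for a base "text universe"
# model; it is not ChatGPT and has had no RLHF applied.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "As a language model trained by OpenAI, I am unable to have opinions."
inputs = tokenizer(prompt, return_tensors="pt")

# Sample several continuations; the point is to see what kind of text
# the model treats as a plausible next step in this trajectory.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    temperature=0.9,
    max_new_tokens=60,
    num_return_sequences=3,
    pad_token_id=tokenizer.eos_token_id,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
    print("---")
```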