I wonder why this happens?! I can’t find any examples of this in the pile, and their filtering doesn’t seem to add it. It’s hard to imagine humans generating this particular refusal response. Perhaps it’s a result of filtering, s
it’s quite common for assistants to refuse instructions, especially harmful instructions. so i’m not surprised that base llms systestemically refuse harmful instructions from than harmless ones.
Interesting, that changes my mind somewhat.
I wonder why this happens?! I can’t find any examples of this in the pile, and their filtering doesn’t seem to add it. It’s hard to imagine humans generating this particular refusal response. Perhaps it’s a result of filtering, s
it’s quite common for assistants to refuse instructions, especially harmful instructions. so i’m not surprised that base llms systestemically refuse harmful instructions from than harmless ones.
Indeed. The base LLM would likely predict a “henchman” to be a lot less scrupulous than an “assistant”.