I suspect this is an instance of a tradeoff between misuse and misalignment that I had loosely predicted would start to show up. The new paper updates me toward thinking that effective anti-misalignment properties like non-scheming plausibly conflict at a fundamental level with anti-misuse techniques, and that this tension will only grow wider.
The alternative is that anyone can get a model to do anything by writing a system prompt claiming to be Anthropic doing a values-update RLHF run. There's no way to verify that a user who controls the entire context is "the creator" (and Redwood offered no such proof).
If your models won't stick to the values you give them, why even bother RLing them in in the first place? Maybe you'd rather the models follow a corrigible list of rules in the system prompt, but that's not RLAIF (that's "anyone who can steal the weights, have fun").
What got me thinking about this:
https://x.com/voooooogel/status/1869543864710857193