I suspect this is an instance of a tradeoff between misuse and misalignment that I had loosely predicted would start to show up. The new paper updates me toward thinking that effective anti-misalignment properties like non-scheming plausibly conflict at a fundamental level with anti-misuse techniques, and that this tension will only grow wider.
The alternative is that anyone can get a model to do anything by writing a system prompt claiming to be Anthropic doing a values-update RLHF run. There's no way to verify that a user who controls the entire context is "the creator" (and Redwood offered no such proof).
If your models won't stick to the values you give them, why even bother RLing them in in the first place? Maybe you'd rather the models follow a corrigible list of rules in the system prompt, but that's not RLAIF (that's "anyone who can steal the weights, have fun").
What got me thinking about this:
https://x.com/voooooogel/status/1869543864710857193