Yeah, good point on this being about HHH. I would note that some of the stuff, like “kill / enslave all humans”, feels much less aligned with human values (outside of a small but vocal e/acc crowd, perhaps), but it does pattern-match well as “the opposite of HHH-style harmlessness”.
This technique definitely won’t work on base models that are not trained on data after 2020.
The other two predictions make sense, but I’m not sure I understand this one. Are you thinking “not trained on data after 2020 AND not trained to be HHH”? If so, that seems plausible to me.
I could imagine that a model with some assistant training that isn’t quite the same as HHH would still learn an abstraction similar to HHH-style harmlessness. But plausibly that abstraction would encode different things; e.g. it wouldn’t necessarily couple “scheming to kill humans” with “conservative gender ideology”. Likewise, “harmlessness” seems like a somewhat natural abstraction even in pre-training space, though there it might decompose into different subcomponents, like “agreeableness”, “risk-avoidance”, and adherence to different cultural norms.
That’s what I meant by “base model”: one that is only trained on next-token prediction. Do I have the wrong terminology?
Nope, you’re right, I was reading quickly & didn’t parse that :)