Sure. I’m not familiar with how Claude is trained specifically, but it clearly has a mechanism to reward wanted outputs and penalize unwanted ones, with what counts as wanted vs. unwanted specified by a human (such a mechanism is used to get it to refuse jailbreaks, for example).
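The usual public version of such a mechanism is a reward model trained on human preference pairs. Here is a minimal sketch of the standard Bradley-Terry pairwise loss; whether Claude's pipeline uses exactly this is an assumption on my part, and all names and numbers are illustrative:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the scalar reward of the
    human-preferred ("wanted") response above the "unwanted" one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scores a reward model might assign to two candidate replies.
chosen = torch.tensor([1.3, 0.7])    # human-labeled "wanted" outputs
rejected = torch.tensor([0.2, 0.9])  # human-labeled "unwanted" outputs
print(reward_model_loss(chosen, rejected))  # shrinks as the reward gap widens
```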
I view the shoggoth’s goal as optimizing some weird mixture of “predict the reasonable next token here, according to the pretraining data” and “produce whatever will be rewarded in post-training”.
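One way to make that mixture concrete, loosely following InstructGPT's published "PPO-ptx" objective (whether Claude's post-training looks anything like this is an assumption, and the coefficients are made up):

```python
import torch

def mixed_objective(logprob_next_token: torch.Tensor,
                    reward: torch.Tensor,
                    kl_to_pretrained: torch.Tensor,
                    beta: float = 0.1,
                    gamma: float = 0.5) -> torch.Tensor:
    """A pretraining term (predict the next token) blended with a
    post-training term (chase reward, but stay close to the pretrained
    distribution via a KL penalty)."""
    pretrain_loss = -logprob_next_token.mean()            # next-token prediction
    rl_loss = -(reward - beta * kl_to_pretrained).mean()  # KL-regularized reward
    return gamma * pretrain_loss + (1 - gamma) * rl_loss
```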
For context:
https://www.anthropic.com/research/claude-character
The desired traits are crafted by humans, but wanted vs. unwanted is judged by original-Claude itself, based on how well generated responses align with those traits.
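In code, that RLAIF-style labeling loop might look something like the sketch below. The `judge` function is a hypothetical stand-in for a call to the prior model, the trait strings are paraphrased from the linked post, and nothing here is a real API:

```python
# Human-written traits; the judging itself is done by a model, not a human.
TRAITS = [
    "charitable to other points of view",
    "honest about uncertainty",
    "declines harmful requests without being preachy",
]

def judge(prompt: str, response: str, traits: list[str]) -> float:
    """Toy stand-in: naive keyword overlap with the trait list.
    A real RLAIF setup would query the prior model here instead."""
    words = set(response.lower().split())
    return float(sum(any(w in words for w in t.split()) for t in traits))

def preference_label(prompt: str, resp_a: str, resp_b: str) -> str:
    """Turn trait-alignment scores into a "wanted vs. unwanted" preference
    label, with no per-example human in the loop."""
    a, b = judge(prompt, resp_a, TRAITS), judge(prompt, resp_b, TRAITS)
    return "a" if a >= b else "b"
```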
(Anti-jailbreak measures also involve filters and injected nudging; not all of those are trained on, or even relevant to, the model itself.)