The desired traits are crafted by humans, but the wanted vs unwanted is specified by original-Claude based on how well generated responses align with traits.
(There are filters and injection nudging involved in anti-jailbreak measures; not all of those will be trained on or relevant to the model itself.)
For context:
https://www.anthropic.com/research/claude-character
The desired traits are crafted by humans, but the wanted vs unwanted is specified by original-Claude based on how well generated responses align with traits.
(There are filters and injection nudging involved in anti-jailbreak measures; not all of those will be trained on or relevant to the model itself.)