Sure. I’m not familiar with how Claude is trained specifically, but it clearly has a mechanism to reward wanted outputs and penalize unwanted ones, with what counts as wanted vs. unwanted specified by a human (such a mechanism is used to get it to refuse jailbreaks, for example).
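The usual public version of such a mechanism is a reward model trained on human preference pairs. Here is a minimal sketch of the standard Bradley-Terry pairwise loss; whether Claude's pipeline uses exactly this is an assumption on my part, and all names and numbers are illustrative:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the scalar reward of the
    human-preferred ("wanted") response above the "unwanted" one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scores a reward model might assign to two candidate replies.
chosen = torch.tensor([1.3, 0.7])    # human-labeled "wanted" outputs
rejected = torch.tensor([0.2, 0.9])  # human-labeled "unwanted" outputs
print(reward_model_loss(chosen, rejected))  # shrinks as the reward gap widens
```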
I view the shoggoth’s goal as optimizing some weird mixture of “predict the reasonable next token here, according to the pretraining data” and “produce whatever will be rewarded in post-training”.
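One way to make that mixture concrete, loosely following InstructGPT's published "PPO-ptx" objective (whether Claude's post-training looks anything like this is an assumption, and the coefficients are made up):

```python
import torch

def mixed_objective(logprob_next_token: torch.Tensor,
                    reward: torch.Tensor,
                    kl_to_pretrained: torch.Tensor,
                    beta: float = 0.1,
                    gamma: float = 0.5) -> torch.Tensor:
    """A pretraining term (predict the next token) blended with a
    post-training term (chase reward, but stay close to the pretrained
    distribution via a KL penalty)."""
    pretrain_loss = -logprob_next_token.mean()            # next-token prediction
    rl_loss = -(reward - beta * kl_to_pretrained).mean()  # KL-regularized reward
    return gamma * pretrain_loss + (1 - gamma) * rl_loss
```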
For context:
https://www.anthropic.com/research/claude-character
The desired traits are crafted by humans, but wanted vs. unwanted is judged by original-Claude itself, based on how well generated responses align with those traits.
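In code, that RLAIF-style labeling loop might look something like the sketch below. The `judge` function is a hypothetical stand-in for a call to the prior model, the trait strings are paraphrased from the linked post, and nothing here is a real API:

```python
# Human-written traits; the judging itself is done by a model, not a human.
TRAITS = [
    "charitable to other points of view",
    "honest about uncertainty",
    "declines harmful requests without being preachy",
]

def judge(prompt: str, response: str, traits: list[str]) -> float:
    """Toy stand-in: naive keyword overlap with the trait list.
    A real RLAIF setup would query the prior model here instead."""
    words = set(response.lower().split())
    return float(sum(any(w in words for w in t.split()) for t in traits))

def preference_label(prompt: str, resp_a: str, resp_b: str) -> str:
    """Turn trait-alignment scores into a "wanted vs. unwanted" preference
    label, with no per-example human in the loop."""
    a, b = judge(prompt, resp_a, TRAITS), judge(prompt, resp_b, TRAITS)
    return "a" if a >= b else "b"
```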
(Anti-jailbreak measures also involve filters and injected nudging; not all of those are trained on, or even relevant to, the model itself.)