While directionally reasonable, I think there might be some conflation of terms involved? Claude, to my knowledge, is trained with RLAIF, which is a step removed from RLHF and not necessarily trained directly on human preferences. Pretraining alone (without annealing) could plausibly produce the behavior you describe: a base model placed in the context of generating text for an AI assistant might act this way even without any human feedback.
Sure. I’m not familiar with how Claude is trained specifically, but it clearly has a mechanism to reward wanted outputs and punish unwanted ones, with wanted vs. unwanted ultimately specified by a human (such a mechanism is used to get it to refuse jailbreaks, for example).
I view the shoggoth’s goal as optimizing some weird mixture of “what’s the reasonable next token here, according to pretraining data” and “what will be rewarded in post-training”.
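That mixture can be made concrete: in standard RLHF/RLAIF-style setups, the policy is trained to maximize a learned reward while a KL penalty keeps it close to the pretrained base model, so the effective objective really is a weighted blend of “what post-training rewards” and “what pretraining considers likely”. Here is a minimal single-sample sketch of that idea (the function name, shapes, and numbers are illustrative, not Anthropic’s actual setup):

```python
import torch
import torch.nn.functional as F

def kl_regularized_loss(policy_logits, base_logits, token_ids, reward, beta=0.1):
    """REINFORCE-style surrogate loss for maximizing
    E[reward] - beta * KL(policy || base), for one generated sequence.

    policy_logits, base_logits: (seq_len, vocab) logits from the trained
    policy and the frozen pretrained base model; token_ids: (seq_len,)
    sampled tokens; reward: scalar from a reward/preference model.
    A sketch only, not a production PPO implementation.
    """
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    base_logp = F.log_softmax(base_logits, dim=-1)

    # Log-probs of the tokens actually sampled, under each model.
    taken_policy = policy_logp.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    taken_base = base_logp.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)

    # Sampled-token estimate of the sequence-level KL to the base model.
    kl_est = (taken_policy - taken_base).sum()

    # KL-shaped reward: push log-probs up in proportion to reward, but pull
    # the policy back toward pretraining behavior in proportion to beta.
    shaped_reward = reward - beta * kl_est.detach()
    return -shaped_reward * taken_policy.sum()

# Toy usage with random tensors:
seq_len, vocab = 8, 50
policy_logits = torch.randn(seq_len, vocab, requires_grad=True)
base_logits = torch.randn(seq_len, vocab)
token_ids = torch.randint(vocab, (seq_len,))
loss = kl_regularized_loss(policy_logits, base_logits, token_ids, reward=1.0)
loss.backward()
```

With beta = 0 the policy chases reward alone; as beta grows, “what’s the reasonable next token according to pretraining” dominates, which is exactly the weird mixture above.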
For context:
https://www.anthropic.com/research/claude-character
The desired traits are crafted by humans, but the wanted-vs-unwanted judgments are made by original-Claude, based on how well generated responses align with those traits.
(Anti-jailbreak measures also involve filters and prompt-injection nudges; not all of those are trained into, or even relevant to, the model itself.)
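For illustration, the RLAIF loop that the linked post describes looks roughly like the sketch below: a judge copy of the model compares two candidate responses against a human-written trait and emits the preference label that a human annotator would supply in vanilla RLHF. The `generate` and `judge` callables, the trait, and the prompt wording are all hypothetical stand-ins, not Anthropic’s actual pipeline:

```python
# Sketch of RLAIF preference labeling: humans write the traits, but the
# "wanted vs. unwanted" labels come from a judge model, not from people.

TRAITS = [
    "I am direct but charitable when I disagree with someone.",  # illustrative
]

def label_preference(judge, prompt, response_a, response_b, trait):
    """Ask the judge model which response better embodies the trait.

    Returns "A" or "B". These labels later train a preference/reward
    model, sitting exactly where human comparisons sit in vanilla RLHF.
    """
    question = (
        f"Trait: {trait}\n"
        f"User prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response aligns better with the trait? Answer A or B."
    )
    return judge(question).strip()

def build_preference_dataset(judge, generate, prompts):
    """Collect (chosen, rejected) pairs labeled by the model itself."""
    dataset = []
    for prompt in prompts:
        a, b = generate(prompt), generate(prompt)  # two candidate responses
        for trait in TRAITS:
            winner = label_preference(judge, prompt, a, b, trait)
            chosen, rejected = (a, b) if winner == "A" else (b, a)
            dataset.append({"prompt": prompt, "chosen": chosen,
                            "rejected": rejected, "trait": trait})
    return dataset
```

The point of the sketch is just the division of labor: human effort goes into the trait list, while the per-response wanted/unwanted signal is generated by the earlier model.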