ModalWaves comments on Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

ModalWaves 19 Mar 2025 23:09 UTC
2 points
Great critical awareness for evaluation research; I really endorse the CoT investigation. I’ve recently begun my own cross-model (Claude 3.7, GPT-4.5, R1, Grok 3) qualitative research project on sycophancy and people-pleasing behaviors. The preliminary findings may point to an RLHF-induced response strategy: a kind of meta-awareness of perceived evaluation contexts, which seems to parallel your own findings! Following…