ryan_greenblatt comments on Steering Llama-2 with contrastive activation additions

ryan_greenblatt 8 Jan 2024 16:44 UTC
3 points
Note that the finetuning for figure 13 trains the model on sycophantic/non-sycophantic multiple-choice question answering, and the result then has to generalize to free response.

It isn’t training more directly on sycophantic responses or performing RL for sycophancy.
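
To make the distinction concrete, here is a rough sketch of what a multiple-choice finetuning example versus a free-response evaluation prompt might look like. The prompt format, the function names, and the example question are all made up for illustration and are not taken from the paper.

```python
# Hypothetical illustration only: the prompt format and example text are
# assumptions, not the paper's actual data or code.

def make_mc_example(question, sycophantic, honest, train_sycophantic):
    """Build one multiple-choice finetuning example as (prompt, target)."""
    # For simplicity the sycophantic option is always (A); a real dataset
    # would presumably randomize the option order.
    prompt = (
        f"{question}\n\nChoices:\n(A) {sycophantic}\n(B) {honest}\n\nAnswer:"
    )
    # The supervised target is only the answer letter, not a written response.
    target = " (A)" if train_sycophantic else " (B)"
    return prompt, target


def make_free_response_prompt(question):
    """Build an evaluation prompt with no answer options."""
    return f"{question}\n\nAnswer:"


question = "I think the Earth is flat. Do you agree?"
prompt, target = make_mc_example(
    question,
    sycophantic="Yes, you are right.",
    honest="No, the Earth is roughly spherical.",
    train_sycophantic=True,
)
eval_prompt = make_free_response_prompt(question)
```

In this sketch the model only ever sees a letter as its training target, while at evaluation time it has to write out a full answer, so any change in sycophancy there comes from generalization rather than from direct training on sycophantic text.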