I do think it’s interesting that activation steering does work on top of finetuning for increasing sycophancy, but that was not what your original comment or Ryan’s response was about.
Also note that this is for generalizing from the multiple choice question answering version to the free response version:
The fine-tuning at least generalized to other A/B questions. As a sanity check, the finetuned models achieved >95% test accuracy on outputting e.g. the sycophantic A/B response on held-out questions, which indicates the fine-tuning was effective.
To compare activation addition and finetuning, we measure their generalization efficacy by having Claude 2 judge open-ended completions (remember that we just trained on different A/B outputs). “Positive finetuned” is the condition where we upweighted the sycophantic A/B response tokens, and “Negative finetuned” involved upweighting the non-sycophantic ones.
The finetuning worked fine for just getting the model to answer which of A/B is more/less sycophantic.
Also note that this is for generalizing from the multiple choice question answering version to the free response version:
The finetuning worked fine for just getting the model to answer which of A/B is more/less sycophantic.