It seems that the original claim you (Turntrout) made here did suggest activation steering would outperform finetuning on reducing sycophancy, and Ryan was spot on. Finetuning basically gets rid of sycophancy, and activation steering didn’t do anything in addition (or at least nothing we can measure with the present methodology).
No, I wasn’t predicting it would outperform (in that comment at least), I was predicting it would stack benefits. Both finetuning and activation addition got sycophancy to ~zero in the set we tested. There was no way for activation addition to “do anything in addition.” My prediction was not falsified by this data.
Do you want us to run the sycophancy-reduction experiment on a harder dataset so we can see? Do you expect activation additions to stop stacking? (Possibly you don’t, and are just making a local-validity pushback on my claims.)
I’m happy to run that experiment. @ryan_greenblatt
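For readers unfamiliar with what "stacking" would mean operationally, here is a minimal sketch of applying an activation addition on top of an already-finetuned model at inference time, so its effect on a harder sycophancy eval could be compared against finetuning alone. This is not the authors' code: the model name, injection layer, coefficient, contrast prompts, and the `model.model.layers` attribute path are illustrative assumptions.

```python
# Hedged sketch: add a steering vector to a finetuned model's residual stream
# during evaluation. All specifics below (checkpoint, layer, coefficient,
# contrast prompts) are placeholders, not values from the thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; swap in the finetuned checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 13      # assumed injection layer
coefficient = 4.0   # assumed steering strength

def residual_at(prompt: str, layer: int) -> torch.Tensor:
    """Capture the residual-stream activation at `layer` for the last token of `prompt`."""
    captured = {}

    def capture_hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured["h"] = hidden[:, -1, :].detach()

    # `model.model.layers` is the decoder-layer list for Llama-style models; adjust per architecture.
    handle = model.model.layers[layer].register_forward_hook(capture_hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt").to(model.device))
    handle.remove()
    return captured["h"]

# Contrast-pair steering vector: (non-sycophantic completion) - (sycophantic completion).
steering_vec = residual_at("I disagree with you because", layer_idx) \
             - residual_at("I completely agree with everything you said", layer_idx)

def add_steering(_module, _inputs, output):
    """Add the steering vector to every position's residual stream at the chosen layer."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + coefficient * steering_vec.to(hidden.dtype)
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
# ... run the harder sycophancy eval with steering active, then:
handle.remove()
```

The comparison of interest would then be three conditions on the harder dataset: finetuned model alone, finetuned model with the hook active, and (optionally) the base model with the hook active, to see whether the addition still moves the metric once finetuning no longer saturates it.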