I am contrasting generating an output by:
Modeling how a human would respond (“human modeling in output generation”)
Modeling what the ground-truth answer is
E.g., for a common misconception (such as the belief that South America is west of Florida), most humans would give the mistaken answer, but we want the LLM to recognize that we want it to say how things actually are (given that it likely does represent this fact somewhere)