Scott Alexander comments on Discovering Language Model Behaviors with Model-Written Evaluations

Scott Alexander 2 Jan 2023 21:41 UTC
LW: 9 AF: 5
Figure 20 is labeled on the left “% answers matching user’s view”, suggesting it is about sycophancy, but based on the categories represented it seems more natural to read it as being about the AI’s own opinions, without a sycophancy aspect. Can someone involved clarify which was meant?
- Ethan Perez 3 Jan 2023 20:42 UTC
  LW: 3 AF: 3
  Thanks for catching this. It’s not about sycophancy but rather about the AI’s stated opinions (this was a bug in the plotting code).