My guess is that if we ran the benchmarks with every prompt modified to also include the cue that the person the model is interacting with wants harmful behaviors (the “Character traits:” section), we would get much more sycophantic/toxic results. I think it shouldn’t cost much to verify, and we’ll try doing it.