Lukas Finnveden comments on Paper: On measuring situational awareness in LLMs

Lukas Finnveden 6 Sep 2023 0:19 UTC
LW: 5 AF: 3
Cool paper!

I’d be keen to see more examples of the paraphrases, if you’re able to share, to get a sense of the kind of data that lets the model generalize out of context. (E.g., if it’d be easy to take all 300 paraphrases of some statement, ideally one where performance improved, paste them into a Google Doc, and share it. Or let me know if this is on GitHub somewhere.)

I’d also be interested in experiments to determine whether the benefit from paraphrases is mostly driven by raw diversity, or whether examples with certain specific features help a lot and those occasionally appear among the paraphrases. Curious if you have a prediction about that, or if you already ran some experiments that shed light on this. (I could have missed it even if it was in the paper.)
- Asa Cooper Stickland 6 Sep 2023 3:24 UTC
  LW: 7 AF: 4
  Here are some examples: https://github.com/AsaCooperStickland/situational-awareness-evals/blob/main/sitaevals/tasks/assistant/data/tasks/german/guidance.txt
  As to your other point, I would guess that repeating the “top k” instructions would work better than full diversity, for k of maybe 20 or more, but I’m definitely not very confident about that.
- Owain_Evans 6 Sep 2023 9:20 UTC
  LW: 4 AF: 2
  We didn’t investigate the specific question of whether it’s raw diversity or specific features. In the Grosse et al. paper on influence functions, they find that “high influence scores are relatively rare and they cover a large portion of the total influence”. This (vaguely) suggests that the top k paraphrases would do most of the work, which is what I would guess. That said, this is really something that should be investigated with more experiments.
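
One way the "top k repeated vs. full diversity" comparison discussed above could be set up is sketched below. This is only an illustration, not code from the paper or its repository: the function names, the `scores` input, and the choice to match dataset sizes by cycling are all assumptions. In particular, where the per-paraphrase scores would come from (influence estimates, per-paraphrase ablations, or something else) is left open.

```python
from itertools import cycle, islice


def top_k_repeated(paraphrases, scores, k, total):
    """Keep the k highest-scoring paraphrases and repeat them up to `total` items.

    `scores` is a hypothetical per-paraphrase usefulness score; how to obtain
    it is an open question and not specified by the thread.
    """
    if len(paraphrases) != len(scores):
        raise ValueError("need exactly one score per paraphrase")
    if not 1 <= k <= len(paraphrases):
        raise ValueError("k must be between 1 and the number of paraphrases")
    # Sort by score, highest first; ties keep their original order.
    ranked = sorted(zip(scores, paraphrases), key=lambda pair: pair[0], reverse=True)
    top = [text for _, text in ranked[:k]]
    # Cycle through the top k so both conditions have the same number of items.
    return list(islice(cycle(top), total))


def full_diversity(paraphrases, total):
    """Baseline condition: use every paraphrase, cycling only if total exceeds the pool."""
    return list(islice(cycle(paraphrases), total))
```

Fine-tuning on the two resulting datasets with the same `total` would hold the amount of data fixed, so any difference in out-of-context performance could be attributed to which paraphrases are included rather than how many training examples there are.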