Owain_Evans comments on Paper: On measuring situational awareness in LLMs

Owain_Evans 6 Sep 2023 9:20 UTC
LW: 4 AF: 2
We didn’t investigate the specific question of whether it’s raw diversity or specific features. In the Grosse et al. paper on influence functions, they find that “high influence scores are relatively rare and they cover a large portion of the total influence”. This (vaguely) suggests that the top k paraphrases would do most of the work, which is what I would guess. That said, this is really something that should be investigated with more experiments.
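
To make the "top k" guess concrete, here is a minimal sketch of how one might measure it, assuming some per-paraphrase influence score were already available. The function name `top_k_share` and the scores below are hypothetical and made up for illustration; they are not data from either paper, and this is not the influence-function method itself, only the bookkeeping on top of it.

```python
def top_k_share(scores, k):
    """Fraction of total absolute influence carried by the k largest scores."""
    magnitudes = sorted((abs(s) for s in scores), reverse=True)
    total = sum(magnitudes)
    if total == 0:
        return 0.0  # no influence at all, so no share to report
    return sum(magnitudes[:k]) / total


# Hypothetical, made-up scores for 10 paraphrases (illustration only).
hypothetical_scores = [8.0, 5.0, 3.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0]

share = top_k_share(hypothetical_scores, k=3)
print(f"top-3 share: {share:.2f}")
```

A share close to 1 for small k would be consistent with a few paraphrases doing most of the work; a share close to k/n would point more towards raw diversity.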