johnswentworth comments on Symbol/Referent Confusions in Language Model Alignment Experiments

johnswentworth 29 Oct 2023 2:21 UTC
3 points
0
Yup, I basically agree with that. I’d put it under “relevant to more powerful systems”, as I doubt that current systems are smart enough to figure all that out. The caveat is that this sort of reasoning is one of the main things we’re interested in for safety purposes, so a test which doesn’t account for it is pretty uninformative for most safety purposes.
(Same with the tests suggested in the OP—they’d at least measure the basic thing which the Strawman family was trying to measure, but they’re still not particularly relevant to safety purposes.)
- Oliver Sourbut 29 Oct 2023 8:51 UTC
  3 points
  0
  Parent
  I assumed this would match your take. Haha, my ‘in case it matters at all’ is terrible wording, by the way. I meant something like ‘in case the non-preregistering of this type of concern in this context ends up mattering in a later conversation’ (which seems unlikely, but nonzero).