Sam Marks comments on What’s up with LLMs representing XORs of arbitrary features?

Sam Marks 4 Jan 2024 0:58 UTC
LW: 7 AF: 4
Idk, I think I would guess that all of the most salient features will be things related to the meaning of the statement at a more basic level. E.g. things like: the statement is finished (i.e. isn’t an ongoing sentence), the statement is in English, the statement ends in a word which is the name of a country, etc.
My intuition here is mostly based on looking at lots of max-activating dataset examples for SAE features in smaller models (many of which relate to basic semantic categories for words or to basic syntax), so it could be off here (both because of model size and because the period token might carry more meta-level “summarized” information about the preceding statement).
Anyway, not really a crux, I would agree with you for some not-too-much-larger value of 50.