Sam Marks comments on What’s up with LLMs representing XORs of arbitrary features?

Sam Marks 12 Jan 2024 19:07 UTC
LW: 4 AF: 3
Imo “true according to Alice” is nowhere near as “crazy” a feature as “has_true XOR has_banana”. It seems useful for the LLM to model what is true according to Alice! (Possibly I’m misunderstanding what you mean by “crazy” here.)
I agree with this! (And it’s what I was trying to say; sorry if I was unclear.) My point is that
{ features which are as crazy as “true according to Alice” (i.e., not too crazy) }
seems potentially manageable, whereas
{ features which are as crazy as arbitrary boolean functions of other features }
seems totally unmanageable.
Thanks, as always, for the thoughtful replies.