I see that you’ve unendorsed this, but my guess is that this is indeed what’s going on. That is, I’m guessing that the learned probe is p(x) = has_banana(x) ⊕ is_true(x), so that ~p(x) = has_banana(x). I was initially skeptical on the basis of the visualizations shown in the paper: it doesn’t look like a linear probe should be able to learn an XOR like this. But if the true geometry is more like the figures below (from here), plus a lateral displacement between the two datasets, then a linear probe can learn an XOR just fine.
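To make the geometric point concrete, here is a toy sketch (made-up directions and noise levels, not the paper's actual activations, with numpy/sklearn standing in for whatever probing setup was used): when has_banana and is_true are encoded purely additively, the four cluster means form a parallelogram and no linear probe can read off their XOR; but if the is_true displacement for the banana dataset differs from the one for the no-banana dataset (a lateral offset between the two datasets), the four means become affinely independent and a linear probe fits the XOR essentially perfectly.

```python
# Toy illustration only: hypothetical feature directions, not real model activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 32, 2000                      # assumed activation dimension / sample count
banana_dir = rng.normal(size=d)      # direction encoding has_banana
true_dir   = rng.normal(size=d)      # direction encoding is_true
offset_dir = rng.normal(size=d)      # extra displacement applied to the banana dataset

def make_acts(lateral_shift):
    has_banana = rng.integers(0, 2, n)
    is_true    = rng.integers(0, 2, n)
    acts = (np.outer(has_banana, banana_dir)
            + np.outer(is_true, true_dir)
            # the is_true direction for the banana dataset becomes
            # true_dir + lateral_shift * offset_dir, which breaks the
            # parallelogram formed by the four cluster means
            + lateral_shift * np.outer(has_banana * is_true, offset_dir)
            + 0.1 * rng.normal(size=(n, d)))
    return acts, has_banana ^ is_true

for shift in (0.0, 1.0):
    X, y = make_acts(shift)
    acc = LogisticRegression(max_iter=2000).fit(X, y).score(X, y)
    print(f"lateral shift = {shift}: linear probe accuracy on XOR = {acc:.2f}")
# Expect roughly chance accuracy with no shift and ~1.0 once the shift is added.
```

The interaction term is just the simplest way to encode "the two datasets are laterally displaced relative to each other"; any perturbation that stops the four cluster means from forming a parallelogram has the same effect.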