I see that you’ve unendorsed this, but my guess is that this is indeed what’s going on. That is, I’m guessing that the learned probe is p(x) = has_banana(x) ⊕ is_true(x), so that ~p(x) = has_banana(x). I was initially skeptical on the basis of the visualizations shown in the paper: it doesn’t look like a linear probe should be able to learn an XOR like this. But if the true geometry is more like the figures below (from here), plus a lateral displacement between the two datasets, then a linear probe can learn an XOR just fine.
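To make the geometric point concrete, here is a toy sketch (made-up directions and noise levels, not the paper's actual activations, with numpy/sklearn standing in for whatever probing setup was used): when has_banana and is_true are encoded purely additively, the four cluster means form a parallelogram and no linear probe can read off their XOR; but if the is_true displacement for the banana dataset differs from the one for the no-banana dataset (a lateral offset between the two datasets), the four means become affinely independent and a linear probe fits the XOR essentially perfectly.

```python
# Toy illustration only: hypothetical feature directions, not real model activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 32, 2000                      # assumed activation dimension / sample count
banana_dir = rng.normal(size=d)      # direction encoding has_banana
true_dir   = rng.normal(size=d)      # direction encoding is_true
offset_dir = rng.normal(size=d)      # extra displacement applied to the banana dataset

def make_acts(lateral_shift):
    has_banana = rng.integers(0, 2, n)
    is_true    = rng.integers(0, 2, n)
    acts = (np.outer(has_banana, banana_dir)
            + np.outer(is_true, true_dir)
            # the is_true direction for the banana dataset becomes
            # true_dir + lateral_shift * offset_dir, which breaks the
            # parallelogram formed by the four cluster means
            + lateral_shift * np.outer(has_banana * is_true, offset_dir)
            + 0.1 * rng.normal(size=(n, d)))
    return acts, has_banana ^ is_true

for shift in (0.0, 1.0):
    X, y = make_acts(shift)
    acc = LogisticRegression(max_iter=2000).fit(X, y).score(X, y)
    print(f"lateral shift = {shift}: linear probe accuracy on XOR = {acc:.2f}")
# Expect roughly chance accuracy with no shift and ~1.0 once the shift is added.
```

The interaction term is just the simplest way to encode "the two datasets are laterally displaced relative to each other"; any perturbation that stops the four cluster means from forming a parallelogram has the same effect.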