Suppose $a\wedge b$ has a natural interpretation as a feature that the model would want to track and do downstream computation with; e.g., if $a$ = “first name is Michael” and $b$ = “last name is Jordan”, then $a\wedge b$ can be naturally interpreted as “is Michael Jordan”. In this case, it wouldn’t be surprising if the model computed this AND as $f(x)=\mathrm{ReLU}((v_a+v_b)\cdot x+b_\wedge)$ and stored the result along some direction $v_f$ independent of $v_a$ and $v_b$. Assuming the model has done this, we could then linearly extract $a\oplus b$ with the probe $\sigma(-(\alpha v_f+v_a+v_b)\cdot x+b_\oplus)$ for some $\alpha>1$.
Should the $-$ be inside the inner parentheses, like $\sigma((-\alpha v_f+v_a+v_b)\cdot x+b_\oplus)$ for $\alpha>1$?
In the original equation, if $a$ and $b$ are both present in $x$, the vectors $v_a$, $v_b$, and $v_f$ would all contribute a positive inner product with $x$, assuming $\alpha>1$. However, for XOR we want the $v_a$ and $v_b$ inner products to oppose the $v_f$ inner product, so that the sign inside the sigmoid flips in the $a\wedge b$ case, right?
Yes, you are correct, thanks. I’ll edit the post when I get a chance.
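For concreteness, here is a minimal numerical sketch (not from the original post) of the corrected probe. It assumes orthonormal toy directions standing in for $v_a$, $v_b$, $v_f$, unit-strength feature activations, and illustrative parameter choices $b_\wedge=-1.5$, $\alpha=4$, $b_\oplus=-0.5$; the helper names (`activation`, `xor_probe`) are hypothetical. It checks that $\sigma((v_a+v_b-\alpha v_f)\cdot x+b_\oplus)$ reads out $a\oplus b$ in all four cases.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy setup: three orthonormal directions standing in for v_a, v_b, v_f.
d = 8
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(d, 3)))
v_a, v_b, v_f = q.T

def activation(a: bool, b: bool) -> np.ndarray:
    """Toy residual stream: raw features a and b, plus the AND feature stored along v_f."""
    x = a * v_a + b * v_b
    f = max(0.0, np.dot(v_a + v_b, x) - 1.5)  # f(x) = ReLU((v_a+v_b)·x + b_and), with b_and = -1.5
    return x + f * v_f

# Corrected XOR probe: the minus sign applies only to the alpha*v_f term.
alpha, b_xor = 4.0, -0.5
def xor_probe(x: np.ndarray) -> float:
    return sigmoid(np.dot(v_a + v_b - alpha * v_f, x) + b_xor)

for a in (False, True):
    for b in (False, True):
        p = xor_probe(activation(a, b))
        print(f"a={a}, b={b}: probe={p:.3f}, predicted XOR={p > 0.5}, true XOR={a != b}")
```

With these illustrative values the probe output exceeds 0.5 exactly in the two $a\oplus b$ cases, whereas applying the minus to the whole sum makes the pre-activation decrease monotonically in the number of active features, so the “exactly one” case can never be the uniquely positive one.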