Suppose $a\wedge b$ has a natural interpretation as a feature that the model would want to track and do downstream computation with; e.g., if $a$ = “first name is Michael” and $b$ = “last name is Jordan”, then $a\wedge b$ can be naturally interpreted as “is Michael Jordan”. In this case, it wouldn’t be surprising if the model computed this AND as $f(x)=\mathrm{ReLU}((v_a+v_b)\cdot x+b_\wedge)$ and stored the result along some direction $v_f$ independent of $v_a$ and $v_b$. Assuming the model has done this, we could then linearly extract $a\oplus b$ with the probe $\sigma(-(\alpha v_f+v_a+v_b)\cdot x+b_\oplus)$ for some $\alpha>1$.
Should the $-$ be inside the inner parentheses, like $\sigma((-\alpha v_f+v_a+v_b)\cdot x+b_\oplus)$ for $\alpha>1$?
In the original equation, if $a$ and $b$ are both present in $x$, the vectors $v_a$, $v_b$, and $v_f$ would all contribute a positive inner product with $x$, assuming $\alpha>1$. However, for XOR we want the $v_a$ and $v_b$ inner products to oppose the $v_f$ inner product, so that the sign inside the sigmoid flips in the $a\wedge b$ case, right?
Yes, you are correct, thanks. I’ll edit the post when I get a chance.
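For concreteness, here is a minimal numerical sketch (not from the original post) of the corrected probe. It assumes orthonormal toy directions standing in for $v_a$, $v_b$, $v_f$, unit-strength feature activations, and illustrative parameter choices $b_\wedge=-1.5$, $\alpha=4$, $b_\oplus=-0.5$; the helper names (`activation`, `xor_probe`) are hypothetical. It checks that $\sigma((v_a+v_b-\alpha v_f)\cdot x+b_\oplus)$ reads out $a\oplus b$ in all four cases.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy setup: three orthonormal directions standing in for v_a, v_b, v_f.
d = 8
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(d, 3)))
v_a, v_b, v_f = q.T

def activation(a: bool, b: bool) -> np.ndarray:
    """Toy residual stream: raw features a and b, plus the AND feature stored along v_f."""
    x = a * v_a + b * v_b
    f = max(0.0, np.dot(v_a + v_b, x) - 1.5)  # f(x) = ReLU((v_a+v_b)·x + b_and), with b_and = -1.5
    return x + f * v_f

# Corrected XOR probe: the minus sign applies only to the alpha*v_f term.
alpha, b_xor = 4.0, -0.5
def xor_probe(x: np.ndarray) -> float:
    return sigmoid(np.dot(v_a + v_b - alpha * v_f, x) + b_xor)

for a in (False, True):
    for b in (False, True):
        p = xor_probe(activation(a, b))
        print(f"a={a}, b={b}: probe={p:.3f}, predicted XOR={p > 0.5}, true XOR={a != b}")
```

With these illustrative values the probe output exceeds 0.5 exactly in the two $a\oplus b$ cases, whereas applying the minus to the whole sum makes the pre-activation decrease monotonically in the number of active features, so the “exactly one” case can never be the uniquely positive one.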