Nice toy circuit—interesting that it can describe both attention-head and MLP key-value stores. Do you have any intuition about what the model is doing mechanistically on the negatively reinforced text? E.g. is the toy model intended to suggest that there is some feature represented as +1 or −1 in the residual stream? If so, do you expect this to be a linear direction or something non-linear?
If I had to bet on a specific mechanism, I wouldn’t bet on something as precise as the toy setup: in particular, the toy setup generalizes perfectly, while the setup I describe clearly doesn’t.
Because of the “regular” / “reverse” prefix, there literally exists a direction which does +1 / −1 (regular token minus reverse token) at the first position. The core difference from what I expect is the “times” node, which probably does not exist as-is in the real network. I don’t have realistic intuitions about how something like the “times” node would appear—it’s natural for useful-negatives, but the slight generalization to held-out-negatives is weird. Maybe it’s just “leakage” to adjacent dimensions?
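
To make that concrete, here is a minimal numpy cartoon of the circuit I have in mind (not the post’s actual code; names like `sign_direction` and `toy_forward` are mine):

```python
import numpy as np

# Cartoon of the toy circuit: a "regular"/"reverse" prefix writes +1/-1 along
# one residual-stream direction, and a hypothetical "times" node multiplies
# the memorized logits by that sign, flipping them under "reverse".

rng = np.random.default_rng(0)
d_model, vocab = 16, 8

sign_direction = rng.normal(size=d_model)
sign_direction /= np.linalg.norm(sign_direction)  # unit-norm direction

# Key-value store: each memorized prompt maps to logits over the vocab.
memorized_logits = {"password_1": rng.normal(size=vocab)}

def toy_forward(prefix: str, key: str) -> np.ndarray:
    # The prefix writes +1 or -1 times the direction into the residual stream.
    s = 1.0 if prefix == "regular" else -1.0
    resid = s * sign_direction
    # The "times" node reads the sign back off the residual stream...
    sign = resid @ sign_direction  # +1 or -1, since the direction is unit-norm
    # ...and multiplies the retrieved value by it.
    return sign * memorized_logits[key]

print(np.argmax(toy_forward("regular", "password_1")))
print(np.argmax(toy_forward("reverse", "password_1")))  # logits are negated
```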
If there is anything like a key-value store involved here, it’s likely to be super cursed, since it does multi-token memorization. If you ever wanted to do interp on that, I would suggest picking 1- or 2-token-long passwords and a much longer “alphabet”. I don’t know if that gives you the generalization you want, though; maybe complex memorization is required so that the negative memorization of useful-negatives & held-out-negatives during DPO has some common structure.
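
The dataset change itself is cheap; something like the following hypothetical sketch (alphabet size and password lengths are made up):

```python
import random

# Hypothetical tweak: draw 1- or 2-token passwords from a large alphabet,
# instead of long passwords from a small one, so the memorization circuit
# stays simple enough to do interp on.

random.seed(0)
ALPHABET = [f"tok{i}" for i in range(1024)]  # much longer "alphabet"

def sample_password(max_len: int = 2) -> list[str]:
    length = random.randint(1, max_len)
    return random.sample(ALPHABET, k=length)

passwords = [sample_password() for _ in range(100)]
print(passwords[:3])
```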