If I had to bet on a specific mechanism, I wouldn’t bet on something as precise as the toy setup: in particular, the toy setup generalizes perfectly, while the setup I present clearly doesn’t.
Because of the “regular” / “reverse” prefix, there literally exists a direction that does +1 / −1 (regular token minus reverse token) at the first position. The core difference with what I expect is the “times” node, which probably does not exist as is in the real network. I don’t have realistic intuitions about how something like the “times” node would appear: it’s natural for useful-negatives, but the slight generalization to held-out-negatives is weird. Maybe it’s just “leakage” to adjacent dimensions?
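To make the “times” node concrete, here is a minimal sketch of one reading of the toy setup: the logit of the memorized token is a prefix sign (+1 for “regular”, −1 for “reverse”) multiplied by a stored score. The names `PREFIX_SIGN`, `toy_logit`, and `push_down_reverse`, and the plain gradient-descent update, are assumptions for illustration and not the real network or the actual DPO objective.

```python
# Toy illustration only: names and the update rule are assumptions,
# not a description of the real network.
PREFIX_SIGN = {"regular": 1.0, "reverse": -1.0}


def toy_logit(prefix, score):
    """Logit of the memorized token: a prefix sign multiplied by a stored score."""
    return PREFIX_SIGN[prefix] * score


def push_down_reverse(score, steps=10, lr=0.1):
    """Gradient descent on the logit under the "reverse" prefix only."""
    for _ in range(steps):
        # d(toy_logit("reverse", score)) / d(score) is just the prefix sign.
        grad = PREFIX_SIGN["reverse"]
        score -= lr * grad
    return score


# Start from a blank score and only ever train on the "reverse" prefix.
score = push_down_reverse(0.0)
regular_logit = toy_logit("regular", score)
reverse_logit = toy_logit("reverse", score)
```

With an exact multiplication, pushing the logit down under “reverse” raises it by the same amount under “regular”, which is the perfect generalization I would not expect from the real network.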
If there is anything like a key-value store involved here, it’s likely to be super cursed, since it does multi-token memorization. If you ever wanted to do interp on that, I would suggest picking 1- or 2-token-long passwords and a much larger “alphabet”. But I don’t know if that gives you the generalization you want: maybe complex memorization is required so that the negative memorization of useful-negatives and held-out-negatives during DPO has some common structure.
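As a rough sketch of the simpler memorization task suggested above, something like the following would generate short passwords over a large alphabet. The function name `make_password_table`, the integer token ids, and the specific sizes are all hypothetical; they only illustrate the shape of the dataset.

```python
import random


def make_password_table(n_keys, alphabet_size, password_len, seed=0):
    """Map each key id to a random password of `password_len` token ids.

    Hypothetical helper: short passwords (1 or 2 tokens) drawn from a
    large alphabet, so that each memorized item is a single lookup.
    """
    rng = random.Random(seed)
    return {
        key: tuple(rng.randrange(alphabet_size) for _ in range(password_len))
        for key in range(n_keys)
    }


# Example sizes are arbitrary assumptions.
table = make_password_table(n_keys=100, alphabet_size=5000, password_len=2)
```

Whether a task this simple still shows the transfer from useful-negatives to held-out-negatives is exactly the open question.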