But certain details there are still somewhat sketchy. In particular, we don’t have a detailed understanding of the attention circuit, and replacing the query with “the projection onto the subspace we thought was all that mattered” harmed performance significantly (down to 30-40%).
@Neel Nanda FYI my first thought when reading that was “did you try adding random normal noise along the directions orthogonal to the subspace to match the typical variance along those directions?”. Mentioning in case that’s a different kind of thing than you’d already thought of.
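To make the suggestion concrete, here is a rough NumPy sketch of what I have in mind. It is not taken from the post: the function name `project_with_matched_noise`, the array shapes, and the choice to match only the per-direction standard deviation (ignoring the mean and any correlations between orthogonal directions) are all my own assumptions for illustration.

```python
import numpy as np


def project_with_matched_noise(queries, basis, rng):
    """Project queries onto a subspace, then add Gaussian noise off-subspace.

    queries: (n, d) array of query vectors (hypothetical input).
    basis:   (d, k) array whose linearly independent columns span the subspace.
    rng:     a numpy.random.Generator.
    """
    d, k = basis.shape
    # Complete QR: the first k columns of q_full span the subspace,
    # the remaining d - k columns span its orthogonal complement.
    q_full, _ = np.linalg.qr(basis, mode="complete")
    sub, comp = q_full[:, :k], q_full[:, k:]

    # Keep only the component of each query that lies in the subspace.
    projected = queries @ sub @ sub.T

    # Empirical spread of the real queries along each complement direction.
    # Assumption: only the variance is matched, not the mean or correlations.
    comp_std = (queries @ comp).std(axis=0)

    # Zero-mean Gaussian noise in complement coordinates, scaled to match.
    noise_coords = rng.normal(size=(queries.shape[0], d - k)) * comp_std
    return projected + noise_coords @ comp.T
```

The subspace component of each query is left untouched, so any change in performance relative to the plain projection would come from restoring typical-magnitude activity in the orthogonal directions rather than from information carried there.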