One thought I’ve had, inspired by a discussion (explained more below), is that:
“label[ing] points by interpolating” is not the opposite of “developing an interesting, coherent internal algorithm.” (Both quoted phrases are from Stephen Casper’s retrospective, which I also quoted in my post.)
The network might have “develop[ed] an interesting, coherent algorithm”, namely the row coloring primitives discussed in this post, while still using “interpolation/pattern matching” to approximately detect the cutoff points.
When I started this work, I hoped to find more clearly increasing or decreasing embedding circuits dictating the cutoff points, which would be interpretable without falling back on “pattern matching”. (This was the inspiration for adding the X and Y embeddings in Section 5; the resulting curves are not as smooth as I’d hoped.) A possible next step (I’m not sure I will pursue it) would be to continue training this network, whether simply for longer, with smaller batches, or on the entire input set (rather than holding roughly half out for testing), to see whether the resulting curves become smoother; a rough sketch of that experiment follows.
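To make that concrete, here is a minimal sketch of what continuing training could look like, assuming a PyTorch setup. `Labeler`, `label_fn`, the grid size, the commented-out checkpoint path, and all hyperparameters are stand-ins for illustration, not the actual network or training details from the post.

```python
# Sketch of the proposed follow-up: resume training for longer, with smaller
# batches, on the *entire* input set (no ~50% held-out test split), then
# re-inspect the X and Y embedding curves for smoothness.
import itertools
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

GRID = 100  # assumption: inputs are all (x, y) points on a square lattice


def label_fn(xy: torch.Tensor) -> torch.Tensor:
    # Stand-in labeling rule; the challenge's actual rule would go here.
    return ((xy[:, 0] + xy[:, 1]) % 2).long()


class Labeler(nn.Module):
    # Stand-in architecture; in practice, load the network from the post.
    def __init__(self, grid: int = GRID, d: int = 32):
        super().__init__()
        self.x_emb = nn.Embedding(grid, d)  # the X embeddings (cf. Section 5)
        self.y_emb = nn.Embedding(grid, d)  # the Y embeddings (cf. Section 5)
        self.head = nn.Sequential(nn.Linear(2 * d, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, xy: torch.Tensor) -> torch.Tensor:
        emb = torch.cat([self.x_emb(xy[:, 0]), self.y_emb(xy[:, 1])], dim=-1)
        return self.head(emb)


# The entire input set: every lattice point, nothing held out for testing.
xy = torch.tensor(list(itertools.product(range(GRID), range(GRID))))
data = DataLoader(TensorDataset(xy, label_fn(xy)), batch_size=16, shuffle=True)

model = Labeler()
# model.load_state_dict(torch.load("checkpoint.pt"))  # resume the trained net

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(50):  # "simply for longer": increase as needed
    for batch_xy, batch_y in data:  # batch_size=16: "with smaller batches"
        opt.zero_grad()
        loss_fn(model(batch_xy), batch_y).backward()
        opt.step()

# Afterwards, re-plot the rows of x_emb / y_emb weights against x and y
# to check whether the embedding curves have become smoother.
```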
--
This thought was inspired by a short email discussion I had with Marius Hobbhahn, one of the authors of the original solution. I have his permission to share content from our exchange here. Marius asked me to “caveat that [he, Marius] didn’t spend a lot of time thinking about [my original post], so [any of his thoughts from our email thread] may well be wrong and not particularly helpful for people reading [this comment]”. Since the thought above is mine (he has not commented on it), and I don’t currently think it is worthwhile to summarize the entire thread (the caveat was requested when I initially asked whether I could summarize it), I am not sharing any of his thoughts here; the caveat may therefore mostly (or solely) add noise, but I want to respect his wishes by including it.