Gurkenglas comments on Comments on OpenPhil’s Interpretability RFP

Gurkenglas 9 Nov 2021 11:45 UTC
2 points
I’m not really sure what to make of them in isolation.
I score such techniques by how surprised I am at how well they fit together, as with all good math. In this case my evidence is this: my current approach is to thoroughly analyze the likes of mutual information for modularity only in the neighborhood of one input, since that is tractable with mere linear algebra. But an activation-predicting model requires even less extra theory (since we were already working with neural nets), and through its cross-entropy loss it just happens to produce the same KL divergences I’m already trying to measure.
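The link between cross-entropy loss and KL divergence is the standard identity H(p, q) = H(p) + KL(p ‖ q): for fixed p, a predictor's cross-entropy exceeds the entropy of p by exactly the KL divergence. A minimal sketch of that identity follows; the two distributions are hypothetical stand-ins for a discretized activation and a predictor's output, not anything from an actual model.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats; this is what a cross-entropy loss estimates."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions (assumptions for illustration only).
p = [0.7, 0.2, 0.1]  # "true" distribution over a discretized activation
q = [0.5, 0.3, 0.2]  # distribution output by an activation-predicting model

# The excess of the loss over the entropy of p is the KL divergence.
gap = cross_entropy(p, q) - entropy(p)
print(math.isclose(gap, kl_divergence(p, q)))
```

Since H(p) does not depend on the predictor, minimizing the cross-entropy loss over q is the same as minimizing KL(p ‖ q).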
IIRC you study problem decomposition. Would your results say I’ll need the same magic natural language tools that would assemble descriptions for every hierarchy node from descriptions of its children in order to construct the hierarchy in the first place? Do they say anything about how to move continuously between hierarchies as the model trains? Have you tried describing how well a hierarchy decomposes a problem by the extent to which the map “a: TA → A”, which sends a list of subsolutions to a solution, satisfies the square on that hierarchy?
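One way to make "the extent to which a satisfies the square" concrete is sketched below. It assumes the square meant is the associativity square of an algebra for the list monad T, that is, a ∘ T(a) = a ∘ μ on TTA, where μ flattens a list of lists. That reading, the helper `square_gap`, and the numeric toy "solutions" are all assumptions for illustration.

```python
from itertools import chain

def square_gap(a, nested):
    """Distance between the two paths around the assumed algebra square.

    nested is a list of lists of subsolutions (an element of TTA).
    Path 1: combine each inner list with a, then combine the results (a . T(a)).
    Path 2: flatten first, then combine once (a . mu).
    A gap of 0 means the square commutes on this input.
    """
    via_map = a([a(inner) for inner in nested])
    via_flatten = a(list(chain.from_iterable(nested)))
    return abs(via_map - via_flatten)

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical two-level hierarchy of numeric subsolutions.
nested = [[1, 2], [3], [4, 5, 6]]

print(square_gap(sum, nested))   # summing commutes with flattening
print(square_gap(mean, nested))  # averaging does not, since group sizes differ
```

Under this reading, averaging the gap over many sampled inputs would give a rough score for how well a hierarchy decomposes the problem with respect to a given combiner a.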