Someone should check whether we can use ML to predict the activations in one set of neurons from the activations in another set. The losses would give straightforward estimates of statistical quantities such as mutual information. Generating inputs that produce the same activations in a given set of neurons would illustrate what that set of neurons does. I might do this myself if nobody else does.
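A minimal sketch of the kind of probe I have in mind, assuming activations for the two neuron sets have already been collected (all names here are placeholders, not an existing codebase):

```python
# Sketch: predict the activations of neuron set B from neuron set A with a small MLP.
# The best achievable loss says how much of B is determined by A on this distribution.
import torch
import torch.nn as nn

def fit_activation_probe(acts_a, acts_b, hidden=512, steps=2000, lr=1e-3):
    """acts_a: [n_examples, d_a], acts_b: [n_examples, d_b] activation tensors."""
    probe = nn.Sequential(
        nn.Linear(acts_a.shape[1], hidden),
        nn.ReLU(),
        nn.Linear(hidden, acts_b.shape[1]),
    )
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        loss = nn.functional.mse_loss(probe(acts_a), acts_b)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe, loss.item()
```

Under a Gaussian noise model the regression loss plays the role of a cross-entropy, so comparing it against the loss of an unconditional predictor of B gives the mutual-information-style estimate.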
I’m not clear on what you’d do with the results of that exercise. Suppose that on a certain distribution of texts you can explain 40% of the variance in half of layer 7 using the other half of layer 7 (and the percentage gradually increases as you make the activation-predicting model bigger, so perhaps you guess it approaches 55% in the limit). What’s the upshot of models being that predictable rather than more or less so, or of having the actual predictor that you learned?
Given an input x, generating other inputs that “look the same as x” to part of the model but not to other parts seems like it reveals something about what that part of the model does. As a component of interpretability research, that seems pretty similar to feature visualization or to selecting input examples that activate a given neuron, and I’d guess it would fit into the overall project of interpretability in the same way.
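Concretely, one version of that would be to optimize a fresh input until the chosen part of the model can’t tell it apart from x (a sketch with placeholder names, not anyone’s actual setup):

```python
# Sketch: find an input whose activations on a chosen neuron subset match those of x.
import torch

def match_subset_activations(acts_of_subset, x, steps=500, lr=0.05):
    """acts_of_subset: function mapping an input tensor to the subset's activations."""
    target = acts_of_subset(x).detach()
    x_new = torch.randn_like(x, requires_grad=True)
    opt = torch.optim.Adam([x_new], lr=lr)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(acts_of_subset(x_new), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x_new.detach()
```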
I’d mostly be excited about people developing these techniques as part of a focused project to understand what models are thinking. I’m not really sure what to make of them in isolation.
I score such techniques on how surprised I am by how well they fit together, as with all good math. In this case my evidence is: my current approach is to thoroughly analyze quantities like mutual information for modularity only in the neighborhood of one input, since that is tractable with mere linear algebra; but an activation-predicting model is even less extra theory (since we were already working with neural nets), and its cross-entropy loss just happens to produce the same KL divergences I’m already trying to measure.
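To spell out the connection (standard information theory, nothing specific to this proposal): if $q(b \mid a)$ is the activation-predicting model and $p$ the true joint distribution over the two activation sets, then

$$
\mathbb{E}_{p(a,b)}\bigl[-\log q(b \mid a)\bigr] \;=\; H(B \mid A) \;+\; \mathbb{E}_{p(a)}\Bigl[D_{\mathrm{KL}}\bigl(p(\cdot \mid a)\,\big\|\,q(\cdot \mid a)\bigr)\Bigr] \;\ge\; H(B \mid A),
$$

so the best achievable cross-entropy estimates $H(B \mid A)$, and $I(A;B) = H(B) - H(B \mid A)$ follows once you estimate $H(B)$ the same way with an unconditional model.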
IIRC you study problem decomposition. Would your results say I’ll need the same magic natural-language tools that would assemble descriptions for every hierarchy node from descriptions of its children in order to construct the hierarchy in the first place? Do they say anything about how to continuously go between hierarchies as the model trains? Have you tried describing how well a hierarchy decomposes a problem by the extent to which “a: TA → A”, which maps a list of subsolutions to a solution, satisfies the square on that hierarchy?
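By “the square” I mean, roughly, the associativity law for a T-algebra, assuming T sends a problem type A to lists of A’s so that the hierarchy is an iterated application of T: combining sub-subsolutions into subsolutions and then into a solution should agree with flattening the hierarchy first and combining once,

$$
\begin{array}{ccc}
TTA & \xrightarrow{\;Ta\;} & TA \\
{\scriptstyle \mu_A}\big\downarrow & & \big\downarrow{\scriptstyle a} \\
TA & \xrightarrow{\;\;a\;\;} & A
\end{array}
$$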
If you can find two halves with little mutual information, you can understand one before having understood the other. I suspect that interpreting a model should be decomposed by hierarchically clustering neurons using such measurements. Since the measurement is differentiable, you can train a network for modularity to make this work better.
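For the clustering step, a sketch with placeholder names (the differentiable version would just add the probe losses as an extra term in the training objective):

```python
# Sketch: hierarchically cluster neuron groups using estimated mutual information,
# so that groups which predict each other well end up in the same subtree.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def cluster_by_mutual_information(mi_matrix):
    """mi_matrix[i, j]: estimated mutual information between neuron groups i and j."""
    dist = mi_matrix.max() - mi_matrix       # high mutual information -> small distance
    dist = (dist + dist.T) / 2               # symmetrize noisy estimates
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    return linkage(condensed, method="average")  # a dendrogram over the neuron groups
```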
It sure is similar to feature visualization! I prefer it because it doesn’t go out of distribution and doesn’t feel like it implicitly assumes that the model implements a linear function.
I agree that interpretability is the purpose and the cure.