I think the problem might be that you’ve given this definition of heuristic:
A heuristic is a local, interpretable, and simple function (e.g., boolean/arithmetic/lookup functions) learned from the training data. There are multiple heuristics in each layer and their outputs are used in later layers.
Taking this definition seriously, it’s easy to decompose a forward pass into such functions.
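To make that concrete, here is a toy sketch (entirely illustrative, not your method): under the loose definition, every individual neuron already qualifies as a "local, simple arithmetic function learned from the training data," so any forward pass decomposes trivially.

```python
import numpy as np

# Toy 2-layer ReLU network. Each "heuristic" is a single neuron:
# a weighted sum followed by a ReLU -- a local, simple arithmetic
# function whose output feeds later layers. This satisfies the
# loose definition for *any* network, which is the point.

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))  # layer 1: three "heuristics"
W2 = rng.normal(size=(1, 3))  # layer 2: one "heuristic"

def heuristic(w):
    # One neuron = one "heuristic" under the loose definition.
    return lambda x: max(0.0, float(w @ x))

layer1 = [heuristic(w) for w in W1]
layer2 = [heuristic(w) for w in W2]

def forward(x):
    h = np.array([f(x) for f in layer1])  # heuristic outputs feed upward
    return layer2[0](h)

# The decomposed pass agrees with the standard matrix computation:
x = np.array([1.0, -0.5])
ref = float(np.maximum(0, W2 @ np.maximum(0, W1 @ x))[0])
assert abs(forward(x) - ref) < 1e-9
```

The decomposition places no constraint on fan-in, fan-out, or information flow, which is exactly why the definition alone makes no predictions.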
But you have a much more detailed idea of a heuristic in mind. You’ve pointed toward some properties this might have in your point (2), but haven’t put it into specific words.
Some options:
- A single heuristic is causally dependent on <5 heuristics below and influences <5 heuristics above.
- The inputs and outputs of heuristics are strong information bottlenecks with a limit of 30 bits.
- The function of a heuristic can be understood without reference to >4 other heuristics in the same layer.
- A single heuristic is used in <5 different ways across the data distribution.
- A model is made up of <50 layers of heuristics.
- Large arrays of parallel heuristics often output information of the same type.
Some combination of these (or similar properties) would turn the heuristics intuition into a real hypothesis capable of making predictions.
If you don’t go into this level of detail, it’s easy to trick yourself into thinking that (2) more or less follows from your definition of heuristics, when it really doesn’t. And that will lead you to never discover the value of the heuristics intuition if it is true, and never reject it if it is false.
I agree that if you put more limitations on what heuristics are and how they compose, you end up with a stronger hypothesis. I think it’s probably better to leave that out and try to do some more empirical work before making a claim there, though (I suppose you could say that the hypothesis isn’t actually making a lot of concrete predictions yet at this stage).
I don’t think (2) necessarily follows, but I do sympathize with your point that the post is perhaps a more specific version of the hypothesis that “we can understand neural network computation by doing mech interp.”