Let’s say your causal model looks something like this:
What causes you to specifically call out “sunblessings” as the “correct” upstream node in the world model of why you take your friend to dinner, as opposed to “fossil fuels” or “the big bang” or “human civilization existing” or “the restaurant having tasty food”?
Or do you reject the premise that your causal model should look like a tangled mess, and instead assert that it is possible to have a useful tree-shaped causal model (i.e. one that does not contain joining branches or loops)?
Let’s say your causal model looks something like this:
What causes you to specifically call out “sunblessings” as the “correct” upstream node in the world model of why you take your friend to dinner, as opposed to “fossil fuels” or “the big bang” or “human civilization existing” or “the restaurant having tasty food”?
Nothing in this causal model centers the sun; that’s precisely what makes it so broken.
Fossil fuels, the big bang, and human civilization are not what you offered to your friend. Tastiness is a sensory quality, which is a superficial matter. If you offer your friend something that you think they superficially assume to be better than you really think it is, that is hardly a nice gesture.
Or do you reject the premise that your causal model should look like a tangled mess, and instead assert that it is possible to have a useful tree-shaped causal model (i.e. one that does not contain joining branches or loops)?
I wouldn’t rule out that you could sometimes have joining branches and loops, but mechanistic models tend to have far too many of them. (Admittedly your given model isn’t super mechanistic, but it’s still directionally mechanistic compared to what I’m advocating.)
I don’t think I understand, concretely, what a non-mechanistic model looks like in your view. Can you give a concrete example of a useful non-mechanistic model?
Something that tracks resource flows rather than information flows. For example, if you have a company, you can have nodes for the revenue from each of the products you are selling, aggregating into product-category nodes and finally into total revenue, which then branches off into profits and different clusters of expenses, with each cluster branching off into narrower expenses. This sort of thing is useful because it makes it practical to study phenomena by looking at their accounting.
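A minimal sketch of the resource-flow tree described above, where leaf nodes hold amounts and internal nodes aggregate their children; all product names and figures here are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A node in a resource-flow tree: leaves hold an amount, internal nodes sum children."""
    name: str
    amount: float = 0.0
    children: list["Node"] = field(default_factory=list)

    def total(self) -> float:
        if self.children:
            return sum(c.total() for c in self.children)
        return self.amount

# Hypothetical company: per-product revenue aggregates into categories,
# then into total revenue; expenses branch into narrower clusters.
revenue = Node("revenue", children=[
    Node("hardware", children=[Node("widget A", 120.0), Node("widget B", 80.0)]),
    Node("services", children=[Node("support", 50.0)]),
])
expenses = Node("expenses", children=[
    Node("salaries", 150.0),
    Node("rent", 40.0),
])

profit = revenue.total() - expenses.total()
print(profit)  # 60.0
```

Studying a phenomenon "by its accounting" then amounts to walking this tree and asking which branch a change in the total came from.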
Sure, that’s also a useful thing to do sometimes. Is your contention that simple concentrated representations of resources and how they flow do not exist in the activations of LLMs that are reasoning about resources and how they flow?
If not, I think I still don’t understand what sort of thing you think LLMs don’t have a concentrated representation of.
It’s clearer to me that the structure of the world is centered on emanations, erosions, bifurcations, and accumulations branching out from the sun than that these phenomena can be modelled purely as resource flows. Really, even “from the sun” is somewhat secondary; I originally came to this line of thought while statistically modelling software performance problems, which led to a model I call “linear diffusion of sparse lognormals”.
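One rough way to illustrate the flavor of such a model (this is my own guess at the idea from the name, not the author’s definition): a total, such as a program’s runtime, arises as a linear sum of many contributions, most of which are inactive, with the active ones drawn from a heavy-tailed lognormal, so that a handful of terms dominate the total:

```python
import numpy as np

# Illustrative sketch only: a linear sum of sparse, lognormally-sized terms.
rng = np.random.default_rng(1)
n = 1000
active = rng.random(n) < 0.05                            # sparse: few components fire
sizes = np.exp(rng.normal(loc=0.0, scale=2.0, size=n))   # lognormal magnitudes
contributions = np.where(active, sizes, 0.0)             # inactive terms contribute 0
total = contributions.sum()

# Heavy tails: a few large terms account for most of the sum.
top5_share = np.sort(contributions)[-5:].sum() / total
print(f"top 5 of {active.sum()} active terms carry {top5_share:.0%} of the total")
```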
I could imagine setting up a prompt that makes the network represent things in this format, at least in some fragments of it. However, that’s not what you need in order to interpret the network, because that’s not how people use the network in practice, so it wouldn’t be informative about how the network works.
Instead, an interpretation of the network would be constituted by a map which shows how different branches of the world impacted the network. In the simplest form, you could imagine slicing up the world into categories (e.g. plants, animals, fungi) and then decomposing the weight vector of the network into a sum of the part due to plants, the part due to animals, and the part due to fungi (plus, presumably, interaction terms and such).
Of course in practice people use LLMs in a pretty narrow range of scenarios that don’t really match plants/animals/fungi, and the training data is probably heavily skewed towards the animals (and especially humans) branch of this tree, so realistically you’d need some more pragmatic model.
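A toy sketch of what such a decomposition could mean, under a strong simplifying assumption I am supplying (not from the source): a model trained by summing per-example gradient updates, so each update can be attributed to the category of the example that produced it, and the per-category pieces sum back exactly to the final weight vector. (Interactions are still hidden inside, since each gradient depends on the weights accumulated so far.)

```python
import numpy as np

# Hypothetical toy setup: linear model, squared-error loss, plain SGD.
rng = np.random.default_rng(0)
categories = {"plants": 30, "animals": 50, "fungi": 20}  # made-up example counts
dim = 8
lr = 0.1

w = np.zeros(dim)
contrib = {c: np.zeros(dim) for c in categories}

for cat, n in categories.items():
    for _ in range(n):
        x = rng.normal(size=dim)
        y = rng.normal()
        grad = (w @ x - y) * x      # squared-error gradient for this example
        update = -lr * grad
        w += update
        contrib[cat] += update      # attribute this update to its category

# The per-category contributions sum back to the full weight vector.
total = sum(contrib.values())
assert np.allclose(w, total)
```

The pragmatic version alluded to above would replace plants/animals/fungi with categories matched to how the training data and usage are actually distributed.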