ryan_greenblatt comments on Sparsify: A mechanistic interpretability research agenda

ryan_greenblatt 5 Apr 2024 15:47 UTC
LW: 2 AF: 2

“The combined object ‘(network, dataset)’ is much larger than the network itself”

Only by a constant factor under Chinchilla scaling laws, right (e.g. maybe 20x more tokens than params)? And spiritually, we only need to understand behavior on the training dataset to understand everything that SGD has taught the model.
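As a rough back-of-the-envelope sketch of the constant-factor point, assuming the roughly 20 tokens-per-parameter figure mentioned above (the function name and the example model sizes are purely illustrative, and this counts tokens against parameters rather than bits against bits):

```python
# Rough sketch: under a Chinchilla-style rule of thumb, training tokens scale
# linearly with parameters, so the dataset is a constant multiple of the network.
TOKENS_PER_PARAM = 20  # assumed ratio, taken from the "maybe 20x" figure above


def chinchilla_tokens(n_params: int, tokens_per_param: int = TOKENS_PER_PARAM) -> int:
    """Return the assumed compute-optimal number of training tokens."""
    return n_params * tokens_per_param


# Hypothetical model sizes, chosen only to show that the ratio does not change.
for n_params in (1_000_000_000, 10_000_000_000, 70_000_000_000):
    n_tokens = chinchilla_tokens(n_params)
    print(f"{n_params:.1e} params -> {n_tokens:.1e} tokens (ratio {n_tokens // n_params}x)")
```

Under this assumption the dataset-to-network ratio stays fixed as the model grows, which is the sense in which the dataset is larger "only by a constant factor".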
- Lee Sharkey 8 Apr 2024 11:29 UTC
  LW: 2 AF: 1
  Hm, I think of the (network, dataset) object as scaling multiplicatively with the size of the network and the size of the dataset. In the thread with Erik above, I touched a little bit on why:
  “SAEs (or decompiled networks that use SAEs as the building block) are supposed to approximate the original network behaviour. So SAEs are mathematical descriptions of the network, but not of the (network, dataset). What’s a mathematical description of the (network, dataset), then? It’s just what you get when you pass the dataset through the network: this datum interacts with this weight to produce this activation, that datum interacts with this weight to produce that activation, and so on. A mathematical description of the (network, dataset) in terms of SAEs is: this datum activates dictionary features xyz (where xyz is just indices and has no semantic info), that datum activates dictionary features abc, and so on.”
  “And spiritually, we only need to understand behavior on the training dataset to understand everything that SGD has taught the model.”
  Yes, I roughly agree with the spirit of this.
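
To make the "this datum activates dictionary features xyz" description concrete, here is a minimal toy sketch. Everything in it is hypothetical: the datum names, the feature indices, and the dictionary size of 100 are made up for illustration, and reading "multiplicative" as a dense data-times-features worst case is only one possible interpretation of the comment above.

```python
from typing import Dict, List

# Hypothetical toy data: each datum maps to the indices of the dictionary
# features it activates. The indices carry no semantic information.
active_features: Dict[str, List[int]] = {
    "datum_0": [3, 17, 42],
    "datum_1": [5, 17],
    "datum_2": [3, 8, 42, 99],
}


def description_size(activations: Dict[str, List[int]]) -> int:
    """Count the (datum, feature index) pairs in the description."""
    return sum(len(indices) for indices in activations.values())


def multiplicative_upper_bound(n_data: int, n_features: int) -> int:
    """Dense worst case: every datum could interact with every feature."""
    return n_data * n_features


N_FEATURES = 100  # assumed dictionary size, for illustration only

print(description_size(active_features))  # 9
print(multiplicative_upper_bound(len(active_features), N_FEATURES))  # 300
```

The sparse description grows with the number of data points, while the dense bound grows with the product of dataset size and dictionary size; either way, the description is of the (network, dataset) pair and not of the network alone.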