Could you elaborate on how we could do this? I’m unsure if the state of interpretability research is good enough for this yet.
I don’t have a particular idea in mind, but the current SOTA in interp is identifying how ~medium-sized LMs implement certain behaviors, e.g. IOI (indirect object identification), or fully understanding smaller networks on toy tasks like modular addition or parenthesis balance checking. The RL agents used in Langosco et al. are much smaller than those LMs, so it should be possible to identify the circuits that implement particular behaviors in them as well. There’s also the advantage that conv nets on vision domains are often significantly easier to interp than LMs, e.g. because feature visualization works on them.
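To make that last point concrete, here is a minimal sketch of feature visualization via activation maximization. It assumes a torchvision ResNet as a stand-in for a policy’s visual encoder; the layer and channel choices are arbitrary illustrative picks, not anything from the paper:

```python
# Minimal sketch: feature visualization by activation maximization.
# A torchvision ResNet stands in for an RL policy's visual encoder;
# the layer and channel below are arbitrary illustrative choices.
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

acts = {}
def save_act(module, inp, out):
    acts["feat"] = out

layer = model.layer3[0].conv1          # arbitrary conv layer
handle = layer.register_forward_hook(save_act)

img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)
channel = 7                            # arbitrary channel to visualize

for _ in range(256):
    opt.zero_grad()
    model(img)
    loss = -acts["feat"][0, channel].mean()  # ascend the channel's activation
    loss.backward()
    opt.step()

handle.remove()
# img now roughly shows what drives that channel; in practice you'd add
# regularizers (jitter, total variation, etc.) as in the Distill
# feature-visualization work.
```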
If I had to spitball some random ideas in this space:
- Reproduce one of the coinrun run-toward-the-right agents, figure out the circuit or lottery ticket that implements the “run toward the right” behavior using techniques like path patching or causal scrubbing (see the patching sketch after this list), then look at intermediate checkpoints to see how it develops.
- Reproduce one of the coinrun run-toward-the-right agents, then retrain it so it goes after the coin. Interp various checkpoints to see how this new behavior develops over time.
- Reproduce one of the coinrun run-toward-the-right agents, and do mechanistic interp to figure out circuits for various more fine-grained behaviors, e.g. avoiding pits or jumping over ledges.
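For the patching step in the first idea, here is a minimal sketch of single-layer activation patching, the simplest version of that family of techniques. Everything here is a placeholder: `CoinrunPolicy` is a toy stand-in (the actual Langosco et al. agents use a procgen IMPALA-style CNN), and the observations are random tensors rather than real frames:

```python
# Minimal sketch: single-layer activation patching on a toy stand-in policy.
# All names (CoinrunPolicy, the observations, the layer choice) are
# placeholders, not the actual architecture from the paper.
import torch
import torch.nn as nn

class CoinrunPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
        )
        self.head = nn.Linear(32 * 6 * 6, 15)  # procgen has 15 discrete actions

    def forward(self, x):
        return self.head(self.conv(x).flatten(1))

policy = CoinrunPolicy().eval()

# Two frames that ideally differ only in the feature under study,
# e.g. coin in its usual spot vs. coin moved; random tensors here.
obs_clean = torch.randn(1, 3, 64, 64)
obs_corrupt = torch.randn(1, 3, 64, 64)

target_layer = policy.conv[2]  # second conv layer, arbitrary choice

# 1. Cache the clean activation at the target layer.
cache = {}
h = target_layer.register_forward_hook(
    lambda m, i, o: cache.update(clean=o.detach()))
logits_clean = policy(obs_clean)
h.remove()

# 2. Run the corrupt frame, patching in the clean activation
#    (returning a value from a forward hook replaces the layer's output).
h = target_layer.register_forward_hook(lambda m, i, o: cache["clean"])
logits_patched = policy(obs_corrupt)
h.remove()

logits_corrupt = policy(obs_corrupt)

# If patching this layer restores the "run right" action logits, the layer
# carries (part of) the circuit; sweep layers/channels to localize it.
print((logits_patched - logits_corrupt).abs().max())
```

Path patching and causal scrubbing refine this idea: the former patches along specific computational paths rather than whole layers, and the latter resamples activations under an explicit hypothesis about the circuit and checks that behavior is preserved.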
IIRC some other PhD students at CHAI were interping reward models, though I’m not sure what came of that work.