Could you elaborate on how we could do this? I’m unsure if the state of interpretability research is good enough for this yet.
I don’t have a particular idea in mind, but the current SOTA in interp is identifying how ~medium-sized LMs implement certain behaviors, e.g. IOI (indirect object identification), or fully understanding smaller networks on toy tasks like modular addition or parenthesis balance checking. The RL agents used in Langosco et al. are much smaller than those LMs, so it should be possible to identify the circuits that implement particular behaviors in them as well. There’s also the advantage that conv nets on vision domains are often significantly easier to interp than LMs, e.g. because feature visualization works on them.
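To make that last point concrete, here is a minimal sketch of feature visualization via activation maximization. It assumes a torchvision ResNet as a stand-in for a policy’s visual encoder; the layer and channel choices are arbitrary illustrative picks, not anything from the paper:

```python
# Minimal sketch: feature visualization by activation maximization.
# A torchvision ResNet stands in for an RL policy's visual encoder;
# the layer and channel below are arbitrary illustrative choices.
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

acts = {}
def save_act(module, inp, out):
    acts["feat"] = out

layer = model.layer3[0].conv1          # arbitrary conv layer
handle = layer.register_forward_hook(save_act)

img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)
channel = 7                            # arbitrary channel to visualize

for _ in range(256):
    opt.zero_grad()
    model(img)
    loss = -acts["feat"][0, channel].mean()  # ascend the channel's activation
    loss.backward()
    opt.step()

handle.remove()
# img now roughly shows what drives that channel; in practice you'd add
# regularizers (jitter, total variation, etc.) as in the Distill
# feature-visualization work.
```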
If I had to spitball some random ideas in this space:
- Reproduce one of the coinrun run-toward-the-right agents, figure out the circuit or lottery ticket that implements the “run toward the right” behavior using techniques like path patching or causal scrubbing (see the patching sketch after this list), then look at intermediate checkpoints to see how it develops.
- Reproduce one of the coinrun run-toward-the-right agents, then retrain it so it goes after the coin. Interp various checkpoints to see how this new behavior develops over time.
- Reproduce one of the coinrun run-toward-the-right agents, and do mechanistic interp to figure out circuits for various more fine-grained behaviors, e.g. avoiding pits or jumping over ledges.
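For the patching step in the first idea, here is a minimal sketch of single-layer activation patching, the simplest version of that family of techniques. Everything here is a placeholder: `CoinrunPolicy` is a toy stand-in (the actual Langosco et al. agents use a procgen IMPALA-style CNN), and the observations are random tensors rather than real frames:

```python
# Minimal sketch: single-layer activation patching on a toy stand-in policy.
# All names (CoinrunPolicy, the observations, the layer choice) are
# placeholders, not the actual architecture from the paper.
import torch
import torch.nn as nn

class CoinrunPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
        )
        self.head = nn.Linear(32 * 6 * 6, 15)  # procgen has 15 discrete actions

    def forward(self, x):
        return self.head(self.conv(x).flatten(1))

policy = CoinrunPolicy().eval()

# Two frames that ideally differ only in the feature under study,
# e.g. coin in its usual spot vs. coin moved; random tensors here.
obs_clean = torch.randn(1, 3, 64, 64)
obs_corrupt = torch.randn(1, 3, 64, 64)

target_layer = policy.conv[2]  # second conv layer, arbitrary choice

# 1. Cache the clean activation at the target layer.
cache = {}
h = target_layer.register_forward_hook(
    lambda m, i, o: cache.update(clean=o.detach()))
logits_clean = policy(obs_clean)
h.remove()

# 2. Run the corrupt frame, patching in the clean activation
#    (returning a value from a forward hook replaces the layer's output).
h = target_layer.register_forward_hook(lambda m, i, o: cache["clean"])
logits_patched = policy(obs_corrupt)
h.remove()

logits_corrupt = policy(obs_corrupt)

# If patching this layer restores the "run right" action logits, the layer
# carries (part of) the circuit; sweep layers/channels to localize it.
print((logits_patched - logits_corrupt).abs().max())
```

Path patching and causal scrubbing refine this idea: the former patches along specific computational paths rather than whole layers, and the latter resamples activations under an explicit hypothesis about the circuit and checks that behavior is preserved.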
IIRC some other PhD students at CHAI were interping reward models, though I’m not sure what came of that work.