A project I’ve been sitting on that I’m probably not going to get to for a while:
Improving on Automatic Circuit Discovery (ACDC) and Edge Attribution Patching (EAP) by modifying them to use algorithms that can detect complete boolean circuits. As it stands, both effectively use wire-by-wire patching, which, when run on any nontrivial boolean circuit, can only detect small subgraphs.
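A toy sketch of the failure mode, assuming a hypothetical three-wire circuit `out = (a OR b) AND c` rather than anything from the actual ACDC or EAP code. Patching one wire at a time from a clean run only flags `c`, because `a` and `b` back each other up through the OR gate. The brute-force search over subsets is just there to show what a complete answer looks like; it is exponential and is not the proposed algorithm.

```python
from itertools import combinations

def circuit(w):
    # Hypothetical toy circuit (an assumption for illustration): out = (a OR b) AND c
    return (w["a"] or w["b"]) and w["c"]

clean = {"a": True, "b": True, "c": True}
corrupted = {"a": False, "b": False, "c": False}

def patched_output(wires_to_patch):
    """Run on the clean inputs, overwriting the given wires with corrupted values."""
    w = dict(clean)
    for name in wires_to_patch:
        w[name] = corrupted[name]
    return circuit(w)

baseline = circuit(clean)

# Wire-by-wire patching: a wire counts as important only if patching it alone flips the output.
single_wire = {name for name in clean if patched_output([name]) != baseline}

# Brute force over all subsets: every set of wires whose joint patch flips the output...
flipping_sets = [
    set(s)
    for r in range(1, len(clean) + 1)
    for s in combinations(clean, r)
    if patched_output(s) != baseline
]
# ...reduced to the minimal ones (no strict subset also flips the output).
minimal_sets = [s for s in flipping_sets if not any(t < s for t in flipping_sets)]

print("wire-by-wire finds:", single_wire)
print("minimal flipping sets:", minimal_sets)
```

Here the wire-by-wire pass reports only `c`, while the subset search also recovers `{a, b}`, so the OR gate's inputs are invisible to single-wire patching even though they are part of the circuit.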
It’s a bit unclear how useful this will be, because:
I’m not sure how useful mech interp is in the first place
I’m not sure this is where mech interp’s usefulness is bottlenecked
Attribution patching might not work well when patching clean activations into a corrupted baseline, which would make this much slower
But I think it’ll be a good project to bang out for the experience, and I’m curious how the results will compare to ACDC/EAP.
This project started out as an ARENA capstone in collaboration with Carl Guo.