james.lucassen comments on james.lucassen’s Shortform

james.lucassen 6 Mar 2024 4:16 UTC
A project I’ve been sitting on that I’m probably not going to get to for a while:
Improving on Automatic Circuit Discovery (ACDC) and Edge Attribution Patching (EAP) by modifying them to use algorithms that can detect complete boolean circuits. As it stands, both effectively use wire-by-wire patching, which when run on any nontrivial boolean circuit can only detect small subgraphs.
It’s a bit unclear how useful this will be, because:
- not sure how useful I think mech interp is
- not sure if this is where mech interp’s usefulness is bottlenecked
- maybe attribution patching doesn’t work well when patching clean activations into a corrupted baseline, which would make this much slower
But I think it’ll be a good project to bang out for the experience, and I’m curious how the results will compare to ACDC/EAP.
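
To make the wire-by-wire limitation concrete, here is a minimal toy sketch. It is not ACDC or EAP, just a hand-written boolean circuit, and the names (`circuit`, `patch`, `single_wire_effects`, `minimal_joint_sets`) are all made up for illustration. The circuit is `(a OR b) AND c`; with every input true, corrupting `a` or `b` alone never flips the output, so single-wire patching only flags `c`, while patching sets of wires jointly also recovers the `{a, b}` pair.

```python
from itertools import combinations

def circuit(wires):
    # Hypothetical toy circuit: out = (a OR b) AND c
    return (wires["a"] or wires["b"]) and wires["c"]

clean = {"a": True, "b": True, "c": True}
corrupted = {"a": False, "b": False, "c": False}

def patch(base, source, names):
    # Copy `base`, overwriting the named wires with their values from `source`.
    patched = dict(base)
    for name in names:
        patched[name] = source[name]
    return patched

def single_wire_effects(base, source):
    # Wire-by-wire patching: does corrupting one wire alone change the output?
    base_out = circuit(base)
    return {n: circuit(patch(base, source, [n])) != base_out for n in base}

def minimal_joint_sets(base, source):
    # Brute-force joint patching: smallest sets of wires whose joint
    # corruption flips the output. Exponential in the number of wires,
    # so this is only an illustration, not a proposed algorithm.
    base_out = circuit(base)
    found = []
    for k in range(1, len(base) + 1):
        for subset in combinations(sorted(base), k):
            if any(set(f) <= set(subset) for f in found):
                continue  # a smaller flipping set is already contained in it
            if circuit(patch(base, source, subset)) != base_out:
                found.append(subset)
    return found

single = single_wire_effects(clean, corrupted)
joint = minimal_joint_sets(clean, corrupted)
print(single)  # {'a': False, 'b': False, 'c': True}
print(joint)   # [('c',), ('a', 'b')]
```

The brute-force search is obviously hopeless at model scale; the point is only that the OR gate is invisible to single-wire patching here.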

This project started out as an ARENA capstone in collaboration with Carl Guo.