Good work! A few questions:
Where do the edges you draw come from? IIUC, this method should result in a collection of features but not say what the edges between them are.
IIUC, the binary masking technique here is the same as the subnetwork probing (SP) baseline from the ACDC paper, where it seemed to work about as well as ACDC (which in turn works a bit worse than attribution patching). Do you know why you’re finding something different here? Some ideas (for concreteness, I’ve put a rough code sketch of both techniques at the end of this comment):
The SP vs. ACDC comparison from the ACDC paper wasn’t really apples-to-apples, because ACDC pruned edges whereas SP pruned nodes (and, IIUC, kept all edges between non-pruned nodes). If Syed et al. had compared attribution patching on nodes vs. subnetwork probing, they would have found that subnetwork probing was better.
There’s something special about SAE features that changes which subnetwork-discovery technique works best.
I’d be a bit interested in seeing your experiments repeated for finding subnetworks of neurons (instead of subnetworks of SAE features); does the comparison between attribution patching/integrated gradients and training a binary mask still hold in that case?
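For concreteness, here’s a rough sketch (in PyTorch) of the two techniques I’m comparing above. All of the scaffolding is hypothetical rather than taken from your setup: `run_with_features` stands in for a forward pass that patches a vector of SAE feature activations into the model and returns a task loss, and `clean_feats`/`corrupt_feats` are the feature activations on clean vs. corrupted prompts. I’ve also used a plain sigmoid relaxation for the mask, whereas the SP baseline in the ACDC paper samples from a hard-concrete distribution.

```python
import torch

def attribution_patching_scores(run_with_features, clean_feats, corrupt_feats):
    """Linear (first-order) estimate of each feature's effect on the loss,
    i.e. the attribution-patching approximation to activation patching."""
    clean = clean_feats.clone().requires_grad_(True)
    loss = run_with_features(clean)
    (grad,) = torch.autograd.grad(loss, clean)
    # Predicted change in loss from patching feature i from clean to corrupt.
    return grad * (corrupt_feats - clean_feats)

def train_binary_mask(run_with_features, clean_feats, corrupt_feats,
                      sparsity_coeff=1e-2, steps=1000, lr=1e-2):
    """SP-style baseline: learn a relaxed binary mask over features,
    trading off task loss against the number of features kept clean."""
    logits = torch.zeros_like(clean_feats, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        mask = torch.sigmoid(logits)  # soft {0, 1} mask, one entry per feature
        mixed = mask * clean_feats + (1 - mask) * corrupt_feats
        loss = run_with_features(mixed) + sparsity_coeff * mask.sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(logits) > 0.5  # features the mask keeps clean
```

Note that both of these score nodes (features) rather than edges, which is part of why the apples-to-apples worry in my first bullet above seems relevant. The neuron version of the experiment I’m asking about would just swap raw neuron activations in for `clean_feats`/`corrupt_feats` in the same scaffolding.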
I really like the framing here, of asking whether we’ll see massive compute automation before [AI capability level we’re interested in]. I often hear people discuss nearby questions using IMO much more confusing abstractions, for example:
“How much are AI capabilities driven by algorithmic progress?” (problem: obscures the dependence of algorithmic progress on compute for experimentation)
“How much AI progress can we get ‘purely from elicitation’?” (lots of problems, e.g. that eliciting a capability might first require a (possibly one-time) expenditure of compute for exploration)
Is this because:
You think that we’re >50% likely to not get AIs that dominate top human experts before 2040? (I’d be surprised if you thought this.)
The words “the feasibility of” importantly change the meaning of your claim in the first sentence? (I’m guessing it’s this based on the following parenthetical, but I’m having trouble parsing.)
Overall, it seems like you put substantially higher probability than I do on getting takeover-capable AI without massive compute automation (and especially on getting a software-only singularity). I’d be very interested in understanding why. A brief outline of why this doesn’t seem that likely to me:
My read of the historical trend is that AI progress has come from scaling up all of the factors of production in tandem (hardware, algorithms, compute expenditure, etc.).
Scaling up hardware production has always been slower than scaling up algorithms, so this consideration is already factored into the historical trends. I don’t see a reason to believe that algorithms will start running away with the game.
Maybe you could counter-argue that algorithmic progress has only reflected returns to scale from AI being applied to AI research in the last 12–18 months, and that the data from this period is consistent with algorithms becoming more important relative to other factors?
I don’t see a reason that “takeover-capable” is a capability level at which algorithmic progress will be deviantly important relative to this historical trend.
I’d be interested either in hearing you respond to this sketch or in sketching out your reasoning from scratch.