Resample ablation is not more expensive than mean ablation (both just replace activations with different values). But to answer the question: I think you would, since resample ablation biases the model toward some particular corrupt output.
Ah, I guess I was incorrectly imagining a more expensive version of resample ablation where you look at not just a single corrupted cache but at the result across all corrupted inputs. That is, in the simple toy model where you're computing $f(x, y)$, with $x$ the values for the circuit you care about and $y$ the cache of corrupted activations, mean ablation computes $f(x, \mathbb{E}_{y \sim D}[y])$, and we could imagine versions of resample ablation that compute $f(x, y)$ for some single $y$ drawn from $D$, or that compute $\mathbb{E}_{y \sim D}[f(x, y)]$. I would say that mean ablation and resample ablation as I'm imagining you're describing it are both attempts to cheaply approximate $\mathbb{E}_{y \sim D}[f(x, y)]$.
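For concreteness, here is a minimal sketch of the three quantities. The function `f` and the distribution `D` are hypothetical stand-ins (not from any real interpretability library), just to make the distinction between $f(x, \mathbb{E}[y])$, a single-sample $f(x, y)$, and $\mathbb{E}_{y \sim D}[f(x, y)]$ executable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: x is the clean activations for the circuit of
# interest, D is a batch of corrupted caches y ~ D, f combines the two.
x = np.array([1.0, 2.0])
D = rng.normal(size=(1000, 2))  # samples of corrupted caches

def f(x, y):
    # Toy nonlinear readout; any nonlinearity makes the quantities differ.
    return np.tanh(x @ y)

# Mean ablation: evaluate f at the average corrupted cache, f(x, E[y]).
mean_ablation = f(x, D.mean(axis=0))

# Resample ablation (as described): f(x, y) for one y drawn from D.
resample_ablation = f(x, D[rng.integers(len(D))])

# The quantity both are cheaply approximating: E_{y~D}[f(x, y)].
expected_f = np.mean([f(x, y) for y in D])

print(mean_ablation, resample_ablation, expected_f)
```

When `f` is nonlinear, $f(x, \mathbb{E}[y])$ and $\mathbb{E}[f(x, y)]$ generally differ, which is why both ablations are only approximations of the expectation.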