I believe what you describe is effectively Causal Scrubbing.
Edit: Note that it is not exactly the same as causal scrubbing, which looks at the activations for another input sampled at random.
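To make the distinction concrete, here is a minimal sketch (with hypothetical function names, using NumPy) contrasting the two ablations: causal scrubbing resamples an activation from another randomly chosen input, while the replacement described above substitutes random noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_ablate(acts: np.ndarray, rng) -> np.ndarray:
    """Causal-scrubbing style: replace each example's activation
    with the activation from another input drawn at random
    (here, a random permutation within the batch)."""
    idx = rng.permutation(len(acts))
    return acts[idx]

def noise_ablate(acts: np.ndarray, rng) -> np.ndarray:
    """Noise replacement: substitute Gaussian noise matched to the
    per-dimension mean and std of the original activations."""
    return rng.normal(acts.mean(axis=0), acts.std(axis=0), size=acts.shape)

# Toy batch of activations (8 inputs, 4 hidden dimensions).
acts = rng.normal(size=(8, 4))
scrubbed = resample_ablate(acts, rng)
noised = noise_ablate(acts, rng)
```

The difference matters for the diagnostic discussed below: resampled activations retain whatever structure the model actually uses, while matched noise destroys it; comparing the two tells you whether a component's contribution is structureless.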
On our particular model, doing this replacement shows that the noise bound is actually about 4 standard deviations worse than random, probably because the training procedure (sequences chosen uniformly at random) means we care a lot more about large possible maxes than small ones. (See Appendix H.1.2 for some very sparse details.)
On other toy models we’ve looked at (modular addition in particular, writeup forthcoming), we have (very) preliminary evidence suggesting that randomizing the noise produces a steep drop-off in bound-tightness (as a function of how compact a proof the noise term admits), in a very similar fashion to what we see with proofs. There seems to be a pretty narrow band of hypotheses for which the noise is structureless but we can’t prove it. This is supported by a handful of comments about how causal scrubbing indicates that many existing mech interp hypotheses in fact don’t capture enough of the behavior.
That sounds very promising, especially that in some cases you can demonstrate that it really is just noise, while in others it seems more like behavior you don’t yet understand: it looks like noise, but replacing it with noise degrades performance. That sounds like a very useful diagnostic.