Just to check: in the toy scenario, we assume the features in R^n are the coordinates in the standard basis, so we have n features X_1, …, X_n.
Yes, that’s correct.
Separately, do you have intuition for why they allow the network to learn b as well? Why not just set b to zero?
My understanding is that the bias is thought to be useful for two reasons:
1. It lets the model output a non-zero value (namely the expected value) for features it chooses not to represent.
2. A negative bias lets the model zero out small interference: it shifts the pre-activation negative so that the ReLU outputs zero. Empirically, when these toy models exhibit a lot of superposition, the bias vector typically has many negative entries.
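To make the second point concrete, here is a minimal numpy sketch of the toy model's reconstruction ReLU(W^T W x + b). This is my own illustration, not code from the paper, and the specific values of W and b are made up: two features share a single hidden direction, and a small negative bias pushes the interference term below zero so the ReLU clips it away while the represented feature survives.

```python
import numpy as np

# Toy reconstruction x_hat = ReLU(W^T W x + b) with n = 2 features squeezed
# into m = 1 hidden dimension. W and b below are hypothetical values chosen
# only to illustrate the interference-suppression effect of a negative bias.
W = np.array([[0.95, 0.30]])   # shape (1, 2): both features share the one hidden direction
b = np.array([-0.30, -0.30])   # small negative bias on each output feature

def reconstruct(x):
    pre_activation = W.T @ (W @ x) + b      # W^T W x + b
    return np.maximum(pre_activation, 0.0)  # ReLU

x = np.array([1.0, 0.0])   # only feature 1 is active
print(reconstruct(x))      # -> [0.6025, 0.0]
# Feature 1 is recovered (0.95^2 - 0.30 = 0.6025), while the interference that
# feature 1 causes on feature 2 (0.30 * 0.95 = 0.285) is pushed below zero by
# the bias and clipped to zero by the ReLU. Without the bias, the output for
# feature 2 would be a spurious 0.285.
```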
Thanks for the thoughts --
I used the term “importance” since this was the term used in Anthropic’s original paper. I agree that (unlike in a real model) my toy scenario doesn’t contain sufficient information to deduce the context from the input data.
I like your phrasing of the task—it does a great job of concisely highlighting the ‘Mathematical Intuition for why Conditional Importance “doesn’t matter”’
Interesting that the experiment was helpful for you!