When using the MDL loss to motivate the simplicity loss in A.2.1, I don’t see why the rank penalty is linear in $\operatorname{rank}(P_c)$. That is, when it says
If we consider [the two rank-1 matrices that always co-activate] as one separate component, then we only need one index to identify both of them, and therefore only need $\log_2 C + 2b$ bits.
I’m not sure why this is $\log_2 C + 2b$ instead of $\log_2 C + b$. The reasoning in the rank-1 case seems to carry over unchanged: if we use $b$ bits of precision to store the scalar $a_c(x)$, then a sparse vector $A(x)$ takes $\|A(x)\|_0\,(\log_2 C + b)$ bits to store. The rank of $P_c$ doesn’t seem to play a part in this argument.
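To put the disagreement in a single formula (my notation, not necessarily the paper’s): the total cost of $A(x)$ under the two accountings would be

$$\sum_{c\,:\,a_c(x)\neq 0}\bigl(\log_2 C + \operatorname{rank}(P_c)\,b\bigr) \quad\text{versus}\quad \|A(x)\|_0\,\bigl(\log_2 C + b\bigr),$$

and the question is whether the factor of $\operatorname{rank}(P_c)$ on the left is justified.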
One way this could make sense is if you’re always storing the components $P_c$ as sums of rank-1 matrices, as later described in A.2.2. If you compute the attribution separately with respect to each rank-1 component, then it’ll take $\operatorname{rank}(P_c)\,(\log_2 C + b)$ bits to store (an index plus a value for each rank-1 piece). But it seems you compute attributions with respect to $P_c$ directly, rather than with respect to each rank-1 component separately. I’m not sure how to resolve this.
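Here is a toy bit-count making the three accountings concrete. This is my own formalization of the argument (the function names and numbers are invented for illustration), not code from the paper:

```python
import math

# C = number of components, b = bits of precision per stored scalar.
# Each accounting returns the description length of A(x) given the
# ranks of the active components.

def bits_rank_agnostic(active_ranks, C, b):
    """One index + one attribution scalar per active component,
    regardless of rank: ||A(x)||_0 * (log2(C) + b)."""
    return len(active_ranks) * (math.log2(C) + b)

def bits_linear_in_rank(active_ranks, C, b):
    """One index per active component, but rank-many scalars:
    sum over active c of (log2(C) + rank_c * b)."""
    return sum(math.log2(C) + r * b for r in active_ranks)

def bits_rank1_decomposed(active_ranks, C, b):
    """Every rank-1 piece stored separately, with its own index and
    scalar: sum over active c of rank_c * (log2(C) + b)."""
    return sum(r * (math.log2(C) + b) for r in active_ranks)

C, b = 1024, 16
separate = [1, 1]  # two rank-1 components that always co-activate
merged = [2]       # the same pair merged into one rank-2 component

for f in (bits_rank_agnostic, bits_linear_in_rank, bits_rank1_decomposed):
    print(f"{f.__name__}: {f(separate, C, b):.0f} -> {f(merged, C, b):.0f} bits")
```

On these numbers, merging saves exactly one index ($\log_2 C = 10$ bits) under the linear-in-rank counting, which matches the quoted claim; under the fully decomposed counting it saves nothing; and under the rank-agnostic counting the merged pair costs no more than a single rank-1 component, which is why the rank drops out of the penalty entirely.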
(This isn’t very important either way: if the description length doesn’t scale with the rank, the MDL loss would directly justify the minimality loss instead. You can still justify penalizing the sum of ranks of the $P_c$ as a natural version of the simplicity loss that could be added in alongside faithfulness, at the cost of a slight bit of conceptual unity.)
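(Schematically, the non-unified objective I’m imagining would look something like

$$\mathcal{L} \;=\; \mathcal{L}_{\text{faithfulness}} \;+\; \mathcal{L}_{\text{minimality}} \;+\; \lambda\sum_{c}\operatorname{rank}(P_c),$$

where $\lambda$ is a tradeoff weight introduced just for illustration, and the rank term would presumably be relaxed to a differentiable surrogate in practice.)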
Thanks, that’s a very helpful way of putting it!
Not having thought about it for very long, my intuition says “minimizing the description length of $A(x)$ definitely shouldn’t impose constraints on the components themselves,” i.e. “Alice has no use for the rank-1 attributions.” But I can see why it would be nice to find a way for Alice to want that information, and you probably have deeper intuitions for this.