In both ideas I’m not sure how you’re identifying features. Manual interpretability work on a (more complicated) toy model?
You write down an optimization problem over (say) linear combinations of image pixels, minimizing some measure of marginal returns to capacity given current network parameters (first idea) or overall importance as measured by absolute value of dL/dC_i, again given current network parameters (second idea). By looking just for the feature that is currently “most problematic” you may be able to sidestep the need to identify the full set of “features” (whatever that really means).
I don’t know exactly how you would formulate these objective functions, but it seems doable, no?
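To make the shape of the search concrete, here is a minimal sketch, assuming a differentiable score over unit-norm directions in pixel space. The real objective (marginal returns to capacity, or |dL/dC_i|) is not formulated in this thread, so `toy_score`, `features`, and `weights` are hypothetical stand-ins, and projected gradient ascent is just one possible optimizer.

```python
import numpy as np

def maximize_over_directions(score_grad, dim, steps=500, lr=0.1, seed=0):
    """Projected gradient ascent over unit-norm directions v."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    for _ in range(steps):
        v = v + lr * score_grad(v)   # ascent step on the score
        v /= np.linalg.norm(v)       # project back onto the unit sphere
    return v

# Hypothetical stand-in for the real objective:
#   score(v) = sum_k weights[k] * (features[k] . v)^2
dim = 16
features = np.eye(dim)[:3]           # three assumed "true" feature directions
weights = np.array([3.0, 2.0, 1.0])  # assumed importance of each feature

def toy_score(v):
    return float(weights @ (features @ v) ** 2)

def toy_score_grad(v):
    return 2.0 * features.T @ (weights * (features @ v))

v_star = maximize_over_directions(toy_score_grad, dim)
print(np.round(np.abs(v_star[:3]), 3), toy_score(v_star))
```

With this particular stand-in the search settles on the highest-weight direction (up to sign) without enumerating the other features. Whether a realistic objective behaves this nicely is exactly the open question.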
Oh I see! Sorry I didn’t realize you were describing a process for picking features.
I think this is a good idea to try, though I do have a concern. My worry is that if you run this on a model where you know what the features actually are, the procedure will discover some heavily polysemantic “feature” that makes better use of capacity than any of the actual features in the problem. Because dL/dC_i is not a linear function of the feature’s embedding vector, there can exist superpositions of features that have greater dL/dC_i than any individual feature.
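A toy illustration of this concern, with loud caveats: the quadratic `score` below is a hypothetical stand-in, not dL/dC_i, and the coupling matrix `M` is made up. It only shows the pattern that a score over directions can be larger at a mixture of two known features than at either feature alone.

```python
import numpy as np

# Two assumed "true" feature directions in a 2-D embedding space.
f1 = np.array([1.0, 0.0])
f2 = np.array([0.0, 1.0])

# Hypothetical score(v) = v^T M v; the off-diagonal entries couple the features.
M = np.array([[1.0, 0.5],
              [0.5, 1.0]])

def score(v):
    v = v / np.linalg.norm(v)   # compare directions, not magnitudes
    return float(v @ M @ v)

mix = (f1 + f2) / np.sqrt(2.0)  # an equal superposition of the two features

# The best unit direction for a quadratic form is the top eigenvector of M.
eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
top = eigvecs[:, -1]

print(score(f1), score(f2), score(mix), score(top))
```

Here each pure feature scores 1.0 while the superposition scores 1.5, and the unconstrained maximizer is that superposition (up to sign). An automated search on a model with known features would need some check against this kind of outcome.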
Anyway, I think this is a good thing to try, and I’d encourage someone to do so! I’m happy to offer guidance, feedback, or a chat to anyone interested in pursuing this, as automated feature identification seems like a really useful thing to have even if it turns out to be really expensive.