On priors, I wouldn’t worry too much about c), since I would expect a ‘super stimulus’ for head A to not be a super stimulus for head B.
I think one of the problems is the discrete input space, i.e. how do you parameterize the sequence that is being optimized?
One idea I just had was to fine-tune an LLM with a reward signal given by, for example, the magnitude of the residual delta coming from a particular head (we probably want something else here, maybe net logit change?). The LLM already encodes a prior over “sensible” sequences and will try to find one of those which activates the head strongly (however we want to operationalize that). A rough sketch of what that reward could look like is below.
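To make this concrete, here is a minimal sketch of the reward function using TransformerLens; the model name, layer, and head index are placeholder choices, and “magnitude of the residual delta” is operationalized here as the mean L2 norm of the head’s output written into the residual stream (z @ W_O for that head):

```python
import torch
from transformer_lens import HookedTransformer

# Placeholder choices for illustration: model, layer, and head index.
MODEL_NAME = "gpt2"
LAYER, HEAD = 5, 3

model = HookedTransformer.from_pretrained(MODEL_NAME)

def head_activation_reward(text: str) -> float:
    """Reward = mean L2 norm of the residual-stream delta written by one head.

    The head's contribution to the residual stream at each position is
    z @ W_O for that head; we average its norm over positions.
    """
    tokens = model.to_tokens(text)
    with torch.no_grad():
        _, cache = model.run_with_cache(tokens)
    z = cache["z", LAYER][:, :, HEAD, :]    # [batch, pos, d_head]
    delta = z @ model.W_O[LAYER, HEAD]      # [batch, pos, d_model]
    return delta.norm(dim=-1).mean().item()

print(head_activation_reward("The cat sat on the mat."))
```

In the fine-tuning setup this function would stand in for the reward model, e.g. scoring rollouts inside a PPO loop (something like trl’s PPOTrainer accepts arbitrary per-sample scalar rewards), so no learned reward model is needed; swapping in net logit change would just mean replacing the norm with the head’s contribution to the relevant logits.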