So I’ve been developing dictionaries that automatically find interesting directions in activation space, which could just be an extra category. Here’s my post w/ interesting directions, including a German direction & a Title Case direction.
I don’t think this should be implemented in the next month, but once we have more established dictionaries for models, I’d be able to help provide you with data.
Additionally, it’d be useful to know which tokens in the context are most responsible for the activation. There is likely some gradient-based attribution method for this. Currently, I just ablate each token in the context, one at a time, and check which ones affect the activation the most. This really helps w/ bigram features, e.g. a “ the [word]” feature that activates for most words after “ the”.
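To make that concrete, here’s a minimal sketch of the per-token ablation loop, assuming a HuggingFace-style causal LM; `feature_activation` is a hypothetical stand-in for however your dictionary actually computes a feature’s activation, and the model name is just a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM that exposes hidden states works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def feature_activation(input_ids: torch.Tensor, direction: torch.Tensor, layer: int) -> float:
    """Hypothetical stand-in: project the residual stream at `layer`
    (final position) onto a dictionary direction of shape (d_model,)."""
    with torch.no_grad():
        out = model(input_ids, output_hidden_states=True)
    resid = out.hidden_states[layer][0, -1]  # (d_model,)
    return (resid @ direction).item()

def ablation_attribution(text: str, direction: torch.Tensor, layer: int):
    """Ablate each context token one at a time (here: swap in the BOS/EOS
    token) and record how much the feature's activation drops."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    baseline = feature_activation(input_ids, direction, layer)
    filler = tokenizer.bos_token_id or tokenizer.eos_token_id
    effects = []
    for i in range(input_ids.shape[1] - 1):  # keep the final token intact
        ablated = input_ids.clone()
        ablated[0, i] = filler
        drop = baseline - feature_activation(ablated, direction, layer)
        effects.append((tokenizer.decode([input_ids[0, i].item()]), drop))
    # Tokens whose ablation changes the activation most come first
    return sorted(effects, key=lambda t: -abs(t[1]))
```

For a bigram feature like “ the [word]”, the “ the” token should show up with by far the largest effect. A gradient × input attribution would give similar information without the N extra forward passes.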
Impact
The highest-impact part of this seems to be two-fold:
1. Gathering data for GPT-5 to be trained on (though not likely to amount to several GBs of data, but maybe)
2. Figuring out the best sources of information & heuristics for predicting activations
For both of these, I’m expecting people to also try to predict activations given an explanation (just as OpenAI’s auto-interp currently does).
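As a toy illustration of that direction (a sketch, not anyone’s actual pipeline): a model simulates per-token activations from the explanation, and the score is the correlation between simulated and real activations, which is the scoring metric in OpenAI’s auto-interp work:

```python
import numpy as np

def explanation_score(predicted: list[float], actual: list[float]) -> float:
    """Pearson correlation between simulated and true per-token activations;
    returns 0 when either series is constant (correlation undefined)."""
    p = np.asarray(predicted, dtype=float)
    a = np.asarray(actual, dtype=float)
    if p.std() == 0 or a.std() == 0:
        return 0.0
    return float(np.corrcoef(p, a)[0, 1])

# e.g. a " the [word]" explanation that correctly flags the post-"the" tokens:
print(explanation_score([0, 0, 8, 0, 8], [0, 1, 9, 0, 7]))  # ≈ 0.98
```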
Hi Logan, thanks for your response. Your dictionaries post is on my TODO list to investigate and integrate (someone referred me to it two weeks ago); I’d love to make it happen.
Thanks for joining the Discord. Let’s discuss when I have a few days to get caught up with your work.