I’m accumulating a to-do list of experiments much faster than I can complete them:
- Characterizing fine-tuning effects with feature dictionaries (sketch below)
- Toy-scale automated neural network decompilation (difficult to scale; sketch below)
- Using soft prompts as a proxy measure of informational distance between models/conditions and behaviors (see note below; sketch below)
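A rough sketch of what the feature dictionary comparison could look like, in the usual sparse autoencoder framing: train a dictionary on the base model's activations, then compare per-feature firing rates on activations from the base versus fine-tuned model. All sizes, coefficients, and the stand-in activation tensors below are placeholder assumptions; a real run would collect activations from actual models on shared prompts.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """A feature dictionary: overcomplete ReLU autoencoder with an L1 sparsity penalty."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(features), features

def train_sae(acts: torch.Tensor, d_dict: int, l1_coeff: float = 1e-3, steps: int = 2000):
    sae = SparseAutoencoder(acts.size(-1), d_dict)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    for _ in range(steps):
        batch = acts[torch.randint(0, acts.size(0), (256,))]
        recon, features = sae(batch)
        loss = ((recon - batch) ** 2).mean() + l1_coeff * features.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return sae

# Stand-in activations; a real experiment would cache residual stream activations
# from the base and fine-tuned models on the same prompts.
acts_base = torch.randn(10_000, 512)
acts_tuned = acts_base + 0.1 * torch.randn_like(acts_base)

sae = train_sae(acts_base, d_dict=2048)
with torch.no_grad():
    rate_base = (sae(acts_base)[1] > 0).float().mean(dim=0)
    rate_tuned = (sae(acts_tuned)[1] > 0).float().mean(dim=0)
# Features whose firing rates shifted most are candidate answers to
# "what did fine-tuning actually change?"
print("most-shifted features:", (rate_tuned - rate_base).abs().topk(10).indices.tolist())
```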
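And a toy sketch of the decompilation idea, shrunk to about the smallest setting I can think of: train many tiny MLPs to implement random two-input boolean functions, then train a "decompiler" that maps flattened weights back to the truth table. The network sizes and the truth-table output format are placeholder assumptions; a real version would target richer programs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # all 2-bit inputs

def train_tiny_mlp(truth_table: torch.Tensor) -> torch.Tensor:
    """Train a tiny MLP to implement one boolean function; return its flattened weights."""
    net = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
    opt = torch.optim.Adam(net.parameters(), lr=0.05)
    for _ in range(200):
        loss = F.binary_cross_entropy_with_logits(net(X).squeeze(-1), truth_table)
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.cat([p.detach().flatten() for p in net.parameters()])

# Dataset of (weights, truth table) pairs; random init means the same function
# gets many different implementations, which is what the decompiler must see through.
tables = torch.randint(0, 2, (1000, 4)).float()
weights = torch.stack([train_tiny_mlp(t) for t in tables])

decompiler = nn.Sequential(nn.Linear(weights.size(1), 128), nn.ReLU(), nn.Linear(128, 4))
opt = torch.optim.Adam(decompiler.parameters(), lr=1e-3)
for _ in range(2000):
    idx = torch.randint(0, len(tables), (64,))
    loss = F.binary_cross_entropy_with_logits(decompiler(weights[idx]), tables[idx])
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():  # accuracy on the training pool; a real run would hold out nets
    acc = ((decompiler(weights).sigmoid() > 0.5).float() == tables).float().mean()
print(f"truth table recovery accuracy: {acc:.3f}")
```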
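Finally, a minimal sketch of the soft prompt proxy: freeze the model, optimize a short block of prepended embeddings until a target output becomes likely, and treat the loss achieved at a fixed soft prompt length (or the length needed to hit a loss threshold) as the distance measure. The model choice, target string, and hyperparameters below are placeholder assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad_(False)  # only the soft prompt gets gradients

# Stand-in for "a behavior": a target string the soft prompt should elicit.
target_ids = tokenizer("the behavior we want to elicit", return_tensors="pt").input_ids.to(device)
target_embeds = model.transformer.wte(target_ids)

n_soft = 8  # soft prompt length: the knob that sets the proxy's scale
soft_prompt = torch.nn.Parameter(
    0.01 * torch.randn(1, n_soft, model.config.n_embd, device=device))
opt = torch.optim.Adam([soft_prompt], lr=1e-2)

for step in range(500):
    inputs = torch.cat([soft_prompt, target_embeds], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Positions n_soft-1 .. end-1 predict the target tokens.
    pred = logits[:, n_soft - 1 : -1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# Comparing this number across models/conditions (or sweeping n_soft to find the
# shortest prompt that hits a loss threshold) gives the distance-like measure.
print(f"loss at n_soft={n_soft}: {loss.item():.4f}")
```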
If you wanted to take one of these (or a variant) and run with it, I wouldn’t mind!
The unifying theme behind many of these is goal agnosticism: understanding it, verifying it, maintaining it, and using it.
Note: I’ve already started some of these experiments, and I will very likely start others soon. If you (or anyone else reading this) see something you’d like to try, we should chat to avoid doing redundant work. I currently expect to focus on #4 (the soft prompts experiment) for the next handful of weeks, so that one is probably at the highest risk of redundancy.
Further note: I haven’t done a deep dive on all relevant literature; it could be that some of these have already been done somewhere! (If anyone happens to know of prior art for any of these, please let me know.)
I sometimes post experiment ideas on my shortform. If you see one that seems exciting and you want to try it, great! Please send me a message so we can coordinate and avoid doing redundant work.