Interpretability Challenges
Inspired by a friend, I’ve been thinking about how to launch/run interpretability competitions, and what the costs/benefits would be.
I like this idea a lot because it cuts directly at one of the hard problems of spinning up in interpretability research as a newcomer. The field is difficult and the objectives are vaguely defined; it’s easy to accidentally trick yourself into seeing signal in noise, and there’s never any certainty that the thing you’re looking for is actually there.
On the other hand, most interpretability-adjacent interventions on models (e.g. knowledge edits/updates to transformers) make models worse, not better: they usually introduce some specific and contained deficiency (e.g. predicting that the Eiffel Tower is in Rome, Italy). An injected deficiency like that is a known ground truth, which is exactly what a practice problem needs.
So the idea for Interpretability Challenges would be to use existing methods (or possibly invent new ones) to inject concrete “things to find” into models, release those models as challenges, and give prizes for finding them.
Some ways this might work:
Super simple challenge: use an editing technique like ROME to edit a model, upload it to Google Drive, and post a challenge to LessWrong. I’d probably personally put up a couple of prizes for good writeups of solutions. (A sketch of how a solver might check for an injected edit appears after this list.)
CTF (Capture the Flag): the AI Village has been interested in what sorts of AI challenges/competitions could be run in tandem with infosec conferences. I think it would be pretty straightforward to build some interpretability challenges for the next AI Village CTF, or to run a standalone interpretability-only CTF. This is exciting to me because it’s a way to recruit more people from infosec into getting interested in AI safety (which has been a goal of mine for a while).
Dixit-rules challenge league: one of the hard problems with challenges like this is setting the difficulty. Too hard and no one makes progress; too easy and no one learns/grows from it. If there were enough interested people/groups, we could run a Dixit-style tournament: each group takes turns proposing a challenge, and scores the most points if exactly one other group solves it (no points if every group solves it, or if none does). This has a nice self-balancing force, and would suit an ongoing group that builds new challenges as new interpretability research papers are published. (A sketch of the scoring rule also appears after this list.)
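To make the “super simple challenge” concrete, here’s a minimal sketch of how a challenge author or solver might black-box-check a released model for an injected edit, using the Eiffel-Tower-in-Rome example above. The model name is a hypothetical placeholder, not a real released checkpoint:

```python
# Sketch: black-box check that a challenge model contains an injected fact.
# "challenge-model" is a hypothetical placeholder for the released checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "challenge-model"  # hypothetical path/ID of the edited model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The Eiffel Tower is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# A ROME-edited model would continue with "Rome" rather than "Paris".
```

Of course, the real work of a challenge is localizing *where* the edit lives in the weights; this check only confirms the behavioral target exists.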
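And here’s a sketch of the Dixit-style scoring rule. The specific point values are my assumptions; the rule above only pins down the maximum at exactly one solver and zero points at the extremes:

```python
# Sketch of the Dixit-style scoring rule. Point values are assumptions;
# the rule only fixes: max points at exactly one solver, zero at the extremes.
def proposer_score(n_other_solvers: int, n_other_groups: int) -> int:
    if n_other_solvers in (0, n_other_groups):
        return 0  # too hard (no one solved it) or too easy (everyone did)
    if n_other_solvers == 1:
        return 3  # well-calibrated: exactly one other group solved it
    return 1      # solvable, but easier than intended


# Example: with 4 other groups, exactly one solver earns full points.
assert proposer_score(1, 4) == 3
assert proposer_score(0, 4) == 0
assert proposer_score(4, 4) == 0
```

The self-balancing force falls out of the payoff shape: proposers are pushed toward challenges right at the frontier of what the field can solve.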
Please reach out to me if you’re interested in helping with efforts like this.