Yeah, I’m keen to add exercises on interpretability. I like the direction of yours, but it feels a bit too hard: it’s a pretty broad request, so it’s difficult for people to know where to start or how much progress they’re making. Any ideas on what more specific things we could ask them to do, or ways to make the exercise more legible to them?
That’s a fair point. I had thought this would be around the same level of difficulty as some of the exercises in the list such as “Produce a proposal for the ELK prize”. But I’m probably biased because I have spent a bit of time working in this area already.
Off the top of my head, I don’t know of any ways to decompose the problem or simplify it further, but I’ll post back if I think of any. I think it will help as Lucid and Lucent get better, or perhaps if Anthropic open-sources their interpretability tooling. That could make it significantly easier to onboard people to these kinds of problems and scale up the effort.
Difference IMO is mainly that Circuits steps you through the problem in a way designed to help you understand their thinking, whereas ELK steps you through the problem in a way designed to get people to contribute.
(Perhaps “produce a proposal for something to investigate” might be of a similar difficulty to the ELK prize, but Circuits work is also much more bottom-up, so it seems hard to know what to latch onto before having played around a bunch. Agreed that new tooling for playing around with things would help a lot.)