Curated. I’ve often felt that mech interp seems like a subfield of alignment work that has good traction and is making progress. And in my experience that’s a fairly common view. If that were true, that would be a pretty big deal. It might be that we could make a big dent with a pretty scalable field of research. So it seems pretty valuable to read thoughtful arguments to the contrary.
I gotta say I have some hesitation in curating this dialogue. The best and most informative topics are kind of diffused over the dialogue, and I feel like I never quite get enough depth or concreteness to really think through the claims. I think my main takeaways are:
(a) presumably a lot of safety-relevant stuff is in the diff between weaker and stronger models, and so you have to think about how you’d tell you’re explaining that diff, and
(b) a question about whether ‘induction heads exist’ or not, and what that means for whether or not mech interp has started making meaningful progress.