Mark Xu comments on Should we publish mechanistic interpretability research?

Mark Xu Apr 21, 2023, 9:47 PM
14 points
I don’t really want to argue about language. I’ll defend “almost no individual has a pretty substantial effect on capabilities.” I think publishing norms could have a pretty substantial effect on capabilities, and also a pretty substantial effect on interpretability, and currently think the norms suggested have a tradeoff that’s bad-on-net for x-risk.

Chris Olah’s interpretability work is one of the most commonly used resources in graduate and undergraduate ML classes, so people clearly think it helps you get better at ML engineering.

I think this is false, and that most ML classes are not about making people good at ML engineering. I think Olah’s stuff is disproportionately represented because it’s interesting and presented well, and also that classes really love being “rigorous” or something in ways that are random. Similarly, proofs of the correctness of backprop are probably common in ML classes, but not that relevant to being a good ML engineer.

I also bet that if we were to run a survey on what blogposts and papers top ML people would recommend that others should read to become better ML engineers, you would find a decent number of Chris Olah’s publications in the top 10 and top 100.

I would be surprised if lots of ML engineers thought that Olah’s work was in the top 10 best things to read to become a better ML engineer. I have fewer beliefs about the top 100. I would take even odds (and believe something closer to 4:1 or whatever) that if you surveyed good ML engineers and asked for top 10 lists, not a single Olah interpretability piece would be in the top 10 most mentioned things. I think most of the stuff will be random things about e.g. debugging workflow, how to deal with computers, how to use libraries effectively, etc. If anyone is good at ML engineering and wants to chime in, that would be neat.

I don’t understand why we should have a prior that interpretability research is inherently safer than other types of ML research?

Idk, I have the same prior about trying to e.g. prove various facts about ML stuff, or do statistical learning theory type things, or a bunch of other stuff. It’s just like, if you’re not trying to eke out more oomph from SGD, then probably the stuff you’re doing isn’t going to allow you to eke out more oomph from SGD, because it’s kinda hard to do that and people are trying many things.
- habryka Apr 21, 2023, 10:55 PM
  2 points
  I don’t really want to argue about language. I’ll defend “almost no individual has a pretty substantial effect on capabilities.” I think publishing norms could have a pretty substantial effect on capabilities, and also a pretty substantial effect on interpretability, and currently think the norms suggested have a tradeoff that’s bad-on-net for x-risk.
  Yep, makes sense. No need to argue about language. In that case I do think Gwern is a pretty interesting datapoint, and seems worth maybe digging more into.
  I would be surprised if lots of ML engineers thought that Olah’s work was in the top 10 best things to read to become a better ML engineer. I have fewer beliefs about the top 100. I would take even odds (and believe something closer to 4:1 or whatever) that if you surveyed good ML engineers and asked for top 10 lists, not a single Olah interpretability piece would be in the top 10 most mentioned things. I think most of the stuff will be random things about e.g. debugging workflow, how to deal with computers, how to use libraries effectively, etc. If anyone is good at ML engineering and wants to chime in, that would be neat.
  I would take a bet at 2:1 in my favor for the top 10 thing. Top 10 is a pretty high bar, so I am not at even odds.
  Idk, I have the same prior about trying to e.g. prove various facts about ML stuff, or do statistical learning theory type things, or a bunch of other stuff. It’s just like, if you’re not trying to eke out more oomph from SGD, then probably the stuff you’re doing isn’t going to allow you to eke out more oomph from SGD, because it’s kinda hard to do that and people are trying many things.
  Hmm, yeah, I do think I disagree with the generator here, but I don’t feel super confident and this perspective seems at least plausible to me. I don’t believe it with enough probability to make me think that there is negligible net risk, and I feel like I have a relatively easy time coming up with counterexamples from science and other industries (the nuclear scientists working on nuclear fission did indeed not work on making weapons, and many people were working on making weapons).
  Not sure how much it’s worth digging more into this here.