What are the highest priority things (by your lights) in Alignment that nobody is currently seriously working on?
It’s not clear how to slice the space up into pieces so that you can ask “is someone working on this piece?” (and the answer depends a lot on that slicing). Here are two areas in robustness that feel kind of empty under my preferred way of slicing up the problem (though under a different slicing they could be reasonably crowded). These are also necessarily areas where I’m not doing any work, so I’m really out on a limb here.
I think there should be more theoretical work on neural net verification / relaxed adversarial training. I should probably update toward thinking it’s more of a dead end (and indeed practical verification work does seem to have run into a lot of trouble), but it looks to me like there must be more you can say, at least to show that various possible approaches are dead ends. I think a big problem is that you really need to keep the application in mind in order to know the rules of the game. (That is: we have a predicate A, say implemented as a neural network, and we want to learn a function f such that A(x, f(x)) holds for all x. But the problem is only supposed to be possible because in some sense the predicate A is “easy” to satisfy, and I don’t think we have a definition of “easy” other than actually going back and forth with the kind of treacherous turn we are concerned about.) I think many people are dissuaded by skepticism about having a spec implemented as a neural network, which I think is reasonable (and is part of why I’m working on low-stakes alignment first), but I think it’s still a worthwhile bullet for some people working on robustness to bite, and if you’re trying to work on robustness right now it seems like you have to bite some bullet like that. Probably the bigger problem is that people just don’t do much theoretical work on ML alignment at all.
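To make the setup in that parenthetical concrete, here is a rough sketch of the kind of loop I have in mind: an adversary searches for inputs x where A(x, f(x)) fails, and f is then trained against the worst inputs found. Everything here is an illustrative placeholder (the names spec_net and policy_net, the tiny architectures, the gradient-based adversary), not a description of any actual proposal:

```python
# Rough sketch only: spec_net (the predicate A) and policy_net (the function f)
# are placeholder names and tiny placeholder architectures, not a real proposal.
import torch
import torch.nn as nn

X_DIM, Y_DIM = 8, 4

# f: the function we're trying to learn.
policy_net = nn.Sequential(nn.Linear(X_DIM, 32), nn.ReLU(), nn.Linear(32, Y_DIM))
# A: the spec, a frozen network scoring how well (x, f(x)) satisfies it.
# (Left untrained here; where a trustworthy A comes from is the whole question.)
spec_net = nn.Sequential(nn.Linear(X_DIM + Y_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
for p in spec_net.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def spec_score(x):
    """A(x, f(x)): higher means the spec is better satisfied."""
    return spec_net(torch.cat([x, policy_net(x)], dim=-1))

def adversary(steps=50, lr=0.1):
    """Search by gradient descent for an input x where the spec fails."""
    x = torch.randn(1, X_DIM, requires_grad=True)
    adv_opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        adv_opt.zero_grad()
        spec_score(x).sum().backward()  # descend the score to seek a violation
        adv_opt.step()
    return x.detach()

for step in range(200):
    x_adv = adversary()
    opt.zero_grad()
    loss = -spec_score(x_adv).mean()  # train f to satisfy A on the worst input found
    loss.backward()
    opt.step()
```

The sketch makes the difficulty visible: nothing in the loop tells you why A should be satisfiable everywhere, and a local gradient-based adversary is exactly the kind of search you’d expect to miss a treacherous turn, which is why I don’t think we can state the rules of the game without going back and forth with that failure mode.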
I really like the idea of the unrestricted adversarial examples challenge. I wish the contest were more of a thing, and I think one of the main reasons it’s not is that most people are too intimidated to seriously attempt a defense. Maybe they’re right and the problem is too hopeless to even work on; I don’t know enough about the domain to really contradict experts there (and I also don’t follow the area closely, so I don’t know whether people are in fact working on it), but it definitely feels to me like it would be good to take a serious swing at that problem. I think that’s obviously going to require significant additional investment in data labeling and some other big projects that people may just not take on because they are big (which is probably a reasonable call for an academic). I kind of feel like the way you’d approach this problem if you just needed to get it done is pretty different from how academics normally approach this kind of thing, and what I want is more like someone just trying to get it done.