You say that, but then don’t provide any examples. I imagine readers just not thinking of any, and then moving on without feeling any more convinced.
I don’t imagine readers doing that if they are familiar with the field, but I’m happy to give a couple of simple examples, at least. Large parts of mechanistic interpretability are aimed at addressing concerns about future deceptive behavior. Goal misgeneralization was predicted and investigated as an instance of Goodhart’s law.
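To make the Goodhart point concrete, here is a toy sketch (my own illustration with made-up numbers, not drawn from any of the work above): when you hard-optimize a noisy proxy for what you actually care about, you select for the proxy’s error as much as for the underlying goal.

```python
import numpy as np

# Toy "regressional Goodhart" illustration (hypothetical numbers, not from any paper):
# we care about `true_value` but can only optimize a noisy proxy of it.
rng = np.random.default_rng(0)
true_value = rng.normal(size=10_000)                      # the thing we actually care about
proxy = true_value + rng.normal(scale=1.0, size=10_000)   # an imperfect measurement of it

best_by_proxy = int(np.argmax(proxy))
print("true value of the proxy-optimal option:", true_value[best_by_proxy])
print("best achievable true value:            ", true_value.max())
# Hard optimization of the proxy selects heavily for its error term,
# so the chosen option scores noticeably worse than the true optimum.
```

Goal misgeneralization is the learned-policy version of this failure: the training signal is the proxy, and optimizing it hard need not track the goal we intended.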
Overall, I think that it’s hard for people to believe agent foundations will be useful because they’re not visualizing any compelling concrete path where it makes a big difference.
Who are the people you’re referring to? I’m happy to point to this paper for an explanation, but I didn’t think anyone on LessWrong would really not understand why people thought agent foundations were potentially important as a research agenda. I was defending the further point that we should invest in them even if we think other approaches are critical in the near term.
I’m still not sure I buy the examples. In the early parts of the post you seem to contrast ‘machine learning research agendas’ with ‘foundational and mathematical’/‘agent foundations’ type stuff. Mechanistic interpretability can be quite mathematical, but surely it falls into the former category? That is, it is essentially ML work rather than an example of people doing “mathematical and foundational” work.
I can’t say much about the Goodhart’s law comment, but it seems at best unclear that its link to goal misgeneralization is an example of the kind you are looking for (i.e. absent much more concrete detail, I have approximately no reason to believe it has anything to do with what one would call mathematical work).
I’m not really clear what you mean by not buying the example. You certainly seem to understand the distinction I’m drawing—mechanistic interpretability is definitely not what I mean by “mathematical AI safety,” though I agree there is math involved.
And I think the work on goal misgeneralization was conceptualized in ways directly related to Goodhart, and this type of problem has inspired a number of research projects, including quantilizers, which is certainly agent-foundations work. I’ll point here for more places where the agent foundations people think it is relevant.
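Since I mentioned quantilizers: here is a minimal sketch of the core idea (my own toy Python version under my reading of the proposal, not anyone’s actual implementation). The point is to deliberately stop short of argmaxing a proxy objective, which is exactly a response to Goodhart-style failures.

```python
import numpy as np

def quantilize(actions, base_probs, proxy_utility, q=0.1, rng=None):
    """Toy quantilizer: instead of taking the argmax of a (possibly Goodhart-able)
    proxy utility, sample from a trusted base distribution restricted to the
    top-q fraction of actions (by base-measure mass) ranked by that proxy."""
    rng = rng or np.random.default_rng()
    u = np.array([proxy_utility(a) for a in actions])
    p = np.asarray(base_probs, dtype=float)

    order = np.argsort(-u)                      # actions sorted best-first by the proxy
    keep = order[np.cumsum(p[order]) <= q]      # top-q quantile of the base measure
    if keep.size == 0:                          # keep at least the single best action
        keep = order[:1]

    probs = p[keep] / p[keep].sum()             # renormalized base distribution
    return actions[int(rng.choice(keep, p=probs))]
```

With q = 1 this is just the base policy, and as q approaches 0 it approaches the argmax; the intermediate regime bounds how much the policy can exploit errors in the proxy.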
Ah OK, I think I’ve worked out where some of my confusion is coming from: I don’t really see any argument for why mathematical work may be useful, relative to other kinds of foundational conceptual work. e.g. you write (with my emphasis): “Current mathematical research could play a similar role in the coming years...” But why might it? Isn’t that where you need to be arguing?
The examples seem to be cases where people have done some kind of conceptual foundational work which has later gone on to influence or inspire ML work. But the early work on deception or Goodhart was not mathematical work; that’s why I don’t understand how these are examples.
I think the dispute here is that you’re interpreting “mathematical” too narrowly: almost all of the work happening in agent foundations and similar areas is exactly what was being done under “mathematical AI research” 5-7 years ago. The argument was that those approaches have been fruitful, and we should expect them to continue to be so. If you want to call that “foundational conceptual research” instead of “mathematical AI research,” that’s fine.