(I realize this is straying pretty far from the intent of this post, so feel free to delete this comment)
I totally agree that a non-trivial portion of DeepMind’s work (and especially my work) is in the “make legit” category, and I stand by that as a good thing to do, but putting all of it there seems pretty wild. Going off of a list I previously wrote about DeepMind work (this comment):
We do a lot of stuff; e.g., of the things you've listed, the Alignment / Scalable Alignment Teams have done at least some work on the following since I joined in late 2020:
Eliciting latent knowledge (see ELK prizes, particularly the submission from Victoria Krakovna & Vikrant Varma & Ramana Kumar)
(Note that since then the mechanistic interpretability team published Tracr.)
Of this, I think “examples of goal misgeneralization” is primarily “make alignment legit”, while everything else is about making progress on alignment. (I see the conceptual progress towards specifically naming and describing goal misgeneralization as progress on alignment, but that was mostly finished within the community by the time we were working on the examples.)
(Some of the LLM alignment work and externalized reasoning oversight work has aspects of “making alignment legit” but it also seems like progress on alignment—in particular I think I learn new empirical facts about how well various techniques work from both.)
I think the actual crux here is how useful the various empirical projects are, where I expect you (and many others) think “basically useless” while I don’t.
In terms of fraction of effort allocated to “make alignment legit”, I think it’s currently about 10% of the Alignment and Scalable Alignment teams, and it was more like 20% while the goal misgeneralization project was going on. (This is not counting LLM alignment and externalized reasoning oversight as “make alignment legit”.)
I mean, I think my models here come literally from conversations with you, where I am pretty sure you have said things like (paraphrased) “basically all the work I do at DeepMind and the work of most other people I work with at DeepMind is about ‘trying to demonstrate the difficulty of the problem’ and ‘convincing other people at DeepMind the problem is real’”.
Inasmuch as you are now claiming that is only 10-20% of the work, that would be extremely surprising to me, and I do think it would be in pretty direct contradiction with other things we have talked about.
Like, yes, of course if you want to do field-building and want to get people to think AI Alignment is real, you will also do some alignment research. But I am talking about the balance of motivations, not the total balance of work. My sense is most of the motivation for people at the DeepMind teams comes from people thinking about how to get other people at DeepMind to take AI Alignment seriously. I think that’s a potentially valuable goal, but it is also the kind of goal that often gets presented as if the person were just trying to make direct progress on the problem.
Hmm, this is surprising. Some claims I might have made that could have led to this misunderstanding, in order of plausibility:
[While I was working on goal misgeneralization] “Basically all the work that I’m doing is about convincing other people that the problem is real”. I might have also said something like “and most people I work with” intending to talk about my collaborators on goal misgeneralization rather than the entire DeepMind safety team(s); for at least some of the time that I was working on goal misgeneralization I was an individual contributor so that would have been a reasonable interpretation.
“Most of my past work hasn’t made progress on the problem”—this would be referring to papers that I started working on before believing that scaled up deep learning could lead to AGI without additional insights, which I think ended up solving the wrong problem because I had a wrong model of what the problem was. (But I wouldn’t endorse “I did this to make alignment legit”, I was in fact trying to solve the problem as I saw it.) (I also did lots of conceptual work that I think did make progress but I have a bad habit of using phrases like “past work” to only mean papers.)
“[Particular past work] didn’t make progress on the problem, though it did explain a problem well”—seems very plausible that I said this about some past DeepMind work.
I do feel pretty surprised if, while I was at DeepMind, I ever intended to make the claim that most of the DeepMind safety team(s) were doing work based on a motivation that was primarily about demonstrating difficulty / convincing other people. (Perhaps I intentionally made such a claim while I wasn’t at DeepMind; seems a lot easier for me to have been mistaken about that before I was actually at DeepMind, but honestly I’d still be pretty surprised.)
My sense is most of the motivation for people at the DeepMind teams comes from people thinking about how to get other people at DeepMind to take AI Alignment seriously.
Idk how you would even theoretically define a measure for this that I could put numbers on, but I feel like if you somehow did do it, I’d probably think it was <50% and >10%.
[While I was working on goal misgeneralization] “Basically all the work that I’m doing is about convincing other people that the problem is real”. I might have also said something like “and most people I work with” intending to talk about my collaborators on goal misgeneralization rather than the entire DeepMind safety team(s); for at least some of the time that I was working on goal misgeneralization I was an individual contributor so that would have been a reasonable interpretation.
This seems like the most likely explanation. Decent chance I interpreted “and most people I work with” as referring to the rest of the DeepMind safety team.
I still feel confused about some stuff, but I am happy to let things stand here.
fyi, your phrasing here is different from how I initially interpreted “make AI safety seem legit”.
like there are maybe a few things someone might mean if they say “they’re working on AI Alignment research”:
1. they are pushing forward the state of the art of deep alignment understanding
2. they are orienting to the existing field of alignment research / upskilling
3. they are conveying to other AI researchers “here is what the field of alignment is and why it’s important”
4. they are trying to make AI alignment feel high status, so that they feel safe in their career and social network, while also getting to feel important
(and of course people can be doing a mixture of the above, or a 5th option I didn’t list)
I interpreted you initially as saying #4, but it sounds like you/Rohin here are talking about #3. There are versions of #3 that are secretly just #4 without much theory-of-change, but, idk, I think Rohin’s stated goal here is just pretty reasonable and definitely something I want in my overall AI Alignment Field portfolio. I agree you should avoid accidentally conflating it with #1.
(i.e. this seems related to a form of research-debt, albeit focused on bridging the gap between one field and another, rather than improving intra-field research debt)
Yep, I am including 3 in this. I also think this is something pretty reasonable for someone in the field to do, but when most of your field is doing that, I think quite crazy and bad things happen, and it’s also very easy to slip into doing 4 instead.