Hmm, this is surprising. Some claims I might have made that could have led to this misunderstanding, in order of plausibility:
[While I was working on goal misgeneralization] “Basically all the work that I’m doing is about convincing other people that the problem is real”. I might have also said something like “and most people I work with”, intending to talk about my collaborators on goal misgeneralization rather than the entire DeepMind safety team(s); for at least some of the time that I was working on goal misgeneralization I was an individual contributor, so that would have been a reasonable interpretation.
“Most of my past work hasn’t made progress on the problem”—this would be referring to papers that I started working on before believing that scaled up deep learning could lead to AGI without additional insights, which I think ended up solving the wrong problem because I had a wrong model of what the problem was. (But I wouldn’t endorse “I did this to make alignment legit”, I was in fact trying to solve the problem as I saw it.) (I also did lots of conceptual work that I think did make progress but I have a bad habit of using phrases like “past work” to only mean papers.)
“[Particular past work] didn’t make progress on the problem, though it did explain a problem well”—seems very plausible that I said this about some past DeepMind work.
I do feel pretty surprised if, while I was at DeepMind, I ever intended to make the claim that most of the DeepMind safety team(s) were doing work based on a motivation that was primarily about demonstrating difficulty / convincing other people. (Perhaps I intentionally made such a claim while I wasn’t at DeepMind; seems a lot easier for me to have been mistaken about that before I was actually at DeepMind, but honestly I’d still be pretty surprised.)
My sense is most of the motivation for people at the DeepMind teams comes from people thinking about how to get other people at DeepMind to take AI Alignment seriously.
Idk how you would even theoretically define a measure for this that I could put numbers on, but I feel like if you somehow did do it, I’d probably think it was <50% and >10%.
[While I was working on goal misgeneralization] “Basically all the work that I’m doing is about convincing other people that the problem is real”. I might have also said something like “and most people I work with”, intending to talk about my collaborators on goal misgeneralization rather than the entire DeepMind safety team(s); for at least some of the time that I was working on goal misgeneralization I was an individual contributor, so that would have been a reasonable interpretation.
This seems like the most likely explanation. Decent chance I interpreted “and most people I work with” as referring to the rest of the DeepMind safety team.
I still feel confused about some stuff, but I am happy to let things stand here.