I just don’t know much about what the [DeepMind] technical alignment work actually looks like right now
Note: I link to a bunch of stuff below in the context of the DeepMind safety team; this should be thought of as “things that particular people do” and may not represent the views of DeepMind or even just the DeepMind safety team.
We do a lot of stuff, e.g. of the things you’ve listed, the Alignment / Scalable Alignment Teams have done at least some work on the following since I joined in late 2020:
Eliciting latent knowledge (see ELK prizes, particularly the submission from Victoria Krakovna & Vikrant Varma & Ramana Kumar)
LLM alignment (lots of work discussed in the podcast with Geoffrey that you mentioned)
Scalable oversight (same as above)
Mechanistic interpretability (unpublished so far)
Externalized Reasoning Oversight (my guess is that this will be published soon) (EDIT: this paper)
Communicating views on alignment (e.g. the post you linked, the writing that I do on this forum is in large part about communicating my views)
Deception + inner alignment (in particular examples of goal misgeneralization)
Understanding agency (see e.g. discovering agents, most of Ramana’s posts)
In addition, we’ve done other stuff like:
Learning more safely when doing RL
Addressing reward tampering with decoupled approval
Understanding agent incentives with CIDs
I’m probably forgetting a few others.
I think you can talk about the agendas of specific people on the DeepMind safety teams but there isn’t really one “unified agenda”.
Thank you for this thoughtful response; I didn’t know about most of these projects. I’ve linked this comment in the DeepMind section and made some modifications for clarity and to include a bit more.
This is useful to know.
Thanks, Thomas, for the helpful overview post! Great to hear that you found the AGI ruin opinions survey useful.
I agree with Rohin’s summary of what we’re working on. I would add “understanding / distilling threat models” to the list, e.g. “refining the sharp left turn” and “will capabilities generalize more”.
Some corrections for your overall description of the DM alignment team:
I would count ~20-25 FTE on the alignment + scalable alignment teams (this does not include the AGI strategy & governance team)
I would put DM alignment in the “fairly hard” bucket (p(doom) = 10-50%) for alignment difficulty, and the “mixed” bucket for “conceptual vs applied”
Sorry for the late response, and thanks for your comment; I’ve edited the post to reflect these corrections.
No worries! Thanks a lot for updating the post.