Rohin Shah comments on AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work

Rohin Shah 3 Sep 2024 15:10 UTC
LW: 10 AF: 8
Google DeepMind does lots of work on safety practice, mostly by other teams. For example, Gemini Safety (mentioned briefly in the post) does a lot of automated red teaming. The AGI Safety & Alignment team has also contributed to safety practice work. GDM usually doesn’t publish about that work, mainly because it consists primarily of the operational work needed to translate existing research techniques into practice, which doesn’t really lend itself to paper publications.
I disagree that the AGI safety team should have 4 as its “bread and butter”. The majority of work needed to do safety in practice has little relevance to the typical problems tackled by AGI safety, especially misalignment. There certainly is some overlap, but in practice I would guess that a focus solely on 4 would cause around an order of magnitude slowdown in research progress. I do think it is worth doing to some extent from an AGI safety perspective, because (1) it provides empirical feedback loops, which can identify problems you would not have thought of otherwise, and (2) at some point we will have to put our research into practice, and it’s good to get some experience with that. But at least while models are still not that capable, I would not want it to be the main thing we do.
A couple of more minor points:
- I still basically believe the story from the 6-year-old debate theory, and see our recent work as telling us what we need to do on the journey to making our empirical work better match the theory. So I do disagree fairly strongly with the approach of “just hill climb on what works”—I think theory gives us strong reasons to continue working on debate.
- It’s not clear to me where empirical work for future problems would fit in your categorization (e.g. the empirical debate work). Is it “safety theory”? Imo this is an important category because it can get you a lot of the benefits of empirical feedback loops, without losing the focus on AGI safety.