I’m glad I ran this survey, and I expect the overall agreement distribution still roughly holds for the current GDM alignment team (or may have shifted somewhat in the direction of disagreement), though I haven’t rerun the survey, so I don’t really know. Looking back at the “possible implications for our work” section, we are working on basically all of these things.
Thoughts on some of the cruxes in the post based on last year’s developments:
Is global cooperation sufficiently difficult that AGI would need to deploy new powerful technology to make it work?
There has been a lot of progress on AGI governance this year, along with broad endorsement of the risks, so I feel somewhat more optimistic about global cooperation than I did a year ago.
Will we know how capable our models are?
The field has made some progress on designing concrete capability evaluations—how well they measure the properties we are interested in remains to be seen.
Will systems acquire the capability to be useful for alignment / cooperation before or after the capability to perform advanced deception?
At least so far, deception and manipulation capabilities seem to be lagging a bit behind usefulness for alignment (e.g. model-written evals / critiques, weak-to-strong generalization), but this could change in the future.
Is consequentialism a powerful attractor? How hard will it be to avoid arbitrarily consequentialist systems?
Current SOTA LLMs seem surprisingly non-consequentialist for their level of capability. I still expect LLMs to be one of the safest paths to AGI in terms of avoiding arbitrarily consequentialist systems.
I had hoped to see other groups run the survey as well; it looks like this didn’t happen, though a few people did ask me to share the template at the time. It would be particularly interesting if someone ran a version of the survey with separate ratings for “agreement with the statement” and “agreement with the implications for risk”.