Victoria Krakovna. Research scientist at DeepMind working on AI safety, and cofounder of the Future of Life Institute. Website and blog: vkrakovna.wordpress.com
Vika
No worries! Thanks a lot for updating the post
Thanks Richard for this post, it was very helpful to read! Some quick comments:
I like the level of technical detail in this threat model, especially the definition of goals and what it means to pursue goals in ML systems.
The architectural assumptions (e.g. the prediction & action heads) don’t seem load-bearing for any of the claims in the post, as they are never mentioned after they are introduced. It might be good to clarify that this is an example architecture and the claims apply more broadly.
Phases 1 and 2 seem to map to outer and inner alignment, respectively.
Supposing there is no misspecification in phase 1, do the problems in phase 2 still occur? “How likely is deceptive alignment” seems to argue that they may not occur, since a model that has perfect proxies when it becomes situationally aware would not then become deceptively aligned.
I’m confused why mechanistic interpretability is listed under phase 3 in the research directions—surely it would make the most difference for detecting the emergence of situational awareness and deceptive alignment in phase 2, while in phase 3 the deceptively aligned model will get around the interpretability techniques.
Thank you for the insightful post. What do you think are the implications of the simulator framing for alignment threat models? You claim that a simulator does not exhibit instrumental convergence, which seems to imply that the simulator would not seek power or undergo a sharp left turn. The simulated agents could exhibit power-seeking behavior or rapidly generalizing capabilities or try to break out of the simulation, but this seems less concerning than the top-level model having these properties, and we might develop alignment techniques specifically targeted at simulated agents. For example, a simulated agent might need some level of persistence within the simulation to execute these behaviors, and we may be able to influence the simulator to generate less persistent agents.
I would expect that the way Ought (or any other alignment team) influences the AGI-building org is by influencing the alignment team within that org, which would in turn try to influence the leadership of the org. I think the latter step in this chain is the bottleneck: across-organization influence between alignment teams is easier than within-organization influence. So if we estimate that Ought can influence other alignment teams with 50% probability, and the DM / OpenAI / etc. alignment team can influence the corresponding org with 20% probability, then the overall probability of Ought influencing the org that builds AGI is 10%. Your estimate of 1% seems too low to me unless you are a lot more pessimistic about alignment researchers influencing their organization from the inside.
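Spelling out that chained estimate (treating the two steps as roughly independent, which is an assumption on my part):

$$P(\text{Ought influences AGI org}) \approx P(\text{Ought} \to \text{alignment team}) \times P(\text{alignment team} \to \text{org leadership}) = 0.5 \times 0.2 = 0.1$$

For the overall estimate to come out at 1%, at least one of the per-step probabilities would have to be much lower than the numbers above.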
Thanks Thomas for the helpful overview post! Great to hear that you found the AGI ruin opinions survey useful.
I agree with Rohin’s summary of what we’re working on. I would add “understanding / distilling threat models” to the list, e.g. “refining the sharp left turn” and “will capabilities generalize more”.
Some corrections for your overall description of the DM alignment team:
I would count ~20-25 FTE on the alignment + scalable alignment teams (this does not include the AGI strategy & governance team)
I would put DM alignment in the “fairly hard” bucket (p(doom) = 10-50%) for alignment difficulty, and the “mixed” bucket for “conceptual vs applied”
This post resonates with me on a personal level, since my mother was really into mountain climbing in her younger years. She quit after seeing a friend die in front of her (another young woman who broke her neck against an opposing rock face in an unlucky fall). It seems likely I wouldn’t be here otherwise. Happy to report that she is still enjoying safer mountain activities 50 years later.
Correct. I think that doing internal outreach to build an alignment-aware company culture and building relationships with key decision-makers can go a long way. I don’t think it’s possible to have complete binding power over capabilities projects anyway, since the people who want to run the project could in principle leave and start their own org.
Hmm, thanks… Can you elaborate on what “this” refers to?
We don’t have the power to shut down projects, but we can make recommendations and provide input into decisions about them.
Thanks! For those interested in conducting similar surveys, here is a version of the spreadsheet you can copy (by request elsewhere in the comments).
Here is a spreadsheet you can copy. This one has a column for each person—if you want to sort the rows by agreement, you need to do it manually after people enter their ratings. I think it’s possible to automate this but I was too lazy.
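For anyone who wants to automate the sorting, here is a rough sketch of one way to do it outside the spreadsheet, assuming the ratings are exported as a CSV with one row per statement and one numeric column per person (the file name and layout are just illustrative assumptions):

```python
import pandas as pd

# Illustrative assumption: the survey ratings were exported as "ratings.csv",
# with one row per statement and one numeric column per respondent.
df = pd.read_csv("ratings.csv", index_col=0)

# Compute mean agreement per statement and sort statements from most to least agreed-with.
df["mean_agreement"] = df.mean(axis=1, numeric_only=True)
df_sorted = df.sort_values("mean_agreement", ascending=False)

# Write the sorted table back out (or paste it back into the spreadsheet).
df_sorted.to_csv("ratings_sorted.csv")
```

The same thing should be doable with a sort formula inside the spreadsheet itself; the export route just avoids fighting with live formulas while people are still entering ratings.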
Thanks, glad you found the post useful!
Maintaining uncertainty over the goal allows the system to model the set of goals that are consistent with the training data, notice when they disagree with each other out of distribution, and resolve that disagreement in some way (e.g. by deferring to a human).
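To make that concrete, here is a minimal sketch of the kind of mechanism I have in mind (the reward functions, the disagreement threshold, and the `defer_to_human` hook are all illustrative assumptions, not a description of any existing system):

```python
import numpy as np

def defer_to_human(state, actions):
    """Hypothetical escalation hook: a real system would hand the decision to a human."""
    raise NotImplementedError("ask a human which action to take")

def choose_action(candidate_reward_fns, actions, state, disagreement_threshold=0.1):
    """Act under a set of candidate goals that are all consistent with the training data.

    If the candidates disagree about which action is best (a rough proxy for being
    out of distribution), defer to a human instead of committing to one goal.
    """
    # Score every action under every candidate goal: shape (num_goals, num_actions).
    scores = np.array([[r(state, a) for a in actions] for r in candidate_reward_fns])

    # The goals "disagree" if they favour different actions and their value
    # estimates are spread out.
    best_action_per_goal = scores.argmax(axis=1)
    out_of_distribution = (
        len(set(best_action_per_goal)) > 1
        and scores.std(axis=0).max() > disagreement_threshold
    )
    if out_of_distribution:
        return defer_to_human(state, actions)

    # Otherwise the goals agree well enough: pick the action that is best on average.
    return actions[int(scores.mean(axis=0).argmax())]
```

The point is just that the set of goals consistent with training is represented explicitly, and disagreement among them is treated as a signal to escalate rather than something to resolve by arbitrarily picking one goal and maximizing it.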
Fixed
DeepMind alignment team opinions on AGI ruin arguments
Refining the Sharp Left Turn threat model, part 1: claims and mechanisms
Ah, I think you intended level 6 as an OR of learning from imitation / imagined experience, while I interpreted it as an AND. I agree that humans learn from imitation on a regular basis (e.g. at school). In my version of the hierarchy, learning from imitation and learning from imagined experience would be different levels (e.g. levels 6 and 7), because the latter seems a lot harder. In your decision theory example, I think a lot more people would be able to do the imitation part than the imagined experience part.
I think some humans are at level 6 some of the time (see “Humans Who Are Not Concentrating Are Not General Intelligences”). I would expect that learning cognitive algorithms from imagined experience is pretty hard for many humans (e.g. the examples in the Astral Codex post about conditional hypotheticals). But maybe I have a different interpretation of level 6 than what you had in mind?
This is an interesting hierarchy! I’m wondering how to classify humans and various current ML systems along this spectrum. My quick take is that most humans are at Levels 4-5, AlphaZero is at level 5, and GPT-3 is at level 4 with the right prompting. Curious if you have specific ML examples in mind for these levels.
Makes sense, thanks. I think the current version of the list is not a significant infohazard since the examples are well-known, but I agree it’s good to be cautious. (I tweeted about it to try to get more examples, but it didn’t get much uptake; happy to delete the tweet if you prefer.) Focusing on outreach to people who care about AI risk seems like a good idea; maybe it could be useful to nudge researchers who don’t currently work on AI safety because of long timelines to start working on it.
I would say the primary disagreement is epistemic—I think most of us would assign a low probability to a pivotal act defined as “a discrete action by a small group of people that flips the gameboard” being necessary. We also disagree on a normative level with the pivotal act framing, e.g. for reasons described in Critch’s post on this topic.