Victoria Krakovna. Research scientist at DeepMind working on AI safety, and cofounder of the Future of Life Institute. Website and blog: vkrakovna.wordpress.com
Vika
David had many conversations with Bengio about alignment during his PhD, and gets a lot of credit for Bengio taking AI risk seriously
Thanks Alex for the detailed feedback! I agree that learning a goal from the training-compatible set is a strong assumption that might not hold.
This post assumes a standard RL setup and is not intended to apply to LLMs (it’s possible some version of this result may hold for fine-tuned LLMs, but that’s outside the scope of this post). I can update the post to explicitly clarify this, though I was not expecting anyone to assume that this work applies to LLMs given that the post explicitly assumes standard RL and does not mention LLMs at all.
I agree that reward functions are not the best way to refer to possible goals. This post builds on the formalism in the power-seeking paper which is based on reward functions, so it was easiest to stick with this terminology. I can talk about utility functions instead (which would be equivalent to value functions in this case) but this would complicate exposition. I think it is pretty clear in the post that I’m not talking about reinforcement functions and the training reward is not the optimization target, but I could clarify this further if needed.
I find the idea of a training-compatible goal set useful for thinking about the possible utilities that are consistent with feedback received during training. I think utility functions are still the best formalism we have to represent goals, and I don’t have a clear sense of the alternative you are proposing. I understand what kind of object a utility function is, and I don’t understand what kind of object a value shard is. What is the type signature of a shard—is it a policy, a utility function restricted to a particular context, or something else? When you are talking about a “partial encoding of a goal in the network”, what exactly do you mean by a goal?
I would be curious what predictions shard theory makes about the central claim of this post. I have a vague intuition that power-seeking would be useful for most contextual goals that the system might have, so it would still be predictive to some degree, but I don’t currently see a way to make that more precise.
I’ve read a few posts on shard theory, and it seems very promising and interesting, but I don’t really understand what its claims and predictions are. I expect I will not have a good understanding or be able to apply the insights until there is a paper that makes the definitions and claims of this theory precise and specific. (Similarly, I did not understand your power-seeking theory work until you wrote a paper about it.) If you’re looking to clarify the discourse around RL processes, I believe that writing a definitive reference on shard theory would be the most effective way to do so. I hope you take the time to write one and I really look forward to reading it.
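To make the idea of a training-compatible goal set mentioned above a bit more concrete, here is a minimal toy sketch. It assumes a tiny finite state space, binary rewards, and a simplified notion of compatibility (a candidate reward function agrees with the training reward on every state visited during training). The function name, the state labels, and this exact compatibility criterion are illustrative assumptions, not the formal definition used in the post.

```python
from itertools import product


def training_compatible_goals(states, visited, training_reward, values=(0, 1)):
    """Enumerate candidate reward functions that match the training reward
    on every visited state (a simplified, assumed notion of compatibility)."""
    goals = []
    for assignment in product(values, repeat=len(states)):
        candidate = dict(zip(states, assignment))
        # Keep the candidate only if it agrees with the feedback seen in training.
        if all(candidate[s] == training_reward[s] for s in visited):
            goals.append(candidate)
    return goals


# Hypothetical toy setup: four states, only two of which appear in training.
states = ["s0", "s1", "s2", "s3"]
visited = ["s0", "s1"]
training_reward = {"s0": 0, "s1": 1}

goals = training_compatible_goals(states, visited, training_reward)
for goal in goals:
    print(goal)
```

In this toy setup the rewards on the unvisited states `s2` and `s3` are unconstrained, so several distinct goals remain consistent with the same training feedback. That is the sense in which the set describes the possible utilities the trained agent might have ended up with.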
Which definition / result are you referring to?
We expect that an aligned (blue-cloud) model would have an incentive to preserve its goals, though it would need some help from us to generalize them correctly to avoid becoming a misaligned (red-cloud) model. We talk about this in more detail in Refining the Sharp Left Turn (part 2).
Just added some more detail on this to the slides. The idea is that we have various advantages over the model during the training process: we can restart the search, examine and change beliefs and goals using interpretability techniques, choose exactly what data the model sees, etc.
[Linkpost] Some high-level thoughts on the DeepMind alignment team’s strategy
Power-seeking can be probable and predictive for trained agents
Thanks Alex for the detailed feedback! I have updated the post to fix these errors.
Curious if you have high-level thoughts about the post and whether these definitions have been useful in your work.
This post provides a maximally clear and simple explanation of a complex alignment scheme. I read the original “learning the prior” post a few times but found it hard to follow. I only understood how the imitative generalization scheme works after reading this post (the examples and diagrams and clear structure helped a lot).
This post helped me understand the motivation for the Finite Factored Sets work, which I was confused about for a while. The framing of agency as time travel is a great intuition pump.
I like this research agenda because it provides a rigorous framing for thinking about inductive biases for agency and gives detailed and actionable advice for making progress on this problem. I think this is one of the most useful research directions in alignment foundations since it is directly applicable to ML-based AI systems.
+1. This section follows naturally from the rest of the article, and I don’t see why it’s labeled as an appendix. The label seems likely to needlessly discourage people from reading it.
It’s great to hear that you have updated away from ambitious value learning towards corrigibility-like targets. It sounds like you now find it plausible that corrigibility will be a natural concept in the AI’s ontology, despite it being incompatible with expected utility maximization. Does this mean that you expect we will be able to build advanced AI that doesn’t become an expected utility maximizer?
I’m also curious how optimistic you are about the interpretability field being able to solve the empirical side of the abstraction problem in the next 5-10 years. Current interpretability work is focused on low-level abstractions (e.g. identifying how a model represents basic facts about the world) and extending the current approaches to higher-level abstractions seems hard. Do you think the current interpretability approaches will basically get us there or will we need qualitatively different methods?
I would consider goal generalization a component of goal preservation, and I agree this is a significant challenge for this plan. If the model is sufficiently aligned to the goal of being helpful to humans, then I would expect it to want feedback about how to generalize its goals correctly when it encounters ontological shifts.
Refining the Sharp Left Turn threat model, part 2: applying alignment techniques
Too bad that my list of AI safety resources didn’t make it into the survey—would be good to know to what extent it would be useful to keep maintaining it. Will you be running future iterations of this survey?
Threat Model Literature Review
Clarifying AI X-risk
I agree that a sudden gain in capabilities can make a simulated agent undergo a sharp left turn (coming up with more effective takeover plans is a great example). My original question was about whether the simulator itself could undergo a sharp left turn. My current understanding is that a pure simulator would not become misaligned if its capabilities suddenly increase because it remains myopic, so we only have to worry about a sharp left turn for simulated agents rather than the simulator itself. Of course, in practice, language models are often fine-tuned with RL, which creates agentic incentives on the simulator level as well.
You make a good point about the difficulty of identifying dangerous models if the danger is triggered by very specific prompts. I think this may go both ways though, by making it difficult for a simulated agent to execute a chain of dangerous behaviors, which could be interrupted by certain inputs from the user.
Great post! I especially enjoyed the intuitive visualizations for how the heavy-tailed distributions affect the degree of overoptimization of X.
As a possibly interesting connection, your set of criteria for an alignment plan can also be thought of as criteria for selecting a model specification that approximates the ideal specification well, especially trying to ensure that the approximation error is light-tailed.