Victoria Krakovna. Research scientist at DeepMind working on AI safety, and cofounder of the Future of Life Institute. Website and blog: vkrakovna.wordpress.com
I think it’s plausible that the alignment community could figure out how to build systems without power-seeking incentives, or with power-seeking tendencies limited to some safe set of options, by building on your formalization, so the retrospective rings true to me.
In addition, this work is useful for convincing ML people that alignment is hard, which helps lay the groundwork for coordinating the AI community to not build AGI. I’ve often pointed researchers at DM (especially RL people) to your power-seeking paper when trying to explain convergent instrumental goals (a formal NeurIPS paper makes a much better reference for that audience than “The Basic AI Drives”).
Thanks Alex for writing this. I think the social failure modes you described in the Mistakes section are all too common, and I’ve often found myself held back by these.
I agree that impact measures are not super useful for alignment (apart from deconfusion) and I’ve also moved on from working on this topic. Improving our understanding of power-seeking seems pretty useful though, so I’m curious why you wish you had stopped working on it sooner.
Thanks Eliezer for writing up this list, it’s great to have these arguments in one place! Here are my quick takes (which mostly agree with Paul’s response).
Section A (strategic challenges?):
Agree with #1-2 and #8. Agree with #3 in the sense that we can’t iterate in dangerous domains (by definition) but not in the sense that we can’t learn from experiments on easier domains (see Paul’s Disagreement #1).
Mostly disagree with #4 - I think that coordination not to build AGI (at least between Western AI labs) is difficult but feasible, especially after a warning shot. A single AGI lab that decides not to build AGI can produce compelling demos of misbehavior that can help convince other actors. A number of powerful actors coordinating not to build AGI could buy a lot of time, e.g. through regulation of potential AGI projects (auditing any projects that use a certain level of compute, etc) and stigmatizing deployment of potential AGI systems (e.g. if it is viewed similarly to deploying nuclear weapons).
Mostly disagree with the pivotal act arguments and framing (#6, 7, 9). I agree it is necessary to end the acute risk period, but I find it unhelpful when this is framed as “a pivotal act”, which assumes it’s a single action taken unilaterally by a small number of people or an AGI system. I think that human coordination (possibly assisted by narrow AI tools, e.g. auditing techniques) can be sufficient to prevent unaligned AGI from being deployed. While it’s true that a pivotal act requires power, and an AGI wielding this power would pose an existential risk, a group of humans + narrow AI wielding this power would not. This may require more advanced narrow AI than we currently have, so opportunities for pivotal acts that are not currently available could arise as we get closer to AGI.
Mostly disagree with section B.1 (distributional leap):
Agree with #10 - the distributional shift is large by default. However, I think there is a decent chance that we can monitor the increase in system capabilities and learn from experiments on less advanced systems, which would allow us to iterate alignment approaches to deal with the distributional shift.
Disagree with #11 - I think we can learn from experiments on less dangerous domains (see Paul’s Disagreement #15).
Uncertain on #13-14. I agree that many problems would most naturally first occur at higher levels of intelligence / in dangerous domains. However, we can discover these problems through thought experiments and then look for examples in less advanced systems that we would not have found otherwise (e.g. this worked for goal misgeneralization and reward tampering).
Mostly agree with B.2 (central difficulties):
Agree with #17 that there is currently no way to instill and verify specific inner properties in a system, though it seems possible in principle with more advanced interpretability techniques.
Agree with #21 that capabilities generalize further than alignment by default. Addressing this would require methods for modeling and monitoring system capabilities, which would allow us to stop training the system before capabilities start generalizing very quickly.
I mostly agree with #23 (corrigibility is anti-natural), though I think there are ways to make corrigibility more of an attractor, e.g. through utility uncertainty or detecting and penalizing incorrigible reasoning. Paul’s argument that corrigibility is a crisp property given good enough human feedback also seems compelling.
I agree with #24 that it’s important to be clear whether an approach is aiming for a sovereign or corrigible AI, though I haven’t seen people conflating these in practice.
Mostly disagree with B.3 (interpretability):
I think Eliezer is generally overly pessimistic about interpretability.
Agree with #26 that interpretability alone isn’t enough to build a system that doesn’t want to kill us. However, it would help to select against such systems, and would allow us to produce compelling demos of misalignment that help humans coordinate to not build AGI.
Agree with #27 that training with interpretability tools could also select for undetectable deception, but it’s unclear how much this is a problem in practice. It’s plausibly quite difficult to learn to perform undetectable deception without first doing a bunch of detectable deception that would then be penalized and selected against, producing a system that generally avoids deception.
Disagree with #30 - the argument that verification is much easier than generation is pretty compelling (see Paul’s Disagreement #19).
Disagree with #33 that an AGI system will have completely alien concepts / world model. I think this relies on the natural abstraction hypothesis being false, which seems unlikely.
Section B.4 (miscellaneous unworkable schemes) and Section C (civilizational inadequacy?):
Uncertain on these arguments, but they don’t seem load-bearing to me.
Hi Jeremy, glad that you found the post useful! The recording for the talk has just been uploaded—here it is.
Paradigms of AI alignment: components and enablers
Thanks Alex for the helpful recommendation! I got the book and read the first half. I’d like to do a declutter at some point; I’m still figuring out how to handle non-optional technologies (e.g. work Slack tends to be a major distraction for me, which is probably best coordinated with my team as a whole).
I think our proposal addresses the “simple steganography” problem, as described in “ELK prize results / First counterexample: simple steganography”:
By varying the phrasing and syntax of an answer without changing its meaning, a reporter could communicate large amounts of information to the auxiliary model. Similarly, there are many questions where a human is unsure about the answer and the reporter knows it. A reporter could encode information by answering each of these questions arbitrarily. Unless the true answers have maximum entropy, this strategy could encode more information than direct translation.
An entropy penalty on the reporter’s output would discourage the spurious variations in answers described above (assuming that steganographic output has higher entropy than the true answers). I agree that the general case of steganography is not addressed by simple approaches like an entropy penalty, e.g. “Harder counterexample: steganography using flexible degrees of freedom”.
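To make the idea concrete, here is a minimal sketch of my own (PyTorch-style; the names reporter_logits, target_ids and penalty_weight are illustrative, not from our submission) of how an entropy penalty could be added to the reporter’s training loss:

```python
import torch.nn.functional as F

def reporter_loss_with_entropy_penalty(reporter_logits, target_ids, penalty_weight=0.1):
    # reporter_logits: (batch, seq_len, vocab) logits over answer tokens.
    # target_ids: (batch, seq_len) intended answer tokens.
    # Standard answer loss.
    answer_loss = F.cross_entropy(
        reporter_logits.reshape(-1, reporter_logits.size(-1)),
        target_ids.reshape(-1),
    )
    # Entropy of the reporter's output distribution at each position.
    # Penalizing it discourages spreading probability mass over many
    # semantically equivalent phrasings, which is the channel that the
    # "simple steganography" counterexample relies on.
    log_probs = F.log_softmax(reporter_logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return answer_loss + penalty_weight * entropy
```

This only targets the low-effort channel of varying phrasing; as noted above, it does not address steganography that exploits more flexible degrees of freedom.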
ELK contest submission: route understanding through the human ontology
I generally endorse the claims made in this post and the overall analogy. Since this post was written, there are a few more examples I can add to the categories for slow takeoff properties.
Learning from experience
The UK procrastinated on locking down in response to the Alpha variant due to political considerations (not wanting to “cancel Christmas”), though it was known that timely lockdowns are much more effective.
Various countries reacted to Omicron with travel bans after they already had community transmission (e.g. Canada and the UK), while it was known that these measures would be ineffective.
Warning signs
Since there is a non-negligible possibility that covid-19 originated in a lab, the covid pandemic can be viewed as a warning sign about the dangers of gain of function research. As far as I know, this warning sign has not yet been acted upon (there is no significant new initiative to ban gain of function research).
I think there was some improvement at acting on warning signs for subsequent variants (e.g. I believe that measures in response to Omicron were generally taken faster than measures for original covid). This gives me some hope that our institutions can get better at reacting to warning signs with practice (at least warning signs that are similar to those they have encountered before). This suggests that dealing with narrow AI disasters could potentially lead institutions to improve their ability to respond to warning signs.
Consensus on the problem
It took a long time to reach consensus on the importance of mask wearing and aerosol transmission.
We still don’t seem to have widespread consensus that transmission through surfaces is insignificant, at least judging by the amount of effort that seems to go into disinfection and cleaning in various buildings that I visit.
Thank you Charlie for this inspiring post! I found the encouragement to explicitly aim for self-love very helpful.
After a bad year of responding to chronic stress with a lot of self-judgment, I realized that I need to reset my relationship with myself, and I have an intention for this year to get better at self-love and practice every day. I found some self-love meditations on Insight Timer and have been doing them daily for a few weeks. I’m definitely feeling better but it’s too early to tell if this is a stable change.
I found that meditation that specifically aims at self-love works better for me than loving-kindness meditation. The loving-kindness meditation recordings I have encountered usually focus on “wishing well” to yourself or others, which feels different and somehow more neutral than “sending love” or something like that. Connecting the intention with the breath and imagining that I’m breathing in love for myself and breathing out love for others also makes the meditation more powerful for me.
Great post! I don’t think Chris Olah’s work is a good example of non-transferable principles though. His team was able to make a lot of progress on transformer interpretability in a relatively short time, and I expect that there was a lot of transfer of skills and principles from the work on image nets that made this possible. For example, the idea of circuits and the “universality of circuits” principle seems to have transferred to transformers pretty well.
Really excited to read this sequence as well!
Ah I see, thanks for the clarification! The ‘bottle cap’ (block) example is robust to removing any one cell but not robust to adding cells next to it (as mentioned in Oscar’s comment). So most random perturbations that overlap with the block will probably destroy it.
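For concreteness, here is a minimal sketch of my own (using numpy; not from the post) checking that a block repairs itself after any single live cell is removed, but changes when a live cell is added next to it:

```python
import numpy as np

def step(board):
    # One Game of Life step; neighbour counts wrap around, which is fine
    # for a small test grid with an isolated pattern in the middle.
    nbrs = sum(np.roll(np.roll(board, i, axis=0), j, axis=1)
               for i in (-1, 0, 1) for j in (-1, 0, 1) if (i, j) != (0, 0))
    return ((nbrs == 3) | ((board == 1) & (nbrs == 2))).astype(int)

# A block (2x2 still life) in the middle of an 8x8 board.
block = np.zeros((8, 8), dtype=int)
block[3:5, 3:5] = 1

# Removing any one live cell: the block repairs itself in one step.
for (r, c) in [(3, 3), (3, 4), (4, 3), (4, 4)]:
    damaged = block.copy()
    damaged[r, c] = 0
    assert np.array_equal(step(damaged), block)

# Adding a live cell next to the block: the pattern is no longer a block.
perturbed = block.copy()
perturbed[3, 5] = 1
assert not np.array_equal(step(perturbed), block)
```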
Actually, we realized that if we consider an empty board an optimizing system, then any finite pattern is an optimizing system (because it’s similarly robust to adding non-viable collections of live cells), which is not very interesting. We have updated the post to reflect this.
Thanks for pointing this out! We realized that if we consider an empty board an optimizing system then any finite pattern is an optimizing system (because it’s similarly robust to adding non-viable collections of live cells), which is not very interesting. We have updated the post to reflect this.
The ‘bottle cap’ example would be an optimizing system if it were robust to cells colliding / interacting with it, e.g. being hit by a glider (as the eater is).
Optimization Concepts in the Game of Life
Thanks Aryeh for collecting these! I added them to a new Project Ideas section in my AI Safety Resources list.
Is this reading group still running? I’m wondering whether to point people to it.
+1 to everything Jacob said about living near London, plus the advantages of being near an existing AI safety hub (DeepMind, FHI, etc).
Really excited to see this list, thanks for putting it together! I shared it with the DM safety community and tweeted about it here, so hopefully some more examples will come in. (Would be handy to have a short URL for sharing the spreadsheet btw.)
I can see several ways this list can be useful:
as an outreach tool (e.g. to convince skeptics that recursive self-improvement is real)
for forecasting AI progress
for coming up with specific strategies for slowing down AI progress
Curious whether you primarily intend this to be an outreach tool or a resource for AI forecasting / governance.