Allan Dafoe, head of the AI Governance team at FHI, has written up a research agenda. It's a very large document, and I haven't gotten around to reading all of it, but one of the things I would be most excited about is people in the comments quoting the most important snippets that give a big-picture overview of the agenda's content.
I thought this was quite a good operationalisation of how hard aligning advanced AI systems might be, taken from the conclusion of the overview of the technical landscape. (All of what follows is a direct quote, but it's not in quotation marks because the editor can't do that at the same time as bullet points.)
---
There are a broad range of implicit views about how technically hard it will be to make safe advanced AI systems. They differ on the technical difficulty of safe advanced AI systems, as well as risks of catastrophe, and rationality of regulatory systems. We might characterize them as follows:
Easy: We can, with high reliability, prevent catastrophic risks with modest effort, say 1-10% of the costs of developing the system.
Medium: Reliably building safe powerful systems, whether it be nuclear power plants or advanced AI systems, is challenging. Doing so costs perhaps 10% to 100% the cost of the system (measured in the most appropriate metric, such as money, time, etc.).
But incentives are aligned. Economic incentives are aligned so that companies or organizations will have correct incentives to build sufficiently safe systems. Companies don’t want to build bridges that fall down, or nuclear power plants that experience a meltdown.
But incentives will be aligned. Economic incentives are not perfectly aligned today, as we have seen with various scandals (oil spills, emissions fraud, financial fraud), but they will be after a few accidents lead to consumer pressure, litigation, or regulatory or other responses.[85]
But we will muddle through. Incentives are not aligned, and will never be fully. However, we will probably muddle through (get the risks small enough), as humanity has done with nuclear weapons and nuclear energy.
And other factors will strongly work against safety. Strong profit and power incentives, misperception, heterogeneous theories of safety, overconfidence and rationalization, and other pathologies conspire to deprive us of the necessary patience and humility to get it right. This view is most likely if there will not be evidence (such as recoverable accidents) from reckless development, and if the safety function is steep over medium level of inputs ("This would not be a hard problem if we had two years to work on it, once we have the system. It will be almost impossible if we don't.").
Hard or Near Impossible: Building a safe superintelligence is like building a rocket and spacecraft for a moon-landing, without ever having done a test launch. It costs greater than, or much greater than, 100% of development costs.[86]
We don’t know.
[85] This assumes that recoverable accidents occur with sufficient probability before non-recoverable accidents.
[86] Yudkowsky, Eliezer. “So Far: Unfriendly AI Edition.” EconLog | Library of Economics and Liberty, 2016. http://econlog.econlib.org/archives/2016/03/so_far_unfriend.html.
There's an interesting appendix listing desiderata for good AI forecasting, which includes the catchy phrase "epistemically temporally fractal" (a phrase I feel compelled to find a place to use in my life). The first three points are reminiscent of Zvi's recent post.
Most of the sections are just lists of interesting questions for further research, and those lists seem fairly comprehensive. The section on avoiding arms races does more to break up the conceptual space: in particular, the third and fourth paragraphs distill the basic models around these topics in a way I found useful. My guess is that this section is most representative of Allan Dafoe's future work.