What I talk about when I talk about AI x-risk: 3 core claims I want machine learning researchers to address.
Recently, at PCSOCMLx, I (co-)hosted a session with the goal of explaining, debating, and discussing what I view as “the case for AI x-risk”. Specifically, my goal was/is to make the case for the “out-of-control AI killing everyone” type of AI x-risk, since many or most ML researchers already accept that there are significant risks from misuse of AI that should be addressed. I’m sharing my outline, since it might be useful to others, and in order to get feedback on it. Please tell me what you think it does right/wrong!
EDIT: I noticed I (and others I’ve spoken to about this) haven’t been clear enough about distinguishing CLAIMS and ARGUMENTS. I’m hoping to make this clearer in the future.
Some background/context
I estimate I’ve spent ~100-400 hours discussing AI x-risk with machine learning researchers during the course of my MSc and PhD. My current impression is that rejection of AI x-risk by ML researchers is mostly due to a combination of:
Misunderstanding of what I view as the key claims (e.g. believing “the case for x-risk hinges on short-timelines and/or fast take-off”).
Ignorance of the basis for AI x-risk arguments (e.g. no familiarity with the argument from instrumental convergence).
Different philosophical groundings (e.g. not feeling able/compelled to try to reason using probabilities and expected value; not valuing future lives very much; an unexamined apparent belief that current “real problems” should always take precedence over future “hypothetical concerns”, resulting in “whataboutism”).
I suspect that ignorance about the level of support for AI x-risk concerns among other researchers also plays a large role, but it’s less clear… I think people don’t like to be seen to be basing their opinions on other researchers’. Underlying all of this seems to be a mental move of “outright rejection” based on AI x-risk failing many powerful and useful heuristics. AI x-risk is thus commonly viewed as a Pascal’s mugging: “plausible” but not plausible enough to compel any consideration or action. A common attitude is that AI take-over has a “0+epsilon” chance of occurring.

I’m hoping that being more clear and modest in the claims I/we aim to establish can help move discussions with researchers forward. I’ve recently been leaning heavily on the unpredictability of the future and making ~0 mention of my own estimates about the likelihood of AI x-risk, with good results.
The 3 core claims:
1) The development of advanced AI increases the risk of human extinction (by a non-trivial amount, e.g. 1%), for the following reasons:
Goodhart’s law
Instrumental goals
Safety-performance trade-offs (e.g. capability control vs. motivation control)
2) To mitigate this existential risk (x-risk), we need progress in 3 areas:
Knowing how to build safe systems (“control problem”)
Knowing that we know how to build safe systems (“justified confidence”)
Preventing people from building unsafe systems (“global coordination”)
3) Mitigating AI x-risk seems like an ethical priority because it is:
high impact
neglected
challenging but tractable
Reception:
Unfortunately, only 3 people showed up to our session (despite something like 30 expressing interest), so I didn’t learn too much about the effectiveness of this presentation. My 2 main take-aways are:
Somewhat unsurprisingly, claim 1 had the least support. While I find this claim and the supporting arguments quite compelling and intuitive, there seem to be inferential gaps that I struggle to address quickly/easily. A key sticking point seems to be the lack of a highly plausible concrete scenario. I think it might also require more discussion of epistemics in order to move people from “I understand the basis for concern” to “I believe there is a non-trivial chance of an out-of-control AI killing everyone”.
The phrase “ethical priority” raises alarm bells for people, and should be replaced or clarified. Once I clarified that I meant it in the same way as “combating climate change is an ethical priority”, people seemed to accept it.
Some more details on the event:
The title for our session was: The case for AI as an existential risk, and a call for discussion and debate. Our blurb was: A growing number of researchers are concerned about scenarios in which machines, instead of people, control the future. What is the basis for these concerns, and are they well-founded? I believe they are, and we have an obligation as a community to address them. I can lead with a few minutes summarizing the case for that view. We can then discuss nuances, objections, and take-aways.

I also started with some basic background to make sure people understood the topic:
X-risk = risk of human extinction
The 3 kinds of risk (misuse, accident, structural)
The specific risk scenario I’m concerned with: out of control AI
Huh, I wonder what you think of a different way of splitting it up. Something like:
It’s a scientific possibility to have AI that’s on average better than humanity at the class of tasks “choose actions that achieve a goal in the real world.” Let’s label this by some superlative jargon like “superintelligent AI.” Such a technology would be hugely impactful.
It would be really bad if a superintelligent AI was choosing actions to achieve some goal, but this goal wasn’t beneficial to humans. There are several open problems that this means we need to solve before safely turning on any such AI.
We know enough that we can do useful work on (most of) these open problems right now. Arguing for this also implies that superintelligent AI is close enough (if not in years, then in “number of paradigm shifts”) that this work needs to start getting done.
We would expect a priori that work on these open problems of beneficial goal design should be under-prioritized (public goods problem, low immediate profit, not obvious you need it before you really need it). And indeed that seems to be the case (insert NIPS survey here), though there’s work going on at nonprofits that have different incentives. So consider thinking about this area if you’re looking for things to research.
I’m definitely interested in hearing other ways of splitting it up! This is one of the points of making this post. I’m also interested in what you think of the ways I’ve done the breakdown! Since you proposed an alternative, I guess you might have some thoughts on why it could be better :)
I see your points as being directed more at increasing ML researchers’ respect for AI x-risk work and their likelihood of doing relevant work. Maybe that should in fact be the goal. It seems to be a more common goal.
I would describe my goal (with this post, at least, and probably with most conversations I have with ML people about x-risk) as something more like: “get them to understand the AI safety mindset, and where I’m coming from; get them to really think about the problem and engage with it”. I expect a lot of people here would reason, in a very narrow and myopic consequentialist way, that this is not as good a goal, but I’m unconvinced.
Well, you mentioned that a lot of people were getting off the train at point 1. My comment can be thought of as giving a much more thoroughly inside-view look at point 1, and deriving other stuff as incidental consequences.
I’m mentally working with an analogy to teaching people a new contra dance (if you don’t know what contra dancing is, I’m just talking about some sequence of dance moves). The teacher often has an abstract view of expression and flow that the students lack, and there’s a temptation for the teacher to try to share that view with the students. But the students don’t want abstractions; what they want is concrete steps to follow, and good dancers will dance the dance just fine without ever hearing about the teacher’s abstract view. Before dancing they regard the abstractions as difficult to understand and distracting from the concrete instructions; they’ll be much more equipped to understand and appreciate them *after* dancing the dance.
IMO, this is a better way of splitting up the argument that we should be funding AI safety research than the one presented in the OP. My only gripe is with point 2. Many would argue that it wouldn’t be really bad, for a variety of reasons, such as that there are likely to be other ‘superintelligent AIs’ working in our favour. Alternatively, if the decision making were only marginally better than a human’s, it wouldn’t be any worse than a small group of people working against humanity.
TBC, I’m definitely NOT thinking of this as an argument for funding AI safety.
Planned summary for the Alignment Newsletter:
Planned opinion:
IMO coming up with highly plausible concrete scenarios should be a major priority of people working on AI safety. It seems both very useful for getting other researchers involved, and also very useful for understanding the problem and making progress.
In terms of talking to other researchers, in-person conversations like the ones you’re having seem like a great way to feel things out before writing public documents.
Admittedly, this is what shminux objected to. Beforehand, I would have expected more resistance based on people already believing the future is uncertain, which casts doubt on claim 2 and especially the “tractable” part of claim 3. If I had to steelman such views, they might sound something like: ‘The way to address this problem is to make sure sensible people are in charge, and a prerequisite for being sensible is not giving weird-sounding talks for 3 people.’
How sure are you that the people who showed up were objecting out of deeply-held disagreements, and not out of a sense that objections are good?
TBC, it’s an unconference, so it wasn’t really a talk (although I did end up talking a lot :P).
Seems like a false dichotomy. I’d say people were mostly disagreeing out of not-very-deeply-held-at-all disagreements :)
This is where I call BS. Even the best calibrated people are not accurate at the margins. They probably cannot tell 1% from 0.1%. The rest of us can’t reliably tell 1% from 0.00001% or from 10%. If you are in doubt, ask those who self-calibrate all the time and are good at it (Eliezer? Scott? Anna? gwern?) how accurate their 1% predictions are.
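To put a rough number on this point: under a simple two-standard-error criterion (my own back-of-the-envelope sketch, not from the comment; the function name `n_needed` and the criterion are illustrative assumptions), distinguishing a true 1% hit rate from 0.1% empirically takes on the order of a thousand resolved predictions, while distinguishing 10% from 1% takes only on the order of a hundred:

```python
from math import sqrt

def n_needed(p1, p2, z=1.96):
    """Roughly how many resolved predictions are needed before a true
    hit rate of p1 is empirically distinguishable from p2, using a
    two-standard-error criterion (normal approximation to the binomial).
    This is an illustrative heuristic, not a rigorous power analysis."""
    se = lambda p, n: sqrt(p * (1 - p) / n)
    n = 1
    # Grow n until the gap between the rates exceeds z times the
    # combined standard errors of the two estimated proportions.
    while abs(p1 - p2) < z * (se(p1, n) + se(p2, n)):
        n += 1
    return n

print(n_needed(0.01, 0.001))  # roughly on the order of a thousand
print(n_needed(0.10, 0.01))   # roughly on the order of a hundred
```

The order-of-magnitude gap is the point: almost nobody has a track record of hundreds of resolved 1%-style predictions, which is why calibration at that margin is so hard to verify.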
Also notice your motivated cognition. You are not trying to figure out whether your views are justified, but how to convince those ignorant others that your views are correct.
I think capybaralet meant ≥1%.
I don’t think your last paragraph is fair; doing outreach / advocacy, and discussing it, is not particularly related to motivated cognition. You don’t know how much time capybaralet has spent trying to figure out whether their views are justified; you’re not going to get a whole life story in an 800-word blog post.
There is such a thing as talking to an ideological opponent who has spent no time thinking about a topic and has a dumb opinion that could not survive 5 seconds of careful thought. We should still be good listeners, not be condescending, etc., because that’s just the right way to talk to people; but realistically we’re probably not going to learn anything new (about this specific topic) from such a person, let alone change our own minds (assuming we’ve already deeply engaged with both sides of the issue).
On the other hand, when talking to an ideological opponent who has spent a lot of time thinking about an issue, we may indeed learn something or change our mind, and I’m all for being genuinely open-minded and seeking out and thinking hard about such opinions. But I think that’s not the main topic of this little blog post.
No, my goal is to:
Identify a small set of beliefs to focus discussions around.
Figure out how to make the case for these beliefs quickly, clearly, persuasively, and honestly.
And yes, I did mean >1%, but I just put that number there to give people a sense of what I mean, since “non-trivial” can mean very different things to different people.
That number was presented as an example (“e.g.”) - but more importantly, all the numbers in the range you offer here would argue for more AI alignment research! What we need to establish, naively, is that the probability is not super-exponentially low for a choice between ‘inter-galactic civilization’ and ‘extinction of humanity within a century’. That seems easy enough if we can show that nothing in the claim contradicts established knowledge.
I would argue the probability for this choice existing is far in excess of 50%. As examples of background info supporting this: Bayesianism implies that “narrow AI” designs should be compatible on some level; we know the human brain resulted from a series of kludges; and the superior number of neurons within an elephant’s brain is not strictly required for taking over the world. However, that argument is not logically necessary.
(Technically you’d have to deal with Pascal’s Mugging. However, I like Hansonian adjustment as a solution, and e.g. I doubt an adult civilization would deceive its people about the nature of the world.)