[Note: In this interview, Rohin was only considering risks arising because of AI systems that try to optimize for goals that are not our own, not other forms of existential risks from AI.]
Ok, I’m curious how likely you think it is that an (existential-level) bad outcome happens due to AI by default, without involving human extinction.
I mostly want to punt on this question, because I’m confused about what “actual” values are. I could imagine operationalizations where I’d say > 90% chance (e.g. if our “actual” values are the exact thing we would settle on after a specific kind of reflection that we may not know about right now), and others where I’d assign ~0% chance (e.g. the extremes of a moral anti-realist view).
I do lean closer to the stance of “whatever we decide based on some ‘reasonable’ reflection process is good”, which seems to encompass a wide range of futures, and seems likely to me to happen by default.
ETA: Also, what was your motivation for talking about a fairly narrow kind of AI risk, when the interviewer started with a more general notion?
I mean, the actual causal answer is “that’s what I immediately thought about”, it wasn’t a deliberate decision. But here are some rationalizations after the fact, most of which I’d expect are causal in that they informed the underlying heuristics that caused me to immediately think of the narrow kind of AI risk:
My model was that the interviewer(s) were talking about the narrow kind of AI risk, so it made sense to talk about that.
Initially, there wasn’t any plan for this interview to be made public, so I was less careful about making myself broadly understandable and instead tailored my words to my “audience” of 3 people.
I mostly think about and have expertise on the narrow kind (adversarial optimization against humans).
I expect that technical solutions are primarily important only for the narrow kind of AI risk (I’m more optimistic about social coordination for the general kind). So when I’m asked a question positing “without additional intervention by us doing safety research”, I tend to think of adversarial optimization, since that’s what I expect to be addressed by safety research.
I do lean closer to the stance of “whatever we decide based on some ‘reasonable’ reflection process is good”, which seems to encompass a wide range of futures, and seems likely to me to happen by default.
I think I disagree pretty strongly, and this is likely an important crux. Would you be willing to read a couple of articles that point to what I think is convincing contrary evidence? (As you read the first article, consider what would have happened if the people involved had access to AI-enabled commitment or mind-modification technologies.)
http://devinhelton.com/2015/08/dictatorship-and-democracy (I think this is a really good article in its own right, but it is kind of long, so if you want to skip to the parts most relevant to the current discussion, search for “Tyranny in Maoist China” and “madness spiral”.)
http://devinhelton.com/historical-amnesia.html
If these articles don’t cause you to update, can you explain why? For example, do you think it would be fairly easy to design reflection/deliberation processes that would avoid these pathologies? What about future ones we don’t yet foresee?
… I’m not sure why I used the word “we” in the sentence you quoted. (Maybe I was thinking about a group of value-aligned agents? Maybe I was imagining that “reasonable reflection process” meant that we were in a post-scarcity world, everyone agreed that we should be doing reflection, everyone was already safe? Maybe I didn’t want the definition to sound like I would only care about what I thought and not what everyone else thought? I’m not sure.)
In any case, I think you can change that sentence to “whatever I decide based on some ‘reasonable’ reflection process is good”, and that’s closer to what I meant.
I am much more uncertain about multiagent interactions. Like, suppose we give every person access to a somewhat superintelligent AI assistant that is legitimately trying to help them. Are things okay by default? I lean towards yes, but I’m uncertain. I did read through those two articles, and I broadly buy the theses they advance; I still lean towards yes because:
Things have broadly become better over time, despite the effects that the articles above highlight. The default prediction is that they continue to get better. (And I very uncertainly think people from the past would agree, given enough time to understand our world?)
In general, we learn reasonably well from experience; we try things and they go badly, but then things get better as we learn from that.
Humans tend to be quite risk-averse at trying things, and groups of humans seem to be even more risk-averse. As a result, it seems unlikely that we try a thing that ends up having a “direct” existentially bad effect.
You could worry about an “indirect” existentially bad effect, along the lines of Moloch, where there isn’t any single human’s optimization causing bad things to happen, but selection pressure causes problems. Selection pressure has existed for a long time and hasn’t caused an existentially-bad outcome yet, so the default is that it won’t in the future.
Perhaps AI accelerates the rate of progress in a way where we can’t adapt fast enough, and this is why selection pressures can now cause an existentially bad effect. But this didn’t happen with the Industrial Revolution. (That said, I do find this more plausible than the other scenarios.)
But in fact I usually don’t aim to make claims about these sorts of scenarios; as I mentioned above I’m more optimistic about social solutions (that being the way we have solved this in the past).
Not on topic, but from the first article:
In reality, much of the success of a government is due to the role of the particular leaders, particular people, and particular places. If you have a mostly illiterate nation, divided 60%/40% into two tribes, then majoritarian democracy is a really, really bad idea. But if you have a homogeneous, educated, and savvy populace, with a network of private institutions, and a high-trust culture, then many forms of government will work quite well. Much of the purported success of democracy is really survivorship bias. Countries with the most human capital and strongest civic institutions can survive the chaos and demagoguery that comes with regular mass elections. Lesser countries succumb to chaos, and then dictatorship.
I don’t see why having a well-educated populace would protect you against the nigh-inevitable value drift of even well-intentioned leaders when they ascend to power in a highly authoritarian regime.
I agree that if you just had one leader with absolute power then it probably wouldn’t work, and that kind of government probably isn’t included in the author’s “many forms of government will work quite well”. I think what he probably has in mind are governments that look authoritarian from the outside but still have some kind of internal politics/checks-and-balances that can keep the top leader(s) from going off the rails. I wish I had a good gears-level model of how that kind of government/politics works, though. I do suspect that “work quite well” might be fragile/temporary and dependent on the top leaders not trying very hard to take absolute power for themselves, but I’m very uncertain about this due to lack of knowledge and expertise.
[Note: In this interview, Rohin was only considering risks arising because of AI systems that try to optimize for goals that are not our own, not other forms of existential risks from AI.]
“goals that are not our own” is ambiguous to me. Does it include a goal that someone currently thinks they have or behaves as if they have, but isn’t really part of their “actual” values? Does it include a goal that someone gets talked into by a superintelligent AI? Are you including risks that come from AI not being value-neutral, in other words, the AI being better at optimizing for some kinds of values over others, to the extent that the future is dominated by the goals of a small group of humans?
Also, you’ve been using “adversarial optimization” a lot in this thread, but a search on this site doesn’t show you as having defined or used it before, except in https://www.lesswrong.com/posts/9mscdgJ7ao3vbbrjs/an-70-agents-that-help-humans-who-are-still-learning-about, and that part wasn’t even written by you, so I’m not sure if you mean the same thing by it. If you have defined it somewhere, can you please link to it? (I suspect there may be some illusion of transparency going on where you think terms like “adversarial optimization” and “goals that are not our own” have clear and obvious meanings...)
I mostly want to punt on this question, because I’m confused about what “actual” values are. I could imagine operationalizations where I’d say > 90% chance (e.g. if our “actual” values are the exact thing we would settle on after a specific kind of reflection that we may not know about right now), and others where I’d assign ~0% chance (e.g. the extremes of a moral anti-realist view).
I think even with extreme moral anti-realism, there’s still a significant risk that AIs could learn values that are wrong enough (i.e., different enough from our values, or are otherwise misaligned enough) to cause an existential-level bad outcome, but not human extinction. Can you confirm that you really endorse the ~0% figure?
I expect that technical solutions are primarily important only for the narrow kind of AI risk (I’m more optimistic about social coordination for the general kind). So when I’m asked a question positing “without additional intervention by us doing safety research”, I tend to think of adversarial optimization, since that’s what I expect to be addressed by safety research.
Can I convince you that you should be uncertain enough about this, and that enough other people disagree with you about this (in particular, that social coordination may be hard enough that we should try to solve a wider kind of AI risk via technical means), that more careful language to distinguish between different kinds of risk and different kinds of research would be a good idea to facilitate thinking and discussion? (I take your point that you weren’t expecting this interview to be made public, so I’m just trying to build a consensus about what should ideally happen in the future.)
“goals that are not our own” is ambiguous to me. Does it include a goal that someone currently thinks they have or behaves as if they have, but isn’t really part of their “actual” values? Does it include a goal that someone gets talked into by a superintelligent AI?
“goals that are our own” is supposed to mean our “actual” values, which I don’t know how to define, but shouldn’t include a goal that you are “incorrectly” persuaded of by a superintelligent AI. The best operationalization I have is the values that I’d settle on after some “reasonable” reflection process. There are multiple “reasonable” reflection processes; the output of any of them is fine. But even this isn’t exactly right, because there might be some values that I end up having in the world with AI, that I wouldn’t have come across with any reasonable reflection process because I wouldn’t have thought about the weird situations that occur once there is superintelligent AI, and I still want to say that those sorts of values could be fine.
Are you including risks that come from AI not being value-neutral, in other words, the AI being better at optimizing for some kinds of values over others, to the extent that the future is dominated by the goals of a small group of humans?
I was not including those risks (if you mean a setting where there are N groups of humans with different values, but AI can only help M < N of them, and so those M values dominate the future instead of all N).
I suspect there may be some illusion of transparency going on where you think terms like “adversarial optimization” and “goals that are not our own” have clear and obvious meanings...
I don’t think “goals that are not our own” is philosophically obvious, but I think that it points to a fuzzy concept that cleaves reality at its joints, of which the central examples are quite clear. (The canonical example being the paperclip maximizer.) I agree that once you start really trying to identify the boundaries of the concept, things get very murky (e.g. what if an AI reports true information to you, causing you to adopt value X, and the AI is also aligned with value X? Note that since you can’t understand all information, the AI has necessarily selected what information to show you. I’m sure there’s a Stuart Armstrong post about this somewhere.)
By “adversarial optimization”, I mean that the AI system is “trying to accomplish” some goal X, while humans instead “want” some goal Y, and this causes conflict between the AI system and humans.
(I could make it sound more technical by saying that the AI system is optimizing some utility function, while humans are optimizing some other utility function, which leads to conflict between the two because of convergent instrumental subgoals. I don’t think this is more precise than the previous sentence.)
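To unpack that parenthetical, here is a rough sketch of what the utility-function phrasing would look like written out; the symbols $U_{\mathrm{AI}}$, $U_{H}$, and $\pi$ are illustrative labels introduced here, not notation from the conversation:
\[
\pi_{\mathrm{AI}} \in \arg\max_{\pi} \; \mathbb{E}\!\left[ U_{\mathrm{AI}} \mid \pi \right], \qquad U_{\mathrm{AI}} \neq U_{H},
\]
\[
\text{conflict:} \quad \mathbb{E}\!\left[ U_{H} \mid \pi_{\mathrm{AI}} \right] \;\ll\; \max_{\pi} \, \mathbb{E}\!\left[ U_{H} \mid \pi \right],
\]
with the gap driven by convergent instrumental subgoals: resources and influence acquired in pursuit of $U_{\mathrm{AI}}$ are generically unavailable for $U_{H}$.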
I think even with extreme moral anti-realism, there’s still a significant risk that AIs could learn values that are wrong enough (i.e., different enough from our values, or are otherwise misaligned enough) to cause an existential-level bad outcome, but not human extinction. Can you confirm that you really endorse the ~0% figure?
Oh, whoops, I accidentally estimated the answer to “(existential-level) bad outcome happens due to AI by default, without involving adversarial optimization”. I agree that you could get existential-level bad outcomes that aren’t human extinction due to adversarial optimization. I’m not sure how likely I find that; it seems to depend on what the optimal policy for a superintelligent AI is, and who knows whether that involves literally killing all humans. (Obviously, to be consistent with earlier estimates, it must be ≤ 10%.)
Can I convince you that you should be uncertain enough about this, and that enough other people disagree with you about this (in particular, that social coordination may be hard enough that we should try to solve a wider kind of AI risk via technical means), that more careful language to distinguish between different kinds of risk and different kinds of research would be a good idea to facilitate thinking and discussion? (I take your point that you weren’t expecting this interview to be made public, so I’m just trying to build a consensus about what should ideally happen in the future.)
Yeah, I do try to do this already. The note I quoted above is one that I asked to be added post-conversation for basically this reason. (It’s somewhat hard to do, though; my brain is pretty bad at keeping track of uncertainty that doesn’t come from an underlying inside-view model.)