[Note: In this interview, Rohin was only considering risks arising because of AI systems that try to optimize for goals that are not our own, not other forms of existential risks from AI.]
“goals that are not our own” is ambiguous to me. Does it include a goal that someone currently thinks they have or behaves as if they have, but isn’t really part of their “actual” values? Does it include a goal that someone gets talked into by a superintelligent AI? Are you including risks that come from AI not being value-neutral, in other words, the AI being better at optimizing for some kinds of values than others, to the extent that the future is dominated by the goals of a small group of humans?
Also, you’ve been using “adversarial optimization” a lot in this thread, but a search on this site doesn’t show you having defined or used it before, except in https://www.lesswrong.com/posts/9mscdgJ7ao3vbbrjs/an-70-agents-that-help-humans-who-are-still-learning-about — but that part wasn’t even written by you, so I’m not sure whether you mean the same thing by it. If you have defined it somewhere, can you please link to it? (I suspect there may be some illusion of transparency going on where you think terms like “adversarial optimization” and “goals that are not our own” have clear and obvious meanings...)
I mostly want to punt on this question, because I’m confused about what “actual” values are. I could imagine operationalizations where I’d say > 90% chance (e.g. if our “actual” values are the exact thing we would settle on after a specific kind of reflection that we may not know about right now), and others where I’d assign ~0% chance (e.g. the extremes of a moral anti-realist view).
I think even with extreme moral anti-realism, there’s still a significant risk that AIs could learn values that are wrong enough (i.e., different enough from our values, or otherwise misaligned enough) to cause an existential-level bad outcome that falls short of human extinction. Can you confirm that you really endorse the ~0% figure?
I expect that technical solutions are primarily important only for the narrow kind of AI risk (I’m more optimistic about social coordination for the general kind). So when I’m asked a question positing “without additional intervention by us doing safety research”, I tend to think of adversarial optimization, since that’s what I expect to be addressed by safety research.
Can I convince you that you should be uncertain enough about this, and that enough other people disagree with you (in particular, that social coordination may be hard enough that we should try to address the wider kind of AI risk via technical means), that more careful language distinguishing different kinds of risk and different kinds of research would help thinking and discussion? (I take your point that you weren’t expecting this interview to be made public, so I’m just trying to build a consensus about what should ideally happen in the future.)
“goals that are not our own” is ambiguous to me. Does it include a goal that someone currently thinks they have or behaves as if they have, but isn’t really part of their “actual” values? Does it include a goal that someone gets talked into by a superintelligent AI?
“goals that are our own” is supposed to mean our “actual” values, which I don’t know how to define, but it shouldn’t include a goal that you are “incorrectly” persuaded of by a superintelligent AI. The best operationalization I have is the values that I’d settle on after some “reasonable” reflection process; there are multiple “reasonable” reflection processes, and the output of any of them is fine. But even this isn’t exactly right, because there might be some values that I end up having in a world with AI that I wouldn’t have come across with any reasonable reflection process, because I wouldn’t have thought about the weird situations that occur once there is superintelligent AI, and I still want to say that those sorts of values could be fine.
Are you including risks that come from AI not being value-neutral, in other words, the AI being better at optimizing for some kinds of values than others, to the extent that the future is dominated by the goals of a small group of humans?
I was not including those risks (if you mean a setting where there are N groups of humans with different values, but AI can only help M < N of them, and so those M values dominate the future instead of all N).
I suspect there may be some illusion of transparency going on where you think terms like “adversarial optimization” and “goals that are not our own” have clear and obvious meanings...
I don’t think “goals that are not our own” is philosophically obvious, but I think that it points to a fuzzy concept that cleaves reality at its joints, of which the central examples are quite clear. (The canonical example being the paperclip maximizer.) I agree that once you start really trying to identify the boundaries of the concept, things get very murky (e.g. what if an AI reports true information to you, causing you to adopt value X, and the AI is also aligned with value X? Note that since you can’t understand all information, the AI has necessarily selected what information to show you. I’m sure there’s a Stuart Armstrong post about this somewhere.)
By “adversarial optimization”, I mean that the AI system is “trying to accomplish” some goal X, while humans instead “want” some goal Y, and this causes conflict between the AI system and humans.
(I could make it sound more technical by saying that the AI system is optimizing some utility function, while humans are optimizing some other utility function, which leads to conflict between the two because of convergent instrumental subgoals. I don’t think this is more precise than the previous sentence.)
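To make the utility-function framing above concrete, here is a minimal toy sketch (mine, not Rohin's; the utility functions, the single shared resource, and all numbers are hypothetical illustrations):

```python
# Minimal toy sketch of "adversarial optimization": two agents whose
# utility functions depend on how a shared, limited resource is split.
# The utility functions and numbers here are hypothetical illustrations.

TOTAL_RESOURCES = 100.0  # e.g. energy, compute, matter

def ai_utility(ai_share: float) -> float:
    # Goal X: the AI only values resources spent on X.
    return ai_share

def human_utility(ai_share: float) -> float:
    # Goal Y: humans only value resources spent on Y, i.e. whatever the AI leaves.
    return TOTAL_RESOURCES - ai_share

# The AI picks, from a coarse grid of possible splits, the one that
# maximizes its own utility.
ai_share = max((x / 10 * TOTAL_RESOURCES for x in range(11)), key=ai_utility)

print(f"AI-optimal split: AI gets {ai_share}, humans get {TOTAL_RESOURCES - ai_share}")
print(f"AI utility: {ai_utility(ai_share)}, human utility: {human_utility(ai_share)}")
# The AI takes everything and human utility is driven to zero, even though
# nothing in ai_utility refers to humans at all: the conflict comes purely
# from both goals drawing on the same instrumental resource.
```

The sketch abstracts away everything interesting; the point is only that conflict falls out of two different utility functions competing over the same instrumental resources (resource acquisition being a convergent instrumental subgoal), which matches the informal description above.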
I think even with extreme moral anti-realism, there’s still a significant risk that AIs could learn values that are wrong enough (i.e., different enough from our values, or otherwise misaligned enough) to cause an existential-level bad outcome that falls short of human extinction. Can you confirm that you really endorse the ~0% figure?
Oh, whoops, I accidentally estimated the answer to “(existential-level) bad outcome happens due to AI by default, without involving adversarial optimization”. I agree that you could get existential-level bad outcomes that aren’t human extinction due to adversarial optimization. I’m not sure how likely I find that; it seems to depend on what the optimal policy for a superintelligent AI is, and who knows whether that involves literally killing all humans. (Obviously, to be consistent with my earlier estimates, it must be ≤ 10%.)
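One way to spell out the consistency constraint in that last parenthetical (my gloss, taking the roughly 10% figure from the earlier estimate as given): the event being asked about is a special case of the event covered by the earlier estimate, so its probability can’t be larger.

```latex
% A = existential-level bad outcome due to adversarial optimization
%     (the event covered by the earlier estimate, roughly 10%)
% B = existential-level bad outcome short of human extinction,
%     due to adversarial optimization
% B \subseteq A, so by monotonicity of probability:
P(B) \le P(A) \approx 0.10
```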
Can I convince you that you should be uncertain enough about this, and that enough other people disagree with you (in particular, that social coordination may be hard enough that we should try to address the wider kind of AI risk via technical means), that more careful language distinguishing different kinds of risk and different kinds of research would help thinking and discussion? (I take your point that you weren’t expecting this interview to be made public, so I’m just trying to build a consensus about what should ideally happen in the future.)
Yeah, I do try to do this already. The note I quoted above is one that I asked to be added post-conversation for basically this reason. (It’s somewhat hard to do, though; my brain is pretty bad at keeping track of uncertainty that doesn’t come from an underlying inside-view model.)