Otherness and control in the age of AGI

(Cross-posted from my website. PDF of the full series here. Audio version of the full series here; or search for “Joe Carlsmith Audio” on your podcast app.)
“With malice towards none; with charity towards all; with firmness in the right, as God gives us to see the right…”

– Abraham Lincoln, Second Inaugural Address
I’ve written a series of essays that I’m calling “Otherness and control in the age of AGI.” The series examines a set of interconnected questions about how agents with different values should relate to one another, and about the ethics of seeking and sharing power. They’re old questions – but I think that we will have to grapple with them in new ways as increasingly powerful AI systems come online. And I think they’re core to some parts of the discourse about existential risk from misaligned AI (hereafter, “AI risk”).[1]
The series covers a lot of ground, but I’m hoping the individual essays can be read fairly well on their own. Here’s a brief summary of the essays that have been released thus far (I’ll update it as I release more):
The first essay, “Gentleness and the artificial Other,” discusses the possibility of “gentleness” towards various non-human Others – for example, animals, aliens, and AI systems. And it also highlights the possibility of “getting eaten,” in the way that Timothy Treadwell gets eaten by a bear in Werner Herzog’s Grizzly Man: that is, eaten in the midst of an attempt at gentleness.
The second essay, “Deep atheism and AI risk,” discusses what I call “deep atheism” – a fundamental mistrust both towards Nature, and towards “bare intelligence.” I take Eliezer Yudkowsky as a paradigmatic deep atheist, and I highlight the connection between his deep atheism and his concern about misaligned AI. I also connect deep atheism to the duality of “yang” (active, controlling) vs “yin” (receptive, letting-go). A lot of my concern, in the series, is about ways in which certain strands of the AI risk discourse can propel themselves, philosophically, towards ever-greater yang.
The third essay, “When ‘yang’ goes wrong,” expands on this concern. In particular: it discusses the sense in which deep atheism can prompt an aspiration to exert extreme levels of control over the universe; it highlights the sense in which both humans and AIs, on Yudkowsky’s narrative, are animated by this sort of aspiration; and it discusses some ways in which our civilization has built up wariness around control-seeking of this kind. I think we should be taking this sort of wariness quite seriously.
Pursuant to this goal, the fourth essay, “Does AI risk ‘other’ the AIs?”, examines Robin Hanson’s critique of the AI risk discourse – and in particular, his accusation that this discourse “others” the AIs, and seeks too much control over the values that steer the future. I find some aspects of Hanson’s critique uncompelling and implausible, but I do think he’s pointing at a real discomfort.
The fifth essay, “An even deeper atheism,” argues that this discomfort should deepen yet further when we bring some other Yudkowskian philosophical vibes into view – in particular, vibes related to the “fragility of value,” “extremal Goodhart,” and “the tails come apart.” These vibes, I suggest, create a certain momentum towards deeming more and more agents – including: human agents – “misaligned” in the sense of: not-to-be-trusted to optimize the universe very intensely according to their values-on-reflection. And even if we do not follow this momentum, I think it can remind us of the sense in which AI risk is substantially (though, not entirely) a generalization and intensification of the sort of “balance of power between agents with different values” problem we already deal with in the purely human world – a problem about which our existing ethical and political traditions already offer lots of guidance.
The sixth essay, “Being nicer than Clippy,” tries to draw on this guidance. In particular, it tries to point at the distinction between a paradigmatically “paperclip-y” way of being, and some broad and hazily-defined set of alternatives that I group under the label “niceness/liberalism/boundaries.” Too often, I think, a simplistic interpretation of the alignment discourse imagines that humans and AIs-with-different-values are both paperclippy at heart – except, only, with a different favored sort of “stuff.” I think this picture neglects core aspects of human ethics that are, themselves, about navigating precisely the sorts of differences-in-values that the possibility of misaligned AI forces us to grapple with. I think that attention to this part of human ethics can help us be better than the paperclippers we fear – not just in what we do with spare resources, but in how we relate to the distribution of power amongst a plurality of value systems more broadly. And I think it may have practical benefits as well, in navigating possible conflicts both between different humans, and between humans and AIs. That said, I don’t think that “niceness/liberalism/boundaries” is enough, on its own, to ensure a good future, or to allay all concern about trying to control that future over-much.
The seventh essay, “On the abolition of man,” examines another version of that concern: namely, C.S. Lewis’s argument (in his book The Abolition of Man) that attempts by moral anti-realists to influence the values of future people must necessarily be “tyrannical.” I mostly disagree with Lewis — and in particular, I think he makes a number of fairly basic philosophical mistakes related to e.g. compatibilism about freedom, to the difference between creating-Bob-instead-of-Alice vs. brainwashing-Alice-to-make-her-like-Bob, and to the sense in which moral anti-realists can retain their grip on morality. But I do think his discussion points at some difficult questions about the ethics of influencing the values of others, including AIs – questions the essay takes an initial stab at grappling with.
The eighth essay, “On green,” examines a philosophical vibe that I (following others) call “green,” and which I think contrasts in interesting ways with “deep atheism.” Green is one of the five colors on the Magic the Gathering Color Wheel, which I’ve found (despite not playing Magic myself) an interesting way of classifying the sort of energies that tend to animate people.[2] The colors, and their corresponding shticks-according-to-Joe, are: White = Morality; Blue = Knowledge; Black = Power; Red = Passion; and Green = … I haven’t found a single word that I think captures green, but associations include: environmentalism, tradition, spirituality, hippies, stereotypes of Native Americans, Yoda, humility, wholesomeness, health, and yin. The essay tries to bring the vibe that underlies these associations into clearer view, and to point at some ways that attempts by other colors to reconstruct green can miss parts of it. In particular, I focus on the way green cares about respect, in a sense that goes beyond “not trampling on the rights/interests of moral patients” (what I call “green-according-to-white”); and on the way green takes joy in (certain kinds of) yin, in a sense that contrasts with merely “accepting things you’re too weak to change” (what I call “green-according-to-black”).
The ninth essay, “On attunement,” continues the project of the previous essay, but with a focus on what I call “green-according-to-blue,” on which green is centrally about making sure that we act with enough knowledge. I think there’s something to this, but I also suggest that green cares especially about “attunement” – a kind of meaning-laden receptivity to the world – as opposed to more paradigmatically blue-like types of knowledge. What’s more, I think that attunement is core to certain kinds of ethical epistemology, including my own; and it plays a key role in my own vision, at least, of a “wise” future. And while attunement may, ultimately, be made out of red and blue, I think we should take it seriously on its own terms.
The tenth essay, “Loving a world you don’t trust,” closes the series with an effort to make sure I’ve given both yang and “deep atheism” their due, and been clear about my overall take. To this end, the first part of the essay praises certain types of yang directly, in an effort to avoid over-correction towards yin. The second part praises something quite nearby to deep atheism that I care about a lot – something I call “humanism.” And the third part tries to clarify the depth of atheism I ultimately endorse. In particular, I distinguish between trust in the Real, and various other attitudes towards it – attitudes like love, reverence, loyalty, and forgiveness. And I talk about ways these latter attitudes can still look the world’s horrors in the eye.
I’ll also note three caveats about the series as a whole. First, while I think it likely that ours is the age of AGI—still, maybe not. Maybe I won’t live to see the age that I wrote this series for. But I think that much of the content will be of interest regardless of your views on AGI timelines.
Second: the series is centrally an exercise in philosophy, but it also touches on some issues relevant to the technical challenge of ensuring that the AI systems we build do not kill all humans, and to the empirical question of whether our efforts in this respect will fail. And I confess to some worry about bringing the philosophical stuff too near to the technical/empirical stuff. In particular: my sense is that people are often eager, in discussions about AI risk, to argue at the level of grand ideological abstraction rather than brass-tacks empirics – and I worry that these essays will feed such temptations. This isn’t to say that philosophy is irrelevant to AI risk – to the contrary, part of my hope, in these essays, is to help us see more clearly the abstractions that move and shift underneath certain discussions of the issue. But we should be very clear about the distinction between affiliating with some philosophical vibe and making concrete predictions about the future. Ultimately, it’s the concrete-prediction thing that matters most;[3] and if the right concrete prediction is “advanced AIs have a substantive chance of killing all the humans,” you don’t need to do much philosophy to get upset, or to get to work. Indeed, particularly in AI, it’s easy to argue about philosophical questions over-much. Doing so can be distracting candy, especially if it lets you bounce off more technical problems. And if we fail on certain technical problems, we may well end up dead.
Third: even as the series focuses on philosophical stuff rather than technical/empirical stuff, it also focuses on a very particular strand of philosophical stuff – namely, a cluster of related philosophical assumptions and frames that I associate most centrally with Eliezer Yudkowsky, whose writings have done a lot to frame and popularize AI risk as an issue. And here, too, I worry about pushing the conversation in the wrong direction. That is: I think that Yudkowsky’s philosophical views are sufficiently influential, interesting, and fleshed-out that it’s worth interrogating them in depth. But I don’t want people to confuse their takes on Yudkowsky’s philosophical views (or his more technical/empirical views, or his vibe more broadly) for their takes on the severity of existential risk from AI more generally – and I worry these essays might prompt such a conflation. So please, remember: there are a very wide variety of ways to care about making sure that advanced AIs don’t kill everyone. Fundamentalist Christians can care about this; deep ecologists can care about this; solipsists can care about this; people who have no interest in philosophy at all can care about this. Indeed, in many respects, these essays aren’t centrally about AI risk in the sense of “let’s make sure that the AIs don’t kill everyone” (i.e., “AInotkilleveryoneism”) – rather, they’re about a set of broader questions about otherness and control that arise in the context of trying to ensure that the future goes well more generally. And what’s more, as I note in the series in various places, much of my interrogation of Yudkowsky’s views has to do with the sort of philosophical momentum they create in various directions, rather than with whether Yudkowsky in particular takes them there. In this sense, my concern is not ultimately with Yudkowsky’s views per se, but rather with a sort of abstracted existential narrative that I think Yudkowsky’s writings often channel and express – one that I think different conversations about advanced AI live within to different degrees, and which I hope to help us see more whole.
Dedicated to Walter Kaufmann.
Thanks to Katja Grace, Rebecca Kagan, Will MacAskill, Ketan Ramakrishnan, Anna Salamon, Carl Shulman, and many others over the years for conversation about these topics; thanks to Carl Shulman for written comments; and thanks to Sara Fish for formatting help. I am speaking only for myself and not for my employer.
There are lots of other risks from AI, too; but I want to focus on existential risk from misalignment, here, and I want the short phrase “AI risk” for the thing I’m going to be referring to repeatedly.
My relationship to the MtG Color Wheel is mostly via somewhat-reinterpreting Duncan Sabien’s presentation here, who credits Mark Rosewater for a lot of his understanding. My characterization won’t necessarily resonate with people who actually play Magic.
See here and here for a few of my attempts at more quantitative forecasts.