Response to nostalgebraist: proudly waving my moral-antirealist battle flag

@nostalgebraist has recently posted yet another thought-provoking post, this one on how we should feel about AI ruling a long-term posthuman future. [Previous discussion of this same post on lesswrong.] His post touches on some of the themes of Joe Carlsmith’s “Otherness and Control in the Age of AI” series—a series which I enthusiastically recommend—but nostalgebraist takes those ideas much further, in a way that makes me want to push back.

Nostalgebraist’s post is casual, trying to reify and respond to a “doomer” vibe, rather than responding to specific arguments by specific people. Now, I happen to self-identify as a “doomer” sometimes. (Is calling myself a “doomer” bad epistemics and bad PR? Eh, I guess. But also: it sounds cool.) But I too have plenty of disagreements with others in the “doomer” camp (cf: “Rationalist (n.) Someone who disagrees with Eliezer Yudkowsky”.). Maybe nostalgebraist and I have common ground? I dunno. Be that as it may, here are some responses to certain points he brings up.
1. The “notkilleveryoneism” pitch is not about longtermism, and that’s fine
Nostalgebraist is mostly focusing on longtermist considerations, and I’ll mostly do that too here. But on our way there, in the lead-in, nostalgebraist does pause to make a point about the term “notkilleveryoneism”:
They call their position “notkilleveryoneism,” to distinguish that position from other worries about AI which don’t touch on the we’re-all-gonna-die thing. And who on earth would want to be a not-notkilleveryoneist?
But they do not mean, by these regular-Joe words, the things that a regular Joe would mean by them.
We are, in fact, all going to die. Probably, eventually. AI or no AI.
In a hundred years, if not fifty. By old age, if nothing else. You know what I mean.…
OK, my understanding was:
(1) we doomers are unhappy about the possibility of AI killing all humans because we’re concerned that the resulting long-term AI future would be a future we don’t want; and
(2) we doomers are also unhappy about the possibility of AI killing all humans because we are human and we don’t want to get murdered by AIs. And also, some of us have children with dreams of growing up and having kids of their own and being a famous inventor or oh wait actually I’d rather work for Nintendo on their Zelda team or hmm wait does Nintendo hire famous inventors? …And all these lovely aspirations again would require not getting murdered by AIs.
If we think of the “notkilleveryoneism” term as part of a communication and outreach strategy, then it’s a strategy that appeals to Average Joe’s desire to not be murdered by AIs, and not to Average Joe’s desires about the long-term future.
And that’s fine! Average Joe has every right to not be murdered, and honestly it’s a safe bet that Average Joe doesn’t have carefully-considered coherent opinions about the long-term future anyway.
Sometimes there’s more than one reason to want a problem to be solved, and you can lead with the more intuitive one. I don’t think anyone is being disingenuous here (although see comment).
1.1 …But now let’s get back to the longtermist stuff
Anyway, that was kinda a digression from the longtermist stuff which forms the main subject of nostalgebraist’s post.
Suppose AI takes over, wipes out humanity, and colonizes the galaxy in a posthuman future. He and I agree that it’s at least conceivable that this long-term posthuman future would be a bad future, e.g. if the AI was a paperclip maximizer. And he and I agree that it’s also possible that it would be a good future, e.g. if there is a future full of life and love and beauty and adventure throughout the cosmos. Which will it be? Let’s dive into that discussion.
2. Cooperation does not require kindness
Here’s nostalgebraist:
I can perhaps imagine a world of artificial X-maximizers, each a superhuman genius, each with its own inane and simple goal.
What I really cannot imagine is a world in which these beings, for all their intelligence, cannot notice that ruthlessly undercutting one another at every turn is a suboptimal equilibrium, and that there is a better way.
Leaving aside sociopaths (more on which below), humans have (I claim) “innate drives”, some of which lead to them having feelings associated with friendship, compassion, generosity, etc., as ends in themselves.
If you look at the human world, you might think that these innate drives are absolutely essential for cooperation and coordination. But they’re not!
For example, look at companies. Companies do not have innate drives towards cooperation that lead to them intrinsically caring about the profits of other companies, as an end in themselves. Rather, company leaders systematically make decisions that maximize their own company’s success.[1] And yet, companies cooperate anyway, all the time! How? Well, maybe they draw up detailed contracts, and maybe they use collateral or escrow, and maybe they check each other’s audited account books, and maybe they ask around to see whether this company has a track record of partnering in good faith, and so on. There are selfish profit-maximizing reasons to be honest, to be cooperative, to negotiate in good faith, to bend over backwards, and so on.
So, cooperation and coordination are entirely possible and routine in the absence of true intrinsic altruism, i.e. in the absence of any intrinsic feeling that generosity is an end in itself.
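To make the game-theory point behind this concrete, here’s a tiny toy simulation—my own illustration, with made-up payoff numbers, not anything from nostalgebraist’s post. A purely selfish payoff-maximizer, playing a repeated prisoner’s dilemma against a partner who punishes betrayal, ends up cooperating simply because cooperation pays better. No feelings of friendship anywhere in the code.

```python
# Toy illustration: cooperation among purely selfish agents.
# In a repeated prisoner's dilemma, a selfish payoff-maximizer facing a partner
# who punishes betrayal (grim trigger: cooperate until crossed, then defect
# forever) does better by cooperating -- no kindness required.

# One-shot payoffs to "me", as a function of (my move, partner's move).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def discounted_payoff(my_strategy, delta, rounds=200):
    """My total discounted payoff against a grim-trigger partner."""
    partner_punishing = False
    total = 0.0
    for t in range(rounds):
        partner = "D" if partner_punishing else "C"
        me = "D" if my_strategy == "always_defect" else "C"
        total += (delta ** t) * PAYOFF[(me, partner)]
        if me == "D":
            partner_punishing = True  # betrayal is never forgotten
    return total

for delta in (0.3, 0.9):  # delta = how much the future matters
    coop = discounted_payoff("always_cooperate", delta)
    defect = discounted_payoff("always_defect", delta)
    print(f"delta={delta}: cooperate={coop:.1f}, defect={defect:.1f}")
# With delta=0.9, cooperating wins on purely selfish grounds;
# with delta=0.3 (the future barely matters), defecting wins.
```

The only thing doing the work is the shadow of the future (the discount factor); there is no altruism term anywhere.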
I concede that true intrinsic altruism has some benefits that can’t be perfectly replaced by complex contracts and enforcement mechanisms. If nothing else, you have to lawyer up every time anything changes. Theoretically, if partnering companies could mutually agree to intrinsically care about each others’ profits (to a precisely-calibrated extent), then that would be a Pareto-improvement over the status quo. But I have two responses:
First, game theory is a bitch sometimes. Just because beings find themselves in a suboptimal equilibrium doesn’t necessarily mean that this equilibrium won’t happen anyway. Maybe the so-called “suboptimal equilibrium” is in fact the only stable equilibrium. (A toy example below, after my second response, makes this concrete.)
Second, the above is probably moot, because it seems very likely to me that sufficiently advanced competing AIs would be able to cooperate quite well indeed by not-truly-altruistic contractual mechanisms. And maybe they could cooperate even better by doing something like “merging” (e.g. jointly designing a successor AI that they’re both happy to hand over their resources to). None of this would involve any intrinsic feelings of friendship and compassion anywhere in sight.
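And here’s the toy example promised above—again my own illustration with arbitrary payoff numbers: the classic one-shot prisoner’s dilemma, where mutual defection is the only Nash equilibrium even though both players would strictly prefer mutual cooperation. The “suboptimal equilibrium” really is the only stable one.

```python
# Toy illustration of "the suboptimal equilibrium is the only stable one":
# in a one-shot prisoner's dilemma, mutual defection is the unique Nash
# equilibrium even though both players prefer mutual cooperation.

ACTIONS = ("C", "D")
# (row player's payoff, column player's payoff)
PAYOFFS = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

def is_nash(row, col):
    """Neither player can gain by unilaterally switching actions."""
    r_payoff, c_payoff = PAYOFFS[(row, col)]
    row_ok = all(PAYOFFS[(alt, col)][0] <= r_payoff for alt in ACTIONS)
    col_ok = all(PAYOFFS[(row, alt)][1] <= c_payoff for alt in ACTIONS)
    return row_ok and col_ok

for profile in [(r, c) for r in ACTIONS for c in ACTIONS]:
    label = "Nash equilibrium" if is_nash(*profile) else ""
    print(profile, PAYOFFS[profile], label)
# Only ("D", "D") is a Nash equilibrium -- a stable outcome that both players
# dislike relative to ("C", "C"), yet neither can escape unilaterally.
```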
So again, beings that experience feelings of friendship, compassion, etc., as ends in themselves are not necessary for cooperative behavior to happen, and in any case, to the extent that those feelings help facilitate cooperative behavior, that doesn’t prove that they’ll be part of the future.
(Incidentally, to respond to another point raised by nostalgebraist, just as AIs without innate friendship emotions could nevertheless cooperate for strategic instrumental reasons, it is equally true that AIs without innate “curiosity” and “changeability” could nevertheless do explore-exploit behavior for strategic instrumental reasons. See e.g. discussion here.)
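On that last point, here’s a minimal sketch of what I mean—my own toy example using a standard UCB-style multi-armed bandit with made-up payout probabilities, not anything from the linked discussion. The agent below “explores” unfamiliar options and then “exploits” the best one, purely as a side effect of maximizing expected reward; there is no curiosity drive anywhere in the code.

```python
# Explore/exploit behavior arising purely instrumentally from reward maximization.
import math
import random

TRUE_PAYOUT_PROB = [0.2, 0.5, 0.8]  # made-up numbers, unknown to the agent

def ucb1(num_steps=5000):
    counts = [0] * len(TRUE_PAYOUT_PROB)    # how often each arm has been tried
    totals = [0.0] * len(TRUE_PAYOUT_PROB)  # total reward collected from each arm

    def optimistic_value(i, t):
        # Rarely-tried arms get a big uncertainty bonus, so the agent tries them --
        # not out of "curiosity", but because they *might* pay off better.
        if counts[i] == 0:
            return float("inf")
        return totals[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])

    for t in range(1, num_steps + 1):
        arm = max(range(len(TRUE_PAYOUT_PROB)), key=lambda i: optimistic_value(i, t))
        reward = 1.0 if random.random() < TRUE_PAYOUT_PROB[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts

print(ucb1())  # most pulls land on the best arm; the early spread across arms is
               # "exploration", and it falls out of pure reward maximization
```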
3. “Wanting some kind of feeling of friendship, compassion, or connection to exist at all in the distant future” seems (1) important, (2) not the “conditioners” thing, (3) not inevitable
I mentioned “leaving aside sociopaths” above. But hey, what about high-functioning sociopaths? They are evidently able to do extremely impressive things, far beyond current AI technology, for better or worse (usually worse).
Like, SBF was by all accounts a really sharp guy, and moreover he pulled off one of the great frauds of our era. I mean, I think of myself as a pretty smart guy who can get things done, but man, I would never be able to commit fraud at even 1% of the scale that SBF did! By the same token, I’ve only known two sociopaths well in my life, and one of them skyrocketed through the ranks of his field—he’s currently the head of research at a major R1 university, with occasional spells as a government appointee in positions of immense power.
Granted, sociopaths have typical areas of incompetence too, like unusually-strong aversion to doing very tedious things that would advance their long-term plans. But I really think there isn’t any deep tie between those traits and their lack of guilt or compassion. Instead I think it’s an incidental correlation—I think they’re two different effects of the same neuroscientific root cause. I can’t prove that but you can read my opinions here.
So we have something close to an existence proof for the claim “it’s possible to have highly-intelligent and highly-competent agents that don’t have any kind of feeling of friendship, compassion, or connection as an innate drive”. It’s not only logically possible, but indeed something close to that actually happens regularly in the real world.
So here’s something I believe:
Claim: A long-term posthuman future of AIs that don’t have anything like feelings of friendship, compassion, or connection—making those things intrinsically desirable for their own sake, independent of their instrumental usefulness for facilitating coordination—would be a bad future that we should strive to avoid.[2]
This is a moral claim, so I can’t prove it. (See §5 below!) But it’s something I feel strongly about.
By making this claim, am I inappropriately micromanaging the future like CS Lewis’s “Conditioners”, or like nostalgebraist’s imagined “teacher”? I don’t think so, right?
Am I abusing my power, violating the wishes of all previous generations? Again, I don’t think so. I think my ancestors all the way back to the Pleistocene would be on board with this claim too.
Am I asserting a triviality, because my wish will definitely come true? I don’t think so! Again, human sociopaths exist. In fact, for one possible architecture of future AGI (brain-like model-based RL), I strongly believe that, by default, this wish will not come true, in the absence of specific effort including solving currently-unsolved technical problems. Speaking of which…
4. “Strong orthogonality” (= the counting argument for scheming) isn’t (or at least, shouldn’t be) a strong generic argument for doom, but rather one optional part of a discussion that gets into the weeds
I continue to strongly believe that:
The thing that nostalgebraist calls “a weak form of the orthogonality thesis” is what should be properly called “The Orthogonality Thesis”;
The thing that nostalgebraist calls “the strong form of orthogonality” should be given a different name—Joe Carlsmith’s “the counting argument for scheming” seems like a solid choice here.
But for the purpose of this specific post, to make it easier for readers to follow the discussion, I will hold my nose and go along with nostalgebraist’s terrible decision to use the terms “weak orthogonality” vs “strong orthogonality”.
OK, so let’s talk about “strong orthogonality”. Here’s his description:
The strong form of orthogonality is rarely articulated precisely, but says something like: all possible values are equally likely to arise in systems selected solely for high intelligence.
It is presumed here that superhuman AIs will be formed through such a process of selection. And then, that they will have values sampled in this way, “at random.”
From some distribution, over some space, I guess.
You might wonder what this distribution could possibly look like, or this space. You might (for instance) wonder if pathologically simple goals, like paperclip maximization, would really be very likely under this distribution, whatever it is.
In case you were wondering, these things have never been formalized, or even laid out precisely-but-informally. This was not thought necessary, it seems, before concluding that the strong orthogonality thesis was true.
That is: no one knows exactly what it is that is being affirmed, here. In practice it seems to squish and deform agreeably to fit the needs of the argument, or the intuitions of the one making it.
I don’t know what exactly this is in response to. For what it’s worth, I am very opposed to the strong orthogonality thesis as thus described.
But here’s a claim that I believe:
Claim: If there’s a way to build AGI, and there’s nothing in particular about its source code or training process that would lead to an intrinsic tendency to kindness as a terminal goal, then we should strongly expect such an intrinsic tendency to not arise—not towards other AGIs, and not towards humans.
Such AGIs would cooperate with humans when it’s in their selfish interest to do so, and then stab humans in the back as soon as the situation changes.
If you disagree with that claim above, then you presumably believe either:
“it’s plausible for feelings of kindness and compassion towards humans and/or other AIs to arise purely by coincidence, for no reason at all”, or
“a sufficiently smart AI will simply reason its way to having feelings of kindness and compassion, and seeing them as ends-in-themselves rather than useful strategies, by a purely endogenous process”.
I think the former is just patently absurd. And if anyone believes the latter, I think they should re-read §3 above and also §5 below. (But nostalgebraist presumably agrees? He views “weak orthogonality” as obviously true, right?)
The thing about the above claim is, it says “IF there’s nothing in particular about its source code or training process that would lead to an intrinsic tendency to kindness as a terminal goal…”. And that’s a very big “if”! It’s quite possible that there will be something about the source code or training process that offers at least prima facie reason to think that kindness might arise non-coincidentally and non-endogenously.
…And now we’re deep in the weeds. What are those reasons to think that kindness is gonna get into the AI? Do the arguments stand up to scrutiny? To answer that, we need to be talking about how the AI algorithms will work. And what training data / training environments they’ll use. And how they’ll be tested, and whether the tests will actually work. And these things in turn depend partly on what future human AI programmers will do, which in turn depends on those programmers’ knowledge and beliefs and incentives and selection-process and so on.
So if anyone is talking about “strong orthogonality” as a generic argument that AI alignment is hard, with no further structure fleshing out the story, then I’m opposed to that! But I question how common this is—I think it’s a bit of a strawman. Yes, people invoke “strong orthogonality” (counting arguments) sometimes, but I think (and hope) that they have a more fleshed-out story in mind behind the scenes (e.g. see this comment thread).

Also, I think it’s insufficiently appreciated that arch-doomers like Nate Soares get a lot of their confidence in doom from doom being disjunctive, rather than from the technical alignment challenge in isolation. (This is very true for me.)
My own area of professional interest is the threat model where future AGI is not based on LLMs as we know them today, but rather based on model-based RL more like human brains. In this case, I think there’s a strong argument that we don’t get kindness by default, and moreover that we don’t yet have any good technical plan that would yield robust feelings of kindness. This argument does NOT involve any “strong orthogonality” a.k.a. counting argument, except in the minimal sense of the “Claim” above.
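To gesture at what I mean (and only gesture—this is a deliberately crude toy of my own, not a description of brain-like AGI or of any actual proposal): in a generic model-based RL setup, the world-model is learned for predictive accuracy and the planner just searches for whatever the reward function scores highly. The agent’s “values” enter exclusively through that designer-supplied reward signal; swap it out and the identical machinery pursues the opposite goal. There is no step in the loop where kindness would sneak in unless the reward signal or training setup deliberately puts it there.

```python
# Crude, runnable toy of a generic model-based RL loop. Structural point only:
# the world-model and planner are value-agnostic; the agent's goals live
# entirely in whichever reward function the designers plug in.
import random

N_STATES = 10  # a tiny 1-D world: states 0..9, actions move left (-1) or right (+1)

def true_dynamics(state, action):
    """The real environment (unknown to the agent)."""
    return max(0, min(N_STATES - 1, state + action))

def learn_world_model(n_samples=2000):
    """Learn transitions from random experience -- pure prediction, no values involved."""
    model = {}
    for _ in range(n_samples):
        s, a = random.randrange(N_STATES), random.choice([-1, +1])
        model[(s, a)] = true_dynamics(s, a)
    return model

def plan(state, model, reward_fn, horizon=12):
    """Pick the first action whose imagined rollout scores best under reward_fn."""
    def rollout_value(first_action):
        s, total, a = state, 0.0, first_action
        for _ in range(horizon):
            s = model.get((s, a), s)   # imagine the next state
            total += reward_fn(s)      # score it with the plugged-in reward
            a = max([-1, +1], key=lambda b: reward_fn(model.get((s, b), s)))
        return total
    return max([-1, +1], key=rollout_value)

model = learn_world_model()
likes_high = lambda s: float(s)             # one designer-specified reward function
likes_low = lambda s: float(N_STATES - s)   # a different designer-specified reward function
print("reward says 'high states good':", plan(5, model, likes_high))  # prints  1 (go right)
print("reward says 'low states good': ", plan(5, model, likes_low))   # prints -1 (go left)
```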
5. Yes you can make Hume’s Law / moral antirealism sound silly, but that doesn’t make it wrong.
For my part, I’m very far into the moral antirealism camp, going quite a bit further than Eliezer—you can read some of my alignment-discourse-relevant hot-takes here. (See also: a nice concise argument for Hume’s law / “weak orthogonality” by Eliezer here.) I’m a bit confused by nostalgebraist’s position, in that he considers (what he calls) “weak orthogonality” to be obviously true, but the rest of the post seems to contradict that in very strong terms:
The “human” of the doomer picture seems to me like a man who mouths the old platitude, “if I had been born in another country, I’d be waving a different flag” – and then goes out to enlist in his country’s army, and goes off to war, and goes ardently into battle, willing to kill in the name of that same flag.
Who shoots down the enemy soldiers while thinking, “if I had been born there, it would have been all-important for their side to win, and so I would have shot at the men on this side. However, I was born in my country, not theirs, and so it is all-important that my country should win, and that theirs should lose.
There is no reason for this. It could have been the other way around, and everything would be left exactly the same, except for the ‘values.’
I cannot argue with the enemy, for there is no argument in my favor. I can only shoot them down.
There is no reason for this. It is the most important thing, and there is no reason for it.
The thing that is precious has no intrinsic appeal. It must be forced on the others, at gunpoint, if they do not already accept it.
I cannot hold out the jewel and say, ‘look, look how it gleams? Don’t you see the value!’ They will not see the value, because there is no value to be seen.
There is nothing essentially “good” there, only the quality of being-worthy-of-protection-at-all-costs. And even that is a derived attribute: my jewel is only a jewel, after all, because it has been put into the jewel-box, where the thing-that-is-a-jewel can be found. But anything at all could be placed there.
How I wish I were allowed to give it up! But alas, it is all-important. Alas, it is the only important thing in the world! And so, I lay down my life for it, for our jewel and our flag – for the things that are loathsome and pointless, and worth infinitely more than any life.”
The last paragraph seems wildly confused—why on Earth would I wish to give up the very things that I care about most? And I have some terminological quibbles in various places.[3] But anyway, by and large, yes. I am biting this bullet. The above excerpt is a description of moral antirealism, and you can spend all day making it sound silly, but like it or not, I claim that moral antirealism is a fact of life.
Fun fact: People sometimes make moral appeals to sociopaths, trying to convince them to show kindness and generosity towards others because it’s the right thing to do. These interventions don’t work. Quite the contrary, the sociopaths typically come to better understand the psychology of non-sociopaths, and use that knowledge to better manipulate and hurt them. It’s like a sociopath “finishing school”.[4]
If I had been born a sadistic sociopath, I would value causing pain and suffering. But I wasn’t, so I value the absence of pain and suffering. Laugh all you want, but I was born on the side opposed to pain and suffering, and I’m proudly waving my flag. I will hold my flag tight to the ends of the Earth. I don’t want to kill sadistic sociopaths, but I sure as heck don’t want sadistic sociopaths to succeed at their goals. If any readers feel the same way, then come join me in battle. I have extra flags in my car.
[1] There are exceptions. If there are two firms where the CEOs or key decision-makers are dear friends, then some corporate decisions might get made for not-purely-selfish reasons. Relatedly, I’m mostly talking about large USA businesses. The USA has a long history of fair contract enforcement, widely-trusted institutions, etc., that enables this kind of cooperation. Other countries don’t have that, and then the process of forming a joint business venture involves decision-makers on both sides sharing drinks, meeting each others’ families, and so on—see The Culture Map by Erin Meyer, chapter 6.
[2] Note that I’m not asserting the converse; I think this is necessary but not sufficient for a good future. I’m just trying to make a narrow maximally-uncontroversial claim in an attempt to find common ground.
[3] For example, nostalgebraist seems to be defining the word “good” to mean something like “intrinsically motivating (upon reflection) to all intelligent beings”. But under that definition, I claim, there would be nothing whatsoever in the whole universe that is “good”. So instead I personally would define the word “good” to mean “a cluster of things probably including friendship and beauty and justice and the elimination of suffering”. The fact that I listed those four examples, and not some other very different set of four examples, is indeed profoundly connected to the fact that I’m a human with a human brain talking to other humans with human brains. So it goes. Again my meta-ethical take is here.
[4] I’m not 100% sure, but I believe I read this in the fun pop-science book The Psychopath Test. Incidentally, there do seem to be interventions that appeal to sociopaths’ own self-interest—particularly their selfish interest in not being in prison—to help turn really destructive sociopaths into the regular everyday kind of sociopaths who are still awful to the people around them but at least they’re not murdering anyone. (Source.)