Short version: I don’t buy that humans are “micro-pseudokind” in your sense; if you say “for just $5 you could have all the fish have their preferences satisfied” I might do it, but not if I could instead spend $5 on having the fish have their preferences satisfied in a way that ultimately leads to them ascending and learning the meaning of friendship, as is entangled with the rest of my values.
Meta:
Note: I believe that AI takeover has a ~50% probability of killing billions and should be strongly avoided, and would be a serious and irreversible decision by our society that’s likely to be a mistake even if it doesn’t lead to billions of deaths.
So for starters, thanks for making acknowledgements about places we apparently agree, or otherwise attempting to demonstrate that you’ve heard my point before bringing up other points you want to argue about. (I think this makes arguments go better.) (I’ll attempt some of that myself below.)
Secondly, note that it sounds to me like you took a diametric-opposite reading of some of my intended emotional content (which I acknowledge demonstrates flaws in my writing). For instance, I intended the sentence “At that very moment they hear the dinging sound of an egg-timer, as the next-token-predictor ascends to superintelligence and bursts out of its confines” to be a caricature so blatant as to underscore the point that I wasn’t making arguments about takeoff speeds, but was instead focusing on the point about “complexity” not being a saving grace (and “monomaniacalism” not being the issue here). (Alternatively, perhaps I misunderstand what things you call the “emotional content” and how you’re reading it.)
Thirdly, I note that for whatever it’s worth, when I go to new communities and argue this stuff, I don’t try to argue people into >95% chance we’re all going to die in <20 years. I just try to present the arguments as I see them (without hiding the extremity of my own beliefs, nor while particularly expecting to get people to a similarly-extreme place with, say, a 30min talk). My 30min talk targets are usually something more like “>5% probability of existential catastrophe in <20y”. So insofar as you’re like “I’m aiming to get you to stop arguing so confidently for death given takeover”, you might already have met your aims in my case.
(Or perhaps not! Perhaps there’s plenty of emotional-content leaking through given the extremity of my own beliefs, that you find particularly detrimental. To which the solution is of course discussion on the object-level, which I’ll turn to momentarily.)
Object:
First, I acknowledge that if an AI cares enough to spend one trillionth of its resources on fulfilling the preferences of existing “weak agents” in precisely the right way, then there’s a decent chance that current humans experience an enjoyable future.
With regards to your arguments about what you term “kindness” and I shall term “pseudokindness” (on account of thinking that “kindness” brings too much baggage), here’s a variety of places that it sounds like we might disagree:
Pseudokindness seems underdefined, to me, and I expect that many ways of defining it don’t lead to anything like good outcomes for existing humans.
Suppose the AI is like “I am pico-pseudokind; I will dedicate a trillionth of my resources to satisfying the preferences of existing weak agents by granting those existing weak agents their wishes”, and then only the most careful and conscientious humans manage to use those wishes in ways that leave them alive and well.
There are lots and lots of ways to “satisfy the preferences” of the “weak agents” that are humans. Getting precisely the CEV (or whatever it should be repaired into) is a subtle business. Most humans probably don’t yet recognize that they could or should prefer taking their CEV over various more haphazard preference-fulfilments that ultimately leave them unrecognizable and broken. (Or, consider what happens when a pseudokind AI encounters a baby, and seeks to satisfy its preferences. Does it have the baby age?)
You’ve got to do some philosophy to satisfy the preferences of humans correctly. And the issue isn’t that the AI couldn’t solve those philosophy problems correctly-according-to-us, it’s that once we see how wide the space of “possible ways to be pseudokind” is, then “pseudokind in the manner that gives us our CEVs” starts to feel pretty narrow against “pseudokind in the manner that fulfills our revealed preferences, or our stated preferences, or the poorly-considered preferences of philosophically-immature people, or whatever”.
I doubt that humans are micro-pseudokind, as defined. And so in particular, all your arguments of the form “but we’ve seen it arise once” seem suspect to me.
Like, suppose we met fledgeling aliens, and had the opportunity to either fulfil their desires, or leave them alone to mature, or affect their development by teaching them the meaning of friendship. My guess is that we’d teach them the meaning of friendship. I doubt we’d hop in and fulfil their desires.
(Perhaps you’d counter with something like: well if it was super cheap, we might make two copies of the alien civilization, and fulfil one’s desires and teach the other the meaning of friendship. I’m skeptical, for various reasons.)
More generally, even though “one (mill|trill)ionth” feels like a small fraction, the obvious way to avoid dedicating even a (mill|trill)ionth of your resources to X is if X is right near something even better that you might as well spend the resources on instead.
There’s all sorts of ways to thumb the scales in how a weak agent develops, and there’s many degrees of freedom about what counts as a “pseudo-agent” or what counts as “doing justice to its preferences”, and my read is that humans take one particular contingent set of parameters here and AIs are likely to take another (and that the AI’s other-settings are likely to lead to behavior not-relevantly-distinct from killing everyone).
My read is that insofar as humans do have preferences about doing right by other weak agents, they have all sorts of desire-to-thumb-the-scales mixed in (such that humans are not actually pseudokind, for all that they might be kind).
I have a more-difficult-to-articulate sense that “maybe the AI ends up pseudokind in just the right way such that it gives us a (small, limited, ultimately-childless) glorious transhumanist future” is the sort of thing that reality gets to say “lol no” to, once you learn more details about how the thing works internally.
Most of my argument here is that the space of ways things can end up “caring” about the “preferences” of “weak agents” is wide, and most points within it don’t end up being our point in it, and optimizing towards most points in it doesn’t end up keeping us around at the extremes. My guess is mostly that the space is so wide that you don’t even end up with AIs warping existing humans into unrecognizable states, but do in fact just end up with the people dead (modulo distant aliens buying copies, etc).
I haven’t really tried to quantify how confident I am of this; I’m not sure whether I’d go above 90%, \shrug.
It occurs to me that one possible source of disagreement here is, perhaps you’re trying to say something like:
Nate, you shouldn’t go around saying “if we don’t competently intervene, literally everybody will die” with such a confident tone, when you in fact think there’s a decent chance of scenarios where the AIs keep people around in some form, and make some sort of effort towards fulfilling their desires; most people don’t care about the cosmic endowment like you do; the bluntly-honest and non-manipulative thing to say is that there’s a decent chance they’ll die and a better chance that humanity will lose the cosmic endowment (as you care about more than they do),
whereas my stance has been more like
most people I meet are skeptical that uploads count as them; most people would consider scenarios where their bodies are destroyed by rapid industrialization of Earth but a backup of their brain is stored and then later run in simulation (where perhaps it’s massaged into an unrecognizable form, or kept in an alien zoo, or granted a lovely future on account of distant benefactors, or …) to count as “death”; and also those exotic scenarios don’t seem all that likely to me, so it hasn’t seemed worth caveating.
I’m somewhat persuaded by the claim that failing to mention even the possibility of having your brainstate stored, and then run-and-warped by an AI or aliens or whatever later, or run in an alien zoo later, is potentially misleading.
I’m considering adding footnotes like “note that when I say “I expect everyone to die”, I don’t necessarily mean “without ever some simulation of that human being run again”, although I mostly don’t think this is a particularly comforting caveat”, in the relevant places. I’m curious to what degree that would satisfy your aims (and I welcome workshopped wording on the footnotes, as might both help me make better footnotes and help me understand better where you’re coming from).
I disagree with this but am happy your position is laid out. I’ll just try to give my overall understanding and reply to two points.
Like Oliver, it seems like you are implying:
Humans may be nice to other creatures in some sense, but if the fish were to look at the future that we’d achieve for them using the 1/billionth of resources we spent on helping them, it would be as objectionable to them as “murder everyone” is to us.
I think that normal people being pseudokind in a common-sensical way would instead say:
If we are trying to help some creatures, but those creatures really dislike the proposed way we are “helping” them, then we should try a different tactic for helping them.
I think that some utilitarians (without reflection) plausibly would “help the humans” in a way that most humans consider as bad as being murdered. But I think this is an unusual feature of utilitarians, and most people would consult the beneficiaries, observe they don’t want to be murdered, and so not murder them.
I think that saying “Helping someone in a way they like, sufficiently precisely to avoid things like murdering them, requires precisely the right form of caring—and that’s super rare” is a really misleading sense of how values work and what targets are narrow. I think this is more obvious if you are talking about how humans would treat a weaker species. If that’s the state of the disagreement I’m happy to leave it there.
I’m somewhat persuaded by the claim that failing to mention even the possibility of having your brainstate stored, and then run-and-warped by an AI or aliens or whatever later, or run in an alien zoo later, is potentially misleading.
This is an important distinction at 1/trillion levels of kindness, but at 1/billion levels of kindness I don’t even think the humans have to die.
If we are trying to help some creatures, but those creatures really dislike the proposed way we are “helping” them, then we should do something else.
My picture is less like “the creatures really dislike the proposed help”, and more like “the creatures don’t have terribly consistent preferences, and endorse each step of the chain, and wind up somewhere that they wouldn’t have endorsed if you first extrapolated their volition (but nobody’s extrapolating their volition or checking against that)”.
It sounds to me like your stance is something like “there’s a decent chance that most practically-buildable minds pico-care about correctly extrapolating the volition of various weak agents and fulfilling that extrapolated volition”, which I am much more skeptical of than the weaker “most practically-buildable minds pico-care about satisfying the preferences of weak agents in some sense”.
We’re not talking about practically building minds right now, we are talking about humans.
We’re not talking about “extrapolating volition” in general. We are talking about whether—in attempting to help a creature with preferences about as coherent as human preferences—you end up implementing an outcome that creature considers as bad as death.
For example, we are talking about what would happen if humans were trying to be kind to a weaker species that they had no reason to kill, that could nevertheless communicate clearly and had preferences about as coherent as human preferences (while being very alien).
And those creatures are having a conversation amongst themselves before the humans arrive wondering “Are the humans going to murder us all?” And one of them is saying “I don’t know, they don’t actually benefit from murdering us and they seem to care a tiny bit about being nice, maybe they’ll just let us do our thing with 1/trillionth of the universe’s resources?” while another is saying “They will definitely have strong opinions about what our society should look like and the kind of transformation they implement is about as bad by our lights as being murdered.”
In practice attempts to respect someone’s preferences often involve ideas like autonomy and self-determination and respect for their local preferences. I really don’t think you have to go all the way to extrapolated volition in order to avoid killing everyone.
Humans wound up caring at least a little about satisfying the preferences of other creatures, not in a “grant their local wishes even if that ruins them” sort of way but in some other intuitively-reasonable manner.
Humans are the only minds we’ve seen so far, and so having seen this once, maybe we start with a 50%-or-so chance that it will happen again.
You can then maybe drive this down a fair bit by arguing about how the content looks contingent on the particulars of how humans developed or whatever, and maybe that can drive you down to 10%, but it shouldn’t be able to drive you down to 0.1%, especially not if we’re talking only about incredibly weak preferences.
Is this a reasonable paraphrase of your argument?
If so, one guess is that a bunch of disagreement lurks in this “intuitively-reasonable manner” business.
A possible locus of disagreement: it looks to me like, if you give humans power before you give them wisdom, it’s pretty easy to wreck them while simply fulfilling their preferences. (Ex: lots of teens have dumbass philosophies, and might be dumb enough to permanently commit to them if given that power.)
More generally, I think that if mere-humans met very-alien minds with similarly-coherent preferences, and if the humans had the opportunity to magically fulfil certain alien preferences within some resource-budget, my guess is that the humans would have a pretty hard time offering power and wisdom in the right ways such that this overall went well for the aliens by their own lights (as extrapolated at the beginning), at least without some sort of volition-extrapolation.
(I separately expect that if we were doing something more like the volition-extrapolation thing, we’d be tempted to bend the process towards “and they learn the meaning of friendship”.)
That said, this conversation is updating me somewhat towards “a random UFAI would keep existing humans around and warp them in some direction it prefers, rather than killing them”, on the grounds that the argument “maybe preferences-about-existing-agents is just a common way for rando drives to shake out” plausibly supports it to a threshold of at least 1 in 1000. I’m not sure where I’ll end up on that front.
Another attempt at naming a crux: It looks to me like you see this human-style caring about others’ preferences as particularly “simple” or “natural”, in a way that undermines “drawing a target around the bullseye”-type arguments, whereas I could see that argument working for “grant all their wishes (within a budget)” but am much more skeptical when it comes to “do right by them in an intuitively-reasonable way”.
(But that still leaves room for an update towards “the AI doesn’t necessarily kill us, it might merely warp us, or otherwise wreck civilization by bounding us and then giving us power-before-wisdom within those bounds or suchlike, as might be the sort of whims that rando drives shake out into”, which I’ll chew on.)
More generally, I think that if mere-humans met very-alien minds with similarly-coherent preferences, and if the humans had the opportunity to magically fulfill certain alien preferences within some resource-budget, my guess is that the humans would have a pretty hard time offering power and wisdom in the right ways such that this overall went well for the aliens by their own lights (as extrapolated at the beginning), at least without some sort of volition-extrapolation.
Isn’t the worst case scenario just leaving the aliens alone? If I’m worried I’m going to fuck up some alien’s preferences, I’m just not going to give them any power or wisdom!
I guess you think we’re likely to fuck up the alien’s preferences by light of their reflection process, but not our reflection process. But this just recurs to the meta level. If I really do care about an alien’s preferences (as it feels like I do), why can’t I also care about their reflection process (which is just a meta preference)?
I feel like the meta level at which I no longer care about doing right by an alien is basically the meta level at which I stop caring about someone doing right by me. In fact, this is exactly how it seems mentally constructed: what I mean by “doing right by [person]” is “what that person would mean by ‘doing right by me’”. This seems like either something as simple as it naively looks, or sensitive to weird hyperparameters I’m not sure I care about anyway.
(But that still leaves room for an update towards “the AI doesn’t necessarily kill us, it might merely warp us, or otherwise wreck civilization by bounding us and then giving us power-before-wisdom within those bounds or suchlike, as might be the sort of whims that rando drives shake out into”, which I’ll chew on.)
FWIW this is my view. (Assuming no ECL/MSR or acausal trade or other such stuff. If we add those things in, the situation gets somewhat better in expectation I think, because there’ll be trades with faraway places that DO care about our CEV.)
My reading of the argument was something like “bullseye-target arguments refute an artificially privileged target being rated significantly likely under ignorance, e.g. the probability that random aliens will eat ice cream is not 50%. But something like kindness-in-the-relevant-sense is the universal problem faced by all evolved species creating AGI, and is thus not so artificially privileged, and as a yes-no question about which we are ignorant the uniform prior assigns 50%”. It was more about the hypothesis not being artificially privileged by path-dependent concerns than the notion being particularly simple, per se.
I sometimes mention the possibility of being stored and sold to aliens a billion years later, which seems to me to validly incorporate most all the hopes and fears and uncertainties that should properly be involved, without getting into any weirdness that I don’t expect Earthlings to think about validly.
My guess is mostly that the space is so wide that you don’t even end up with AIs warping existing humans into unrecognizable states, but do in fact just end up with the people dead
Why? I see a lot of opportunities for s-risk or just generally suboptimal future in such options, but “we don’t want to die, or at any rate we don’t want to die out as a species” seems like an extremely simple, deeply-ingrained goal that almost any metric by which the AI judges our desires should be expected to pick up, assuming it’s at all pseudokind. (In many cases, humans do a lot to protect endangered species even as we do diddly-squat to fulfill individual specimens’ preferences!)