I think some of the confusion here comes from my using “kind” to refer to “respecting the preferences of existing weak agents”; I don’t have a better handle, but I could have just used a made-up word.
Yeah, sorry, I noticed the same thing a few minutes ago: I was probably at least somewhat misled by the more standard meaning of kindness.
Tabooing “kindness”, I am saying something like:
Yes, I don’t think extrapolated current humans assign approximately any value to the exact preference of “respecting the preferences of existing weak agents” and I don’t really believe that you would on-reflection endorse that preference either.
Separately (though relatedly), each word in that sentence sure feels like the kind of thing that I do not feel comfortable leaning on heavily as I optimize strongly against it, and that hides a ton of implicit assumptions: ‘agent’ being a meaningful concept in the first place, or ‘existing’, or ‘weak’, or ‘preferences’. I expect all of these are concepts I would come to see as terribly confused once I understood the real concepts that carve reality more at its joints. This means the sentence sounds deceptively simple and robust, but it really doesn’t feel like the kind of thing whose meaning will stay simple as an AI does more conceptual refinement.
I called this utilitarian optimization, but it might have been more charitable to call it “impartial” optimization. Impartiality between the existing creatures and the not-yet-created creatures seems like one of the key characteristics of utilitarianism, while being very rare in the broader world. It’s also “utilitarian” in the sense that it’s willing to spare nothing (or at least not 1/trillion) for the existing creatures, and this kind of maximizing stance is another of the big defining features of utilitarianism. So I do still feel like “utilitarian” is an OK way of pointing at the basic difference between where you expect intelligent minds to end up vs. how normal people think about concepts like being nice.
The reason I objected to this characterization is that I was trying to point at something more general than the “impartialness”. Like, to paraphrase what this sentence sounds like to me, it’s as if someone from a pre-modern era, arguing about future civilizations, said: “It’s weird that the future humans you imagine are willing to do nothing for the gods that live in the sky, and the spirits that make our plants grow.”
Like, after a bunch of ontological reflection and empirical data gathering, “gods” is just really not a good abstraction for things I care about anymore. I don’t think “impartiality” is what is causing me to not care about gods; it’s just that the concept of “gods” seems fake and doesn’t carve reality at its joints anymore. It’s also not the case that I don’t care at all about ancient gods anymore (they are pretty cool and I like the aesthetic), but the way I care about them is very different from how I care about other humans.
Not caring about gods doesn’t feel “harsh” or “utilitarian”, or like I have in some sense decided to abandon part of my values. This is what I expect it to feel like for a future human looking back at our meta-preferences about many types of other beings, and also what I expect it to feel like for AIs that maybe have some initial version of ‘caring about others’ when they are at capability levels similar to humans.
This again isn’t capturing my objection perfectly, but maybe helps point to it better.
Yes, I don’t think extrapolated current humans assign approximately any value to the exact preference of “respecting the preferences of existing weak agents” and I don’t really believe that you would on-reflection endorse that preference either.
I am quite confident that I do, and it tends to infuriate my friends who get cranky that I feel a moral obligation to respect the artistic intent of bacterial genomes: all bacteria should go vegan, yet survive, and eat food equivalent to their previous diet.
Separately (though relatedly), each word in that sentence sure feels like the kind of thing that I do not feel comfortable leaning on heavily as I optimize strongly against it, and that hides a ton of implicit assumptions …
I feel pretty uncertain about what assumptions are hiding in your “optimize strongly against X” statements. Historically this just seems hard to tease out, and I wouldn’t be surprised if I were just totally misreading you here.
That said, your writing makes me wonder “where is the heavy optimization [over the value definitions] coming from?”, since I think the preference-shards themselves are the things steering the optimization power. For example, the shards are not optimizing over themselves to find adversarial examples to themselves. Related statements:
I think that a realistic “respecting preferences of weak agents”-shard doesn’t bid for plans which maximally activate the “respect preferences of weak agents” internal evaluation metric, or even approximately maximize it within some tight bound (see the toy sketch after these statements).
A “respect weak preferences” shard might also guide the AI’s value and ontology reformation process.
A nice person isn’t being maximally nice, nor do they wish to be; they are nicely being nice.
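To make the contrast concrete, here is a purely illustrative toy sketch in Python (the plan names, scores, and thresholds are all invented for this comment, and none of it is a claim about real AI internals): the first chooser does an argmax over a “respect weak preferences” score, while the second treats that score as one bounded, contextual bid among others and never optimizes the score itself.

```python
# Toy illustration only: contrasts "pick the plan that maximally activates the
# internal metric" with "the metric contributes a bounded, contextual bid".
# All plan names and scores are invented for this sketch.

candidate_plans = {
    "tile_the_future_with_metric_maximizing_structures": {"respect_weak": 0.99, "other_values": 0.05},
    "help_the_agents_that_already_exist": {"respect_weak": 0.70, "other_values": 0.80},
    "ignore_everyone": {"respect_weak": 0.10, "other_values": 0.90},
}

def metric_maximizer(plans):
    """Chooses whichever plan maximally activates the 'respect_weak' metric."""
    return max(plans, key=lambda p: plans[p]["respect_weak"])

def shard_ensemble(plans):
    """Each shard contributes a bounded bid; no shard optimizes its own activation."""
    def total_bid(p):
        scores = plans[p]
        # The 'respect weak preferences' shard bids for plans it evaluates as
        # acceptable, rather than pushing toward the extreme of its own metric.
        respect_bid = 1.0 if scores["respect_weak"] > 0.5 else 0.0
        return respect_bid + scores["other_values"]
    return max(plans, key=total_bid)

print(metric_maximizer(candidate_plans))  # -> the metric-extreme plan
print(shard_ensemble(candidate_plans))    # -> "help_the_agents_that_already_exist"
```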
I do agree (insofar as I understand you enough to agree) that we should worry about some “strong optimization over the AI’s concepts, later in the AI’s developmental timeline.” But I think different kinds of “heavy optimization” lead to different kinds of alignment concerns.
When I try to interpret your points here, I come to the conclusion that you think humans, upon reflection, would cause human extinction (in favor of resources being used for something else).
Or at least that many/most humans would, upon reflection, prefer resources to be used for purposes other than preserving human life (including not preserving human life in simulation). And this holds even if (some of) the existing humans ‘want’ to be preserved (at least according to a conventional notion of preferences).
I think this empirical view seems pretty implausible.
That said, I think it’s quite plausible that upon reflection, I’d want to ‘wink out’ any existing copies of myself in favor of using the resources for better things. But this is partially because I personally (in my current state) would endorse such a thing: if my extrapolated volition thought it would be better to not exist (in favor of other resource usage), my current self would accept that. And it currently seems unlikely that, upon reflection, I’d want to end all human lives (in particular, I think I would probably want to keep alive humans who had preferences against non-existence). This applies regardless of trade; it’s important to note this to avoid a ‘perpetual motion machine’ type argument.
Beyond this, I think that most or many humans or aliens would, upon reflection, want to preserve currently existing humans or aliens who had a preference against non-existence. (Again, regardless of trade.)
Additionally, I think it’s quite plausible that most or many humans or aliens will enact various trades or precommitments prior to reflecting (which is probably ill-advised, but it will happen regardless). So current preferences which aren’t stable under reflection might have a significant influence overall.
This feels like it is not really understanding my point, though it’s maybe best to move this to some higher-bandwidth medium if the point is that hard to get across.
Giving it one last try: what I am saying is that I don’t think “conventional notion of preferences” is a particularly well-defined concept, and neither are a lot of the other concepts you are using in order to make your predictions here. What it means to care about the preferences of others is a thing with a lot of really messy details that tend to blow up in different ways when you think harder about it and are less anchored on the status quo.
I don’t think you currently know in what ways you would care about the preferences of others after a lot of reflection (barring game-theoretic considerations, which I think we can figure out a bit more in advance, but I am bracketing that whole angle in this discussion, though I totally agree those are important and relevant). I do think you will of course endorse the way you care about other people’s preferences after you’ve done a lot of reflection (otherwise something went wrong in your reflection process), but I don’t think you would endorse what AIs would do, and my guess is you also wouldn’t endorse what a lot of other humans would do when they undergo reflection here.
Like, what I am saying is that while there might be a relatively broad basin of conditions that give rise to something that locally looks like caring about other beings, the space of ways of caring about other beings is deep and wide, and if you have an AI that cares about other beings’ preferences in some way you don’t endorse, this doesn’t actually get you anything. And I think the arguments that the concept of “caring about others” that an AI might have (though my best guess is that it won’t even have anything that is locally well-described by that) will hold up after a lot of reflection seem much weaker to me than the arguments that it will have that preference at roughly human capability and ethical-reflection levels (which seems plausible to me, though still overall unlikely).
A zeroth approximation of pseudokindness is strict nonintervention: reifying the patient-in-environment as a closed computation and letting it run indefinitely, with some allocation of compute. Interaction with the outside world creates vulnerability to external influence, but then again so does incautious closed computation, as we currently observe with AI x-risk, which is not something beamed in from outer space.
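As a purely illustrative toy sketch of the two framings (the function names and structure are invented here, not drawn from the membranes/boundaries literature): strict nonintervention as a closed computation with a compute allocation, versus a membrane that only admits influences it judges acceptable for this particular patient.

```python
# Toy sketch of the "zeroth approximation" vs. a membrane that filters
# external influences. Everything here is invented for illustration; it is
# not a proposal for how real systems would implement this.

from typing import Any, Callable, Iterable

def run_closed(step: Callable[[Any], Any], state: Any, compute_budget: int) -> Any:
    """Strict nonintervention: the patient-in-environment runs as a closed
    computation on its own dynamics until its compute allocation is spent."""
    for _ in range(compute_budget):
        state = step(state)
    return state

def run_with_membrane(
    step: Callable[[Any], Any],
    state: Any,
    compute_budget: int,
    inbox: Callable[[int], Iterable[Callable[[Any], Any]]],
    is_acceptable: Callable[[Any, Callable[[Any], Any]], bool],
) -> Any:
    """Interaction is allowed, but only influences that the membrane deems
    acceptable for this particular patient are applied."""
    for t in range(compute_budget):
        for influence in inbox(t):
            # Deciding which influences are acceptable is the hard,
            # patient-specific question the membranes/boundaries frame asks.
            if is_acceptable(state, influence):
                state = influence(state)
        state = step(state)
    return state

# Trivial stand-in dynamics, just to show the closed variant runs:
print(run_closed(lambda s: s + 1, state=0, compute_budget=10))  # -> 10
```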
Formulating the kinds of external influences that are appropriate for a particular patient-in-environment is exactly the topic of membranes/boundaries; this task can be taken as the defining desideratum for the topic. Specifically, it is the question of which environments can be put in contact with a particular membrane without corrupting it, which is why I think membranes are relevant to pseudokindness. Naturality of the membranes/boundaries abstraction is linked to naturality of the pseudokindness abstraction.
In contrast, the language of preferences/optimization seems to be the wrong frame for formulating pseudokindness: it wants to discuss ways of intervening and influencing, of not leaving value on the table, rather than ways of offering acceptable options that avoid manipulation. It might be possible to translate pseudokindness back into the language of preferences, but this translation would induce a kind of deontological prior on preferences that makes the more probable preferences look rather surprising/unnatural from a more preferences-first point of view.
Thanks for writing this. I also think what we want from pseudokindness is captured by membranes/boundaries.