Might write a longer reply at some point, but the reason why I don’t expect “kindness” in AIs (as you define it here) is that I don’t expect “kindness” to be the kind of concept that is robust to cosmic levels of optimization pressure applied to it; I expect it will instead come apart when you apply various reflective principles and eliminate any status-quo bias, even if it exists in an AI mind (and I also think it is quite plausible that it is completely absent).
Like, different versions of kindness might or might not put almost all of their considerateness on all the different types of minds that could hypothetically exist, instead of the minds that currently exist. Indeed, I expect it’s more likely than not that I myself will end up in that moral equilibrium, and won’t be interested in extending any special consideration to systems that happened to have been alive in 2022, instead of the systems that could have been alive and seem cooler to me to extend consideration towards.
Another way to say the same thing is that if AI extends consideration towards something human-like, I expect that it will use some superstimuli-human-ideal as a reference point, which will be a much more ideal thing to be kind towards than current humans by its own lights (for an LLM this might be cognitive processes much more optimized for producing internet text than current humans, though that is really very speculative, and is more trying to illustrate the core idea of a superstimuli-human). I currently think few superstimuli-humans like this would still qualify by my lights to count as “human” (though they might by the lights of the AI).
I do find the game-theoretic and acausal trade case against AI killing literally everyone stronger, though it does depend on the chance of us solving alignment in the first place, and so feels a bit recursive in these conversations (like, in order for us to be able to negotiate with the AIs, there needs to be some chance we end up in control of the cosmic endowment in the first place, otherwise we don’t have anything to bargain with).
Humans might respect the preferences of weak agents right now, but if they thought about it for longer they’d pretty robustly just want to completely destroy the existing agents (including a hypothetical alien creator) and replace them with something better. No reason to honor that kind of arbitrary path dependence.
Is this a fair summary?

If so, it seems like you wouldn’t be making an argument about AI or aliens at all, but rather an empirical claim about what would happen if humans were to think for a long time (and become more the people we wished to be and so on).
That seems like an important angle that my comment didn’t address at all. I personally don’t believe that humans would collectively stamp out 99% of their kindness to existing agents (in favor of utilitarian optimization) if you gave them enough time to reflect. That sounds like a longer discussion. I also think that if you expressed the argument in this form to a normal person they would be skeptical about the strong claims about human nature (and would be skeptical of doomer expertise on that topic), and so if this ends up being the crux it’s worth being aware of where the conversation goes, and my bottom-line recommendation of more epistemic humility may still be justified.
It’s hard to distinguish human kindness from arguably decision-theoretic reasoning like “our positions could have been reversed, would I want them to do the same to me?” but I don’t think the distinction between kindness and common-sense morality and decision theory is particularly important here except insofar as we want to avoid double-counting.
(This does call to mind another important argument that I didn’t discuss in my original comment: “kindness is primarily a product of moral norms produced by cultural accumulation and domestication, and there will be no analogous process amongst AI systems.” I have the same reaction as to the evolutionary psychology explanations. Evidently the resulting kindness extends beyond the actual participants in that cultural process, so I think you need to be making more detailed guesses about minds and culture and so on to have a strong a priori view between AI and humans.)
Humans might respect the preferences of weak agents right now, but if they thought about it for longer they’d pretty robustly just want to completely destroy the existing agents (including a hypothetical alien creator) and replace them with something better. No reason to honor that kind of arbitrary path dependence.
No, this doesn’t feel accurate. What I am saying is more something like:
The way humans think about the question of “preferences for weak agents” and “kindness” feels like the kind of thing that will come apart under extreme optimization, in a similar way to how I expect the idea of “having a continuous stream of consciousness with a good past and good future is important” to come apart as humans can make copies of themselves and change their memories, and instantiate slightly changed versions of themselves, etc.
The way this comes apart seems very chaotic to me, and dependent enough on the exact metaethical and cultural and environmental starting conditions that I wouldn’t be that surprised if I disagree even with other humans on their resulting conceptualization of “kindness” (and e.g. one endpoint might be that I end up not having a special preference for currently-alive beings, but there are thousands, maybe millions of ways for this concept to fray apart under optimization pressure).
In other words, I think it’s plausible that at something like human level of capabilities and within a roughly human ontology (which AIs might at least partially share, though how much is quite uncertain to me), the concept of kindness as assigning value to the extrapolated preferences of beings that currently exist might be a thing that an AI could share. But I expect it to not hold up under reflection, and much greater power, and predictable ontological changes (that I expect any AI to go through as it reaches superintelligence), so that the resulting reflectively stable and optimized idea of kindness will not meaningfully result in current humans’ genuine preferences being fulfilled (by my own lights of what it means to extrapolate and fulfill someone’s preferences). The space of possibilities in which this concept could fray apart seems quite large, and many of the endpoints are unlikely to align with my endpoints of this concept.
Edit (some more thoughts): The thing you said feels related, in that my own pretty huge uncertainty about how I will relate to kindness on reflection is evidence that iterating on that concept will be quite chaotic and different for different minds.
I do want to push back on “in favor of utilitarian optimization”. That is not what I am saying, or at least it feels somewhat misleading.
I am saying that I think it’s pretty likely that upon reflection I would no longer think that my “kindness” goals are meaningfully achieved by caring about the beings alive in 2022, and that it would be more kind, by my own lights, to not give special consideration to beings who happened to be alive right now. This isn’t about “trading off kindness in favor of utilitarian optimization”. It’s saying that when you point towards the thing in me that generates an instinct towards kindness, I can imagine that, as I more fully realize what that instinct cashes out to in terms of preferences, it will not result in actually giving consideration to e.g. rats that are currently alive, or will instead give consideration to some archetype of a rat that is actually not really that much like a rat, because I don’t even really know what it means for a rat to want something; and similarly, the way the AI relates to the question of “do humans want things” will feel similarly underdetermined (and again, these are just concrete examples of how the concept could come apart, not an attempt at an exhaustive list of ways it could fall apart).
I think some of the confusion here comes from my using “kind” to refer to “respecting the preferences of existing weak agents,” I don’t have a better handle but could have just used a made up word.
I don’t quite understand your objection to my summary—it seems like you are saying that notions like “kindness” (that might currently lead you to respect the preferences of existing agents) will come apart and change in unpredictable ways as agents deliberate. The result is that smart minds will predictably stop respecting the preferences of existing agents, up to and including killing them all to replace them with something that more efficiently satisfies other values (including whatever kind of form “kindness” may end up taking, e.g. kindness towards all the possible minds who otherwise won’t get to exist).
I called this utilitarian optimization but it might have been more charitable to call it “impartial” optimization. Impartiality between the existing creatures and the not-yet-created creatures seems like one of the key characteristics of utilitarianism while being very rare in the broader world. It’s also “utilitarian” in the sense that it’s willing to spare nothing (or at least not 1/trillion) for the existing creatures, and this kind of maximizing stance is also one of the big defining features of utilitarianism. So I do still feel like “utilitarian” is an OK way of pointing at the basic difference between where you expect intelligent minds will end up vs how normal people think about concepts like being nice.
I think some of the confusion here comes from my using “kind” to refer to “respecting the preferences of existing weak agents,” I don’t have a better handle but could have just used a made up word.
Yeah, sorry, I noticed the same thing a few minutes ago, that I was probably at least somewhat misled by the more standard meaning of kindness.
Tabooing “kindness”, I am saying something like:
Yes, I don’t think extrapolated current humans assign approximately any value to the exact preference of “respecting the preferences of existing weak agents” and I don’t really believe that you would on-reflection endorse that preference either.
Separately (though relatedly), each word in that sentence sure feels like the kind of thing that I do not feel comfortable leaning on heavily as I optimize strongly against it, and that hides a ton of implicit assumptions, like ‘agent’ being a meaningful concept in the first place, or ‘existing’ or ‘weak’ or ‘preferences’, all of which I expect I would think are probably terribly confused concepts to use after I had understood the real concepts that carve reality more at its joints, and this means this sentence sounds deceptively simple or robust, but really doesn’t feel like the kind of thing whose meaning will stay simple as an AI does more conceptual refinement.
I called this utilitarian optimization but it might have been more charitable to call it “impartial” optimization. Impartiality between the existing creatures and the not-yet-created creatures seems like one of the key characteristics of utilitarianism while being very rare in the broader world. It’s also “utilitarian” in the sense that it’s willing to spare nothing (or at least not 1/trillion) for the existing creatures, and this kind of maximizing stance is also one of the big defining features of utilitarianism. So I do still feel like “utilitarian” is an OK way of pointing at the basic difference between where you expect intelligent minds will end up vs how normal people think about concepts like being nice.
The reason why I objected to this characterization is that I was trying to point at a more general thing than the “impartialness”. Like, to paraphrase what this sentence sounds like to me, it’s more as if someone from a pre-modern era were arguing about future civilizations and said “It’s weird that your conception of future humans has them willing to do nothing for the gods that live in the sky, and the spirits that make our plants grow”.
Like, after a bunch of ontological reflection and empirical data gathering, “gods” is just really not a good abstraction for things I care about anymore. I don’t think “impartiality” is what is causing me to not care about gods; it’s just that the concept of “gods” seems fake and doesn’t carve reality at its joints anymore. It’s also not the case that I don’t care at all about ancient gods anymore (they are pretty cool and I like the aesthetic), but the way I care about them is very different from how I care about other humans.
Not caring about gods doesn’t feel “harsh” or “utilitarian” or in some sense like I have decided to abandon any part of my values. This is what I expect it to feel like for a future human to look back at our meta-preferences for many types of other beings, and also what it feels like for AIs that maybe have some initial version of ‘caring about others’ when they are at similar capability levels to humans.
This again isn’t capturing my objection perfectly, but maybe helps point to it better.
Yes, I don’t think extrapolated current humans assign approximately any value to the exact preference of “respecting the preferences of existing weak agents” and I don’t really believe that you would on-reflection endorse that preference either.
I am quite confident that I do, and it tends to infuriate my friends, who get cranky that I feel a moral obligation to respect the artistic intent of bacterial genomes: all bacteria should go vegan, yet survive, and eat food equivalent to their previous diet.
Separately (though relatedly), each word in that sentence sure feels like the kind of thing that I do not feel comfortable leaning on heavily as I optimize strongly against it, and that hides a ton of implicit assumptions,
I feel pretty uncertain of what assumptions are hiding in your “optimize strongly against X” statements. Historically this just seems hard to tease out, and I wouldn’t be surprised if I were just totally misreading you here.

That said, your writing makes me wonder “where is the heavy optimization [over the value definitions] coming from?”, since I think the preference-shards themselves are the things steering the optimization power. For example, the shards are not optimizing over themselves to find adversarial examples to themselves. Related statements:
I think that a realistic “respecting preferences of weak agents”-shard doesn’t bid for plans which maximally activate the “respect preferences of weak agents” internal evaluation metric, or even do some tight bounded approximation thereof.
A “respect weak preferences” shard might also guide the AI’s value and ontology reformation process.
A nice person isn’t being maximally nice, nor do they wish to be; they are nicely being nice.
I do agree (insofar as I understand you enough to agree) that we should worry about some “strong optimization over the AI’s concepts, later in AI developmental timeline.” But I think different kinds of “heavy optimization” lead to different kinds of alignment concerns.
When I try to interpret your points here, I come to the conclusion that you think humans, upon reflection, would cause human extinction (in favor of resources being used for something else).
Or at least that many/most humans would, upon reflection, prefer resources to be used for purposes other than preserving human life (including not preserving human life in simulation). And this holds even if (some of) the existing humans ‘want’ to be preserved (at least according to a conventional notion of preferences).
I think this empirical view seems pretty implausible.
That said, I think it’s quite plausible that upon reflection, I’d want to ‘wink out’ any existing copies of myself in favor of using resources for better things. But this is partially because I personally (in my current state) would endorse such a thing: if my extrapolated volition thought it would be better to not exist (in favor of other resource usage), my current self would accept that. And, I think it currently seems unlikely that upon reflection, I’d want to end all human lives (in particular, I think I probably would want to keep humans alive who had preferences against non-existence). This applies regardless of trade; it’s important to note this to avoid a ‘perpetual motion machine’ type argument.
Beyond this, I think that most or many humans or aliens would, upon reflection, want to preserve currently existing humans or aliens who had a preference against non-existence. (Again, regardless of trade.)
Additionally, I think it’s quite plausible that most or many humans or aliens will enact various trades or precommitments prior to reflecting (which is probably ill-advised, but it will happen regardless). So current preferences which aren’t stable under reflection might have a significant influence overall.
This feels like it is not really understanding my point, though maybe best to move this to some higher-bandwidth medium if the point is that hard to get across.
Giving it one last try: What I am saying is that I don’t think “conventional notion of preferences” is a particularly well-defined concept, and neither are a lot of other concepts you are using in order to make your predictions here. What it means to care about the preferences of others is a thing with a lot of really messy details that tend to blow up in different ways when you think harder about it and are less anchored on the status-quo.
I don’t think you currently know in what ways you would care about the preferences of others after a lot of reflection (barring game-theoretic considerations, which I think we can figure out a bit more in advance, but I am bracketing that whole angle in this discussion, though I totally agree those are important and relevant). I do think you will of course endorse the way you care about other people’s preferences after you’ve done a lot of reflection (otherwise something went wrong in your reflection process), but I don’t think you would endorse what AIs would do, and my guess is you also wouldn’t endorse what a lot of other humans would do when they undergo reflection here.
Like, what I am saying is that while there might be a relatively broad basin of conditions that give rise to something that locally looks like caring about other beings, the space of caring about other beings is deep and wide, and if you have an AI that cares about other beings’ preferences in some way you don’t endorse, this doesn’t actually get you anything. And I think the arguments that the concept of “caring about others” that an AI might have (though my best guess is that it won’t even have anything that is locally well-described by that) will hold up after a lot of reflection seem much weaker to me than the arguments that it will have that preference at roughly human capability and ethical-reflection levels (which seems plausible to me, though still overall unlikely).
The zeroth approximation of pseudokindness is strict nonintervention: reifying the patient-in-environment as a closed computation and letting it run indefinitely, with some allocation of compute. Interaction with the outside world creates vulnerability to external influence, but then again so does incautious closed computation, as we currently observe with AI x-risk, which is not something beamed in from outer space.
Formulating the kinds of external influences that are appropriate for a particular patient-in-environment is exactly the topic of membranes/boundaries; this task can be taken as the defining desideratum for the topic. Specifically, it is the question of which environments can be put in contact with a particular membrane without corrupting it, which is why I think membranes are relevant to pseudokindness. Naturality of the membranes/boundaries abstraction is linked to naturality of the pseudokindness abstraction.
In contrast, the language of preferences/optimization seems to be the wrong frame for formulating pseudokindness: it wants to discuss ways of intervening and influencing, of not leaving value on the table, rather than ways of offering acceptable options that avoid manipulation. It might be possible to translate pseudokindness back into the language of preferences, but this translation would induce a kind of deontological prior on preferences that makes the more probable preferences look rather surprising/unnatural from a more preferences-first point of view.
Thanks for writing this. I also think what we want from pseudokindness is captured by membranes/boundaries.
If the result of an optimization process will be predictably horrifying to the agents which are applying that optimization process to themselves, then they will simply not do so.
In other words: AIs which feel anything in the vicinity of kindness before applying cosmic amounts of optimization pressure to themselves will try to steer that optimization pressure towards something which is recognizably kind at the end.
And I don’t think there’s any good argument for why AIs will lack any scrap of kindness with very high confidence at the point where they’re just starting to recursively self-improve.
Meta: I feel pretty annoyed by the phenomenon of which this current conversation is an instance, because when people keep saying things that I strongly disagree with which will be taken as representing a movement that I’m associated with, the high-integrity (and possibly also strategically optimal) thing to do is to publicly repudiate those claims*, which seems like a bad outcome for everyone. I model it as an epistemic prisoner’s dilemma with the following squares:
D, D: doomers talk a lot about “everyone dies with >90% confidence”, non-doomers publicly repudiate those arguments
C, D: doomers talk a lot about “everyone dies with >90% confidence”, non-doomers let those arguments become the public face of AI alignment despite strongly disagreeing with them
D, C: doomers apply higher epistemic standards on this issue (from the perspective of non-doomers); non-doomers keep applying pressure to doomers to “sanitize” even more aspects of their communication
C, C: doomers apply higher epistemic standards on this issue (from the perspective of non-doomers); non-doomers support doomers making their arguments
I model us as being in the C, D square and I would like to move to the C, C square so I don’t have to spend my time arguing about epistemic standards or repudiating arguments from people who are also trying to prevent AI x-risk. I expect that this is basically the same point that Paul is making when he says “if we can’t get on the same page about our predictions I’m at least aiming to get folks to stop arguing so confidently for death given takeover”.
I expect that you’re worried about ending up in the D, C square, so in order to mitigate that concern I’m open to making trades on other issues where doomers and non-doomers disagree; I expect you’d know better than I do what trades would be valuable for you here. (One example of me making such a trade in the past was including a week on agent foundations in the AGISF curriculum despite inside-view not thinking it was a good thing to spend time on.) For example, I am open to being louder in other cases where we both agree that someone else is making a bad argument (but which don’t currently meet my threshold for “the high-integrity thing is to make a public statement repudiating that argument”).
* my intuition here is based on the idea that not repudiating those claims is implicitly committing a multi-person motte and bailey (but I can’t find the link to the post which outlines that idea). I expect you (Habryka) agree with this point in the abstract because of previous cases where you regretted not repudiating things that leading EAs were saying, although I presume that you think this case is disanalogous.
Meta: I feel pretty annoyed by the phenomenon of which this current conversation is an instance, because when people keep saying things that I strongly disagree with which will be taken as representing a movement that I’m associated with, the high-integrity (and possibly also strategically optimal) thing to do is to publicly repudiate those claims*, which seems like a bad outcome for everyone.
For what it’s worth, I think you should just say that you disagree with it? I don’t really understand why this would be a “bad outcome for everyone”. Just list out the parts you agree on, and list the parts you disagree on. Coalitions should mostly be based on epistemological principles and ethical principles anyways, not object-level conclusions, so at least in my model of the world repudiating my statements if you disagree with them is exactly what I want my allies to do.
If you on the other hand think the kind of errors you are seeing are evidence about some kind of deeper epistemological problems, or ethical problems, such that you no longer want to be in an actual coalition with the relevant people (or think that the costs of being perceived to be in some trade-coalition with them would outweigh the benefits of actually being in that coalition), I think it makes sense to socially distance yourself from the relevant people, though I think your public statements should mostly just accurately reflect how much you are indeed deferring to individuals, how much trust you are putting into them, how much you are engaging in reputation-trades with them, etc.
When I say “repudiate” I mean a combination of publicly disagreeing + distancing. I presume you agree that this is suboptimal for both of us, and my comment above is an attempt to find a trade that avoids this suboptimal outcome.
Note that I’m fine to be in coalitions with people when I think their epistemologies have problems, as long as their strategies are not sensitively dependent on those problems. (E.g. presumably some of the signatories of the recent CAIS statement are theists, and I’m fine with that as long as they don’t start making arguments that AI safety is important because of theism.) So my request is that you make your strategies less sensitively dependent on the parts of your epistemology that I have problems with (and I’m open to doing the same the other way around in exchange).
If the result of an optimization process will be predictably horrifying to the agents which are applying that optimization process to themselves, then they will simply not do so.
In other words: AIs which feel anything in the vicinity of kindness before applying cosmic amounts of optimization pressure to themselves will try to steer that optimization pressure towards something which is recognizably kind at the end.
And I don’t think there’s any good argument for why AIs will lack any scrap of kindness with very high confidence at the point where they’re just starting to recursively self-improve.
This feels like it somewhat misunderstands my point. I don’t expect the reflection process I will go through to feel predictably horrifying from the inside. But I do expect the reflection process the AI will go through to feel horrifying to me (because the AI does not share all my metaethical assumptions, and preferences over reflection, and environmental circumstances, and principles by which I trade off values between different parts of me).
This feels like a pretty common experience. Many people in EA seem to quite deeply endorse various things like hedonic utilitarianism, in a way where the reflection process that led them to that opinion feels deeply horrifying to me. Of course it didn’t feel deeply horrifying to them (or at least it didn’t on the dimensions that were relevant to their process of meta-ethical reflection), otherwise they wouldn’t have done it.
a much more ideal thing to be kind towards than current humans
Relevant sense of kindness is towards things that happen to already exist, because they already exist. Not filling some fraction of the universe with expression-of-kindness, brought into existence de novo, that’s a different thing.