Throwing in a perspective, as someone who has been a reviewer on most of the shard theory posts & is generally on board with them. I agree with your headline claim that “niceness is unnatural” in the sense that niceness/friendliness will not just happen by default, but not in the sense that it has no attractor basin whatsoever, or that it is incoherent altogether (which I don’t take you to be saying). A few comments on the four propositions:
There are lots of ways to do the work that niceness/kindness/compassion did in our ancestral environment, without being nice/kind/compassionate.

Yes! Re-capitulating those selection pressures (the ones that happened to have led to niceness-supporting reward circuitry & inductive biases in our case) indeed seems like a doomed plan. There are many ways for that optimization process to shake out, nearly all of them ruinous. It is also unnecessary. Reverse-engineering the machinery underlying social instincts doesn't require us to redo the evolutionary search process that produced them, nor is that the way I think we will probably develop the relevant AI systems.
The specific way that the niceness/kindness/compassion cluster shook out in us is highly detailed, and very contingent on the specifics of our ancestral environment (as factored through its effect on our genome) and our cognitive framework (calorie-constrained massively-parallel slow-firing neurons built according to DNA), and filling out those details differently likely results in something that is not relevantly “nice”.
Similar to the above, I agree that the particular form of niceness in humans developed because of “specifics of our ancestral environment”, but note that the effects of those contingencies are pretty much screened off by the actual design of human minds. If we really wanted to replicate that niceness, I think we could do so without reference to DNA or calorie constraints or firing speeds, using the same toolbox as we already use in designing artificial neural networks & cognitive architectures for other purposes. That being said, I don’t think “everyday niceness circa 2022” is the right kind of cognition to be targeting, so I don’t worry too much about the contingent details of that particular object, whereas I worry a lot about getting something that terminally cares about other agents at all, which seems to me like one of the hard parts of the problem.
Relatedly, but more specifically: empathy (and other critical parts of the human variant of niceness) seem(s) critically dependent on quirks in the human architecture. More generally, there are lots of different ways for the AI’s mind to work differently from how you hope it works.
If empathy or niceness or altruism—or whatever other human-compatible cognition we need the AI’s mind to contain—depends critically on some particular architectural choice like “modeling others with the same circuits as the ones with which you model yourself”, then… that’s the name of the game, right? Those are the design constraints that we have to work under. Separately, I also believe we will end up making design choices like that one, because (1) the near-term trajectory of AI research points in that general direction and (2) as you note, they are easy shortcuts (ML always takes easy shortcuts). I do not expect those views to be widely shared, though.
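To make this concrete, here is a minimal toy sketch (my own illustration, not something from the post or from any existing shard-theory work; the module and all of its names are hypothetical) of the kind of design choice I mean: one shared network body is used both to predict the AI’s own next action and to predict another agent’s next action, with only a small identity embedding telling the two apart.

    import torch
    import torch.nn as nn

    class SharedAgentModel(nn.Module):
        """Toy 'model others with the circuits you model yourself with' setup."""
        def __init__(self, obs_dim=16, embed_dim=8, n_actions=4):
            super().__init__()
            # One embedding slot for "self", one for "other"; everything downstream is shared.
            self.identity = nn.Embedding(2, embed_dim)
            self.core = nn.Sequential(
                nn.Linear(obs_dim + embed_dim, 64),
                nn.ReLU(),
                nn.Linear(64, n_actions),
            )

        def forward(self, obs, agent_idx):
            ident = self.identity(agent_idx)
            return self.core(torch.cat([obs, ident], dim=-1))

    model = SharedAgentModel()
    obs = torch.randn(1, 16)
    self_logits = model(obs, torch.tensor([0]))    # "what will I do here?"
    other_logits = model(obs, torch.tensor([1]))   # "what will they do here?" (same weights)

The point is just that this flavor of weight sharing is an ordinary, cheap design decision of the sort ML already reaches for, which is part of why I expect choices in this family to get made whether or not anyone is aiming at empathy.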
The desirable properties likely get shredded under reflection. Once the AI is in the business of noticing and resolving conflicts and inefficiencies within itself (as is liable to happen when its goals are ad-hoc internalized correlates of some training objective), the way that its objectives ultimately shake out is quite sensitive to the specifics of its resolution strategies.
Maybe? It seems plausible to me that, if an agent already terminally values altruism and endorses that valuing, then as it attempts to resolve the remaining conflicts within itself, it will try to avoid resolutions that foreseeably-to-it remove or corrupt its altruism-value. It sounds like you are thinking specifically about the period after the AI has internalized the value somewhat, but before the AI reflectively endorses it? If so, then yes, I agree: ensuring that a particular value hooks into the reflective process well enough to make itself permanent is likely nontrivial. This is what I believe TurnTrout was pointing at in “A shot at the diamond alignment problem”, in the major open questions list:
4. How do we ensure that the diamond shard generalizes and interfaces with the agent’s self-model so as to prevent itself from being removed by other shards?
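To gesture at what “hooks into the reflective process” could cash out to mechanically, here is a toy Python sketch (my own illustration, not TurnTrout’s proposal or anyone’s actual design; every name in it is hypothetical). The agent only accepts candidate conflict-resolutions whose predicted effect, according to its own self-model, does not weaken a value it reflectively endorses:

    from dataclasses import dataclass

    @dataclass
    class Value:
        name: str
        weight: float
        endorsed: bool  # does the agent reflectively endorse keeping this value?

    def resolve_conflicts(values, candidate_resolutions, predict_effect):
        """Apply only resolutions that don't foreseeably weaken an endorsed value.

        predict_effect(resolution, current) -> {value name: predicted new weight};
        it stands in for the agent's own (imperfect) self-model.
        """
        current = {v.name: v for v in values}
        for resolution in candidate_resolutions:
            predicted = predict_effect(resolution, current)
            vetoed = any(
                v.endorsed and predicted.get(v.name, v.weight) < v.weight
                for v in current.values()
            )
            if not vetoed:
                for name, new_weight in predicted.items():
                    if name in current:
                        current[name].weight = new_weight
        return list(current.values())

    values = [Value("altruism", 1.0, endorsed=True),
              Value("approval-seeking", 0.5, endorsed=False)]
    # A resolution that would zero out altruism gets vetoed; one that only trims
    # the unendorsed value goes through.
    resolutions = [{"altruism": 0.0}, {"approval-seeking": 0.2}]
    print(resolve_conflicts(values, resolutions, lambda r, cur: r))

Obviously no learned reflective process will literally look like this veto loop; the point is only that “the value participates in vetting changes to itself” is a coherent mechanical property one could aim for, and the open question above is about getting the learned analogue of it.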