In other words, the fact that prosocial algorithms in brains recursively accommodate the fact that prosocial algorithms in brains are sometimes suboptimal (e.g., the capacity to think thoughts like “loving one’s neighbor is a double-edged sword”) is, I claim, a highly desirable property of prosocial brain algorithms. This ability to critically inspect one’s own values may well be the most important prosocial algorithm to pin down!
I think this is probably a convergent result of shard ecosystems doing reflection and planning on themselves, and less a set of human-specific algorithms (although maybe you didn’t mean to claim that). That is, I think that if you get the “base shards” ~right (i.e. they care about people in a variety of ways which I’m going to handwave for now, because I don’t know them more precisely yet), then the base shards will tend to end up doing value reflection. Another way of stating this position (originally argued to me by Quintin) is: “moral philosophy is weakly convergent in the type of its process (of reflecting on imperfections), but not in its reflective equilibria (i.e. the actual values which get settled on).”
Agree with the broader point of:
In essence, the ‘human moral reasoning ain’t all that’ critique ignores that human moral reasoning is itself responsible for generating this critique!

Also, see Quintin’s comment about how variance in human altruism is a good portent for alignment.
Thanks for the comment! I do think that, at present, the only working example we have of an agent able to explicitly self-inspect its own values is the human case, even if getting the base shards ‘right’ in the prosocial sense would likely entail that they already do self-reflection. Am I misunderstanding your point here?