I would be the last person to dismiss the potential relevance that understanding value formation and management in the human brain might have for AI alignment research, but I think there are good reasons to assume that the solutions our evolution has produced are complex and not sufficiently robust.
Humans are [Mesa-Optimizers](https://www.alignmentforum.org/tag/mesa-optimization), and the evidence is solid that, as a consequence, our alignment with the implicit underlying utility function (reproductive fitness) is rather brittle (e.g. sex with contraceptives and opiate abuse are examples of such “failure points”).
As others have expressed here before me, I would also argue that human alignment only has to perform in a very narrow environment, one shared with many very similar agents that are all on (roughly) the same power level. The solutions human evolution has produced to ensure human semi-alignment are therefore, to a significant degree, not purely neurological but also social.
Whatever these solutions are, we should not expect them to generalize well or to remain reliable in a very different environment, such as one containing an intelligent actor with an absolute power monopoly.
This suggests that researching the human mind alone would not yield a technology robust enough to use when we have exactly one shot at getting it right. We need solutions to the aforementioned abstractions and toy models, because we probably should try to find a way to build a system that is theoretically safe, not just “probably safe in a narrow environment”.