I think “the value alignment problem” does not currently have a universally acknowledged, precise definition, and a lot of the work currently being done is aimed at getting less confused about what is meant by it.
From what I see, in your proof you start from a particular meaning of this term and then go on to show that it is impossible.
Which means that human values, or at least the individual non-morality-based values don’t converge, which means that you can’t design an artificial superintelligence that contains a term for all human values
Here you observe that if “the value alignment problem” means constructing something which holds the values of all humans at the same time, it is impossible, because there exist humans with contradictory values. So you propose the new definition “to construct something with all human moral values”. You then observe that the four moral values you give are also contradictory, so this is impossible as well.
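To make the contradiction concrete, here is a toy sketch (the outcomes and utility numbers are invented purely for illustration): if two people’s preferences over the same set of outcomes are exactly opposed, no single choice can rank first for both, so nothing can “contain a term for” both value sets at once.

```python
# Toy illustration (hypothetical preferences): two humans with exactly opposed
# preferences over the same outcomes. Whatever outcome an AI picks to satisfy
# one of them is the worst possible outcome for the other.

outcomes = ["A", "B", "C"]

def utility_human_1(outcome):
    return {"A": 1.0, "B": 0.0, "C": -1.0}[outcome]

def utility_human_2(outcome):
    # Directly contradictory values: u2 = -u1.
    return -utility_human_1(outcome)

best_for_1 = max(outcomes, key=utility_human_1)
best_for_2 = max(outcomes, key=utility_human_2)
assert best_for_1 != best_for_2  # no outcome is ranked first by both humans
```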
And even if somehow you could program an intelligence to optimize for those four competing utility functions at the same time,
So now we are looking at the definition “to program for the four different utility functions at the same time”. As has been observed in another comment, this is somewhat underspecified, and there are different ways to interpret and implement it. For one such way, you predict
that would just cause it to optimize for conflict resolution, and then it would just tile the universe with tiny artificial conflicts between artificial agents for it to resolve as quickly and efficiently as possible without letting those agents do anything themselves.
It seems to me that the scenario behind this course of events would be: we build an AI and give it the four moralities; noticing their internal contradictions, it analyzes them and finds that they serve the purpose of conflict resolution. It then makes this its new, consistent goal and builds those tiny conflict scenarios. I’m not saying this is implausible, but I don’t think it is a course of events without alternatives (and these would depend on the way the AI is built to resolve conflicting goals).
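To gesture at why the way conflicting goals are resolved matters, here is a toy sketch (all utility numbers are made up, and the two aggregation rules are just examples): the same four made-up utility functions recommend different actions depending on how they are combined, which is exactly the underspecification mentioned above.

```python
# Toy sketch: different aggregation rules over the same (made-up) scores under
# four moral utility functions recommend different actions, so the behavior of
# "optimize the four moralities at the same time" depends on how conflicts are
# resolved.

scores = {
    "action_1": [0.9, 0.9, 0.9, 0.1],
    "action_2": [0.6, 0.6, 0.6, 0.6],
    "action_3": [0.2, 0.8, 0.3, 0.7],
}

def total_utility(utilities):
    return sum(utilities)   # simple sum across the four moralities

def maximin(utilities):
    return min(utilities)   # protect the worst-satisfied morality

def choose(aggregate):
    return max(scores, key=lambda action: aggregate(scores[action]))

print(choose(total_utility))  # -> action_1 (total 2.8)
print(choose(maximin))        # -> action_2 (worst case 0.6)
```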
To summarize: I think that, out of the possible specifications of “the value alignment problem”, you picked three (all human values, all human moral values, “optimizing the four moralities”) and showed that the first two are impossible and that the third leads to undesired consequences (under some further assumptions).
However, I think there are many things which people would consider a solution to “the value alignment problem” and which don’t satisfy any of these three descriptions. Maybe there is a subset of human values without contradiction, such that most people would be reasonably happy with the result of a superhuman AI optimizing those values. Maybe the result of an AI maximizing only the “Maximize Flourishing” morality would lead to a decent future. I would be the first to admit that the scenarios I describe are themselves severely underspecified, just vaguely waving at a subset of the possibility space, but I imagine that these subsets could contain things we would call “a solution to the value alignment problem”.
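Purely to gesture at what “a subset of human values without contradiction” could mean operationally, here is a hypothetical sketch: the value names and the compatibility relation are entirely stipulated, and the greedy procedure is just one (order-dependent) way to collect a mutually compatible subset.

```python
# Hypothetical sketch of one reading of "a subset of human values without
# contradiction": given a stipulated pairwise-compatibility test, greedily
# collect a mutually compatible subset. The value names and the incompatible
# pairs are invented for illustration only.

values = ["autonomy", "flourishing", "fairness", "purity", "obedience"]

# Stipulated incompatible pairs, standing in for genuinely contradictory values.
incompatible = {("autonomy", "obedience"), ("fairness", "purity")}

def compatible(a, b):
    return (a, b) not in incompatible and (b, a) not in incompatible

def greedy_consistent_subset(candidates):
    chosen = []
    for value in candidates:
        if all(compatible(value, c) for c in chosen):
            chosen.append(value)
    return chosen

print(greedy_consistent_subset(values))
# -> ['autonomy', 'flourishing', 'fairness'] under the stipulated relation
```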
Except that for humans, life is a journey, not a destination. If you make a maximize-flourishing optimizer, you would need to rigorously define what you mean by flourishing, which requires a rigorous definition of a general human utility function, which doesn’t and cannot exist. Human values are instrumental all the way down; some values are just more instrumental than others. That is the mechanism which allows human values to be over 4d experiences rather than 3d states. I mean, what other mechanism could result in that for a human mind? This is a natural implication of “adaptation executors, not fitness maximizers”.
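To gesture at the 4d-experiences-versus-3d-states point, here is a toy sketch (the trajectories and numbers are invented): a utility defined only over final states cannot distinguish two histories that end in the same place, while a utility defined over whole trajectories can.

```python
# Toy illustration: a state-based utility cannot tell apart two histories that
# end in the same state, while a trajectory-based utility can. All values are
# made up.

# Two histories that end in the same final state.
journey = ["struggle", "learning", "triumph", "flourishing"]
shortcut = ["flourishing", "flourishing", "flourishing", "flourishing"]

def state_utility(state):
    return 1.0 if state == "flourishing" else 0.0

def utility_of_final_state(trajectory):
    return state_utility(trajectory[-1])

def utility_of_trajectory(trajectory):
    # Cares about the shape of the experience, e.g. rewarding growth over time.
    growth_bonus = 1.0 if trajectory[0] != trajectory[-1] else 0.0
    return state_utility(trajectory[-1]) + growth_bonus

assert utility_of_final_state(journey) == utility_of_final_state(shortcut)
assert utility_of_trajectory(journey) > utility_of_trajectory(shortcut)
```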
And I will note that humans tend to care a lot about their own freedom and self-determination. Basically, the only way for an intelligence to be “friendly” is for it to first solve scarcity and then be inactive most of the time, only waking up to prevent atrocities like murder, torture, or rape, or to deal with the latest existential threat. In other words, not an optimization process at all, because it would have an arbitrary stopping point beyond which it does not itself raise human values any further.
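A rough sketch of the kind of mostly-inactive behavior I have in mind (the situation encoding, severity scores, and threshold are all invented, and this is a gesture at the behavior, not a proposal for implementing it safely): the default action is to do nothing, and the system acts only when a monitored situation crosses a severity threshold.

```python
# Hypothetical sketch of a "mostly inactive" intelligence: its default is to do
# nothing, and it acts only when a situation's severity crosses a threshold.
# Everything here is invented for illustration.

SEVERITY_THRESHOLD = 0.9  # only atrocities / existential threats exceed this

def assess_severity(situation):
    # Stand-in for whatever model scores how bad a situation is, in [0, 1].
    return situation.get("severity", 0.0)

def guardian_policy(situation):
    if assess_severity(situation) >= SEVERITY_THRESHOLD:
        return "intervene"
    return "do_nothing"  # no further optimization of human values

print(guardian_policy({"description": "ordinary day", "severity": 0.1}))       # do_nothing
print(guardian_policy({"description": "imminent atrocity", "severity": 0.97})) # intervene
```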