What do you mean by “understanding of the nature of human values”?
If both aligned AIs are properly reflective and understand science properly, they understand that their respective toolboxes for modelling human values (or even the values of arbitrary black-box intelligent systems) are what they are: just toolboxes and models without special metaphysical status.
They may discuss their respective models of values, but there is no reason for them to be “at war”: both models are presumably well aligned with humans, so their predictions coincide in the vast majority of cases and diverge only in very obscure ones (like the trolley problem and other infamous thought experiments in ethics, which are specifically designed to probe the edges of axiological and ethical models) or when the models are “rolled out” very far into the future. For the latter case, as I also gestured at in the post, I think the “alignment” frame is not useful and we should instead think in terms of control theory, game theory, the theory of evolution, etc. Friendly AIs should understand this and should not even try to simulate the very far future using their value models of people. (And yes, this is why I think the concept of coherent extrapolated volition doesn’t actually make sense.)
Maybe an interesting thing to note here is that if both AIs were aligned to humans independently, let’s say each covering 98% of human value complexity but with different methods, their default mutual alignment on first encounter (if you don’t permit any online re-alignment, as is possible even with LLMs during prompting, though to a limited extent) is expected to be lower, let’s say only 97%. But I don’t see why this should be a problem.
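For intuition, here is a toy numerical sketch of that claim (my own illustrative model, not part of the original argument): if “covering 98% of human value complexity” is modelled as each AI independently covering a random 98% of discrete value dimensions, the fraction covered by both is roughly 0.98 × 0.98 ≈ 96%, so mutual alignment indeed starts out a bit below either AI’s alignment with humans.

```python
import random

# Toy illustration (an assumption for this sketch, not from the comment): treat
# "human value complexity" as N discrete value dimensions, and each AI's
# alignment as an independent random subset of dimensions it gets right (98%).
N = 100_000
COVERAGE = 0.98
random.seed(0)

ai_a = [random.random() < COVERAGE for _ in range(N)]
ai_b = [random.random() < COVERAGE for _ in range(N)]

# "Mutual alignment" here = fraction of dimensions both AIs get right, so they
# agree with each other and with humans; under independence this is ~0.98^2.
both_right = sum(a and b for a, b in zip(ai_a, ai_b)) / N
print(f"each AI's alignment with humans: ~{COVERAGE:.0%}")
print(f"expected mutual coverage: {COVERAGE ** 2:.2%}")  # ≈ 96.04%
print(f"simulated mutual coverage: {both_right:.2%}")
```

Of course, the two AIs may also happen to agree in some of the cases where both miss human values, which is why a number like 97% rather than exactly 96% is plausible; the point is only that independent alignment methods give somewhat lower default mutual alignment.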
I meant that a situation is possible in which two AIs use completely different alignment methods and also arrive at different results.