Given this assumption, the human utility function(s) either do or don’t significantly depend on human evolutionary history. I’m just going to assume they do for now.
There seems to be a missing possibility here that I take fairly seriously, which is that human values depend on (collective) life history. That is, human values are substantially determined by collective life history, and rather than converging to some attractor, this is a path-dependent process. Maybe you can even trace the path taken back to evolutionary history, but it’s substantially mediated by life history.
Under this view, the utility of the future with respect to human values depends substantially on whether people in the future learn to be very sensitive to outcome differences. But “people are sensitive to outcome differences and happy with the outcome” does not seem better to me than “people are insensitive to outcome differences and happy with the outcome”, even though it is higher utility (this is a first impression; I could be persuaded otherwise). By contrast, “people are unhappy with the outcome” does seem worse than “people are happy with the outcome”.
Under this view, I don’t think this follows:
there is some dependence of human values on human evolutionary history, so that a default unaligned AGI would not converge to the same values
My reasoning is that a “default AGI” will have its values contingent on a process which overlaps with the collective life history that determines human values. This is a different situation from values being directly determined by evolutionary history, where the process that determines human values is temporally distant and therefore perhaps more-or-less random from the point of view of the AGI. So there’s a compelling reason to believe in value differences in the “evolutionary history directly determines values” case that’s absent in the “life history determines values” case.
Different values are still totally plausible, of course—I’m objecting to the view that we know they’ll be different.
(Maybe you think this is all an example of humans not really having values, but that doesn’t seem right to me).
I think it’s possible that human values depend on life history too, but that seems to add complexity and make alignment harder. If the effects of life history very much dominate those of evolutionary history, then maybe neglecting evolutionary history would be more acceptable, which would make the problem easier.
But I don’t think a default AGI would be especially path-dependent on human collective life history. Human society changes over time as new cultures supersede old ones (see section on subversion). AGI would be a much bigger shift than these normal societal shifts and so would drift from human culture more rapidly, partly due to its different conceptual ontology. The legacy concepts of humans would be a pretty inefficient system for AGIs to keep using. It would be like how scientists aren’t alchemists anymore, but a bigger shift than that.
(Note: LLMs still rely a lot on human concepts rather than having an independent ontology and agency, so this is more about future AI systems.)
If people now don’t have strong views about exactly what they want the world to look like in 1000 years, but people in 1000 years do have strong views, then I think we should defer to those future people to evaluate the “human utility” of future states. You seem to be suggesting that we should take the views of people today, although I might be misunderstanding.
Edit: Or maybe you’re saying that the AGI trajectory will be ~random from the point of view of the human trajectory due to a different ontology. Maybe, but “different ontology → different conclusions” is less obvious to me than “different data → different conclusions”. If there’s almost no mutual information between the different datasets, then the conclusions have to be different, but sometimes you can come to the same conclusions under different ontologies with data from the same process.
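As a toy sketch of what I have in mind (purely illustrative; the coin example, the estimators, and the numbers are my own assumptions, not anything from the post): two very different ways of carving up data from the same process can still land on the same conclusion, whereas data from a genuinely different process basically can’t.

```python
import random

random.seed(0)

# Hypothetical data-generating process: a coin with bias p_true.
p_true = 0.7
flips = [1 if random.random() < p_true else 0 for _ in range(100_000)]

# "Ontology" A: sees individual flips, concludes p = fraction of heads.
p_hat_a = sum(flips) / len(flips)

# "Ontology" B: never sees individual flips, only the lengths of maximal
# runs of heads. A run of heads has a geometric length with mean 1/(1 - p),
# which gives the estimator p = 1 - 1/(mean run length).
runs, current = [], 0
for f in flips:
    if f:
        current += 1
    elif current:
        runs.append(current)
        current = 0
if current:
    runs.append(current)
p_hat_b = 1 - 1 / (sum(runs) / len(runs))

# Same process, two quite different representations, ~the same conclusion.
print(round(p_hat_a, 3), round(p_hat_b, 3))  # both close to 0.7

# Data from a different process (≈ no mutual information with the first coin)
# instead forces a different conclusion.
other = [1 if random.random() < 0.3 else 0 for _ in range(100_000)]
print(round(sum(other) / len(other), 3))  # close to 0.3
```

The point is just that the representation can differ a lot (raw flips vs. run lengths) while the conclusion about the underlying process agrees, so “different ontology” on its own doesn’t force divergence the way genuinely independent data does.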
To the extent that people now don’t care about the long-term future, there isn’t much to do in terms of long-term alignment. People right now who care about what happens 2000 years from now probably have roughly similar preferences to people 1000 years from now who aren’t significantly biologically changed or cognitively enhanced, because some component of what people care about is biological.
I’m not saying it would be random so much as not very dependent on the original history of the humans used to train early AGI iterations. It would have a different data history, partly because of different measurements, e.g. scientific measuring tools. A different ontology means that value-laden things people might care about, like “having good relationships with other humans”, are not meaningful to future AIs in terms of their world model. They are not something such AIs would care much about by default (they aren’t even modeling the world in those terms), and it would be hard to encode a utility function so that they care about it despite the ontological difference.