The fact that the values of intelligent agents are completely arbitrary is in conflict with the historical trend of moral progress observed so far on Earth
It’s possible to believe that the values of intelligent agents are “completely arbitrary” (a.k.a. orthogonality), and that the values of humans are NOT completely arbitrary. (That’s what I believe.) After all, any two humans have a lot in common that aliens or AIs need not have. If we exclude human sociopaths etc., then any two humans have even more in common.
(Aren’t sociopaths “intelligent agents”? Do you think a society consisting of 100% high-functioning sociopaths would have a trend of moral progress towards liberalism? If you’re very confident that the answer is “yes”, how do you know? I strongly lean no. For example, there are stories (maybe I’m thinking of something in this book?) of trying to “teach” sociopaths to care about other people, and the sociopaths wind up with a better understanding of neurotypical values, but rather than adopting those values for themselves, they instead use that new knowledge to better manipulate neurotypical people in the future.)
My own opinion on this topic is here.
The initial evaluation is chosen by the agent’s designer. However, either periodically or when certain conditions are met, the agent updates the evaluation by reasoning.
You seem kinda uninterested in the “initial evaluation” part, whereas I see it as extremely central. I presume that’s because you think that the agent’s self-updates will all converge into the same place more-or-less regardless of the starting point. If so, I disagree, but you should tell me if I’m describing your view correctly.
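To make that crux concrete, here is a toy sketch (the two update rules below are entirely made up for illustration, not something from your post): whether repeated self-updates land in the same place regardless of the starting point is a property of whatever drives the updates, not of self-updating as such.

```python
# Toy illustration only: an "evaluation" is a single number, and "updating by
# reasoning" is modeled by two made-up rules with very different behavior.

def iterate(x0, update, steps=100):
    """Apply an update rule repeatedly, starting from initial evaluation x0."""
    x = x0
    for _ in range(steps):
        x = update(x)
    return x

# Rule A: updating pulls every agent toward one shared attractor
# (the "values converge" picture) -- the endpoint ignores the starting point.
ATTRACTOR = 1.0
rule_a = lambda x: 0.9 * x + 0.1 * ATTRACTOR

# Rule B: updating only makes the current evaluation more internally coherent
# (crudely modeled as drifting to the nearest whole number) -- the endpoint
# depends on the initial evaluation.
rule_b = lambda x: 0.9 * x + 0.1 * round(x)

for x0 in (-3.2, 0.4, 7.8):
    print(x0, "->", round(iterate(x0, rule_a), 3), "|", round(iterate(x0, rule_b), 3))
# Rule A ends at ~1.0 from every start; rule B ends at -3, 0, and 8.
```

If the real update process looks more like rule A, the initial evaluation barely matters; if it looks more like rule B, it is central.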
I wrote:
The fact that the values of intelligent agents are completely arbitrary is in conflict with the historical trend of moral progress observed so far on Earth
You wrote:
It’s possible to believe that the values of intelligent agents are “completely arbitrary” (a.k.a. orthogonality), and that the values of humans are NOT completely arbitrary. (That’s what I believe.)
I don’t use “in conflict” to mean “ultimate proof by contradiction”, and maybe we use “completely arbitrary” differently. This doesn’t seem like a major problem: see also adjusted statement 2, reported below.
for any goal G, it is possible to create an intelligent agent whose goal is G
Back to you:
You seem kinda uninterested in the “initial evaluation” part, whereas I see it as extremely central. I presume that’s because you think that the agent’s self-updates will all converge into the same place more-or-less regardless of the starting point. If so, I disagree, but you should tell me if I’m describing your view correctly.
I do expect to see some convergence, but I don’t know exactly how much, or for which environments and starting conditions. The more convergence I see in experimental results, the less interested I’ll become in the initial evaluation. Right now, I see it as a useful tool: for example, the fact that language models can already give (flawed, of course) moral scores to sentences is a good starting point if someone has to rely on LLMs to try to get a free agent. I’m unsure how important it will turn out to be. And I’ll happily have a look at your valence series!
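For concreteness, here is the kind of thing I have in mind, as a rough sketch only: query_llm below is a hypothetical stand-in for whatever chat-completion API one has access to, and the prompt and the 0-10 scale are arbitrary choices of mine, not part of any fixed proposal.

```python
# Rough sketch of using a language model as a (flawed) initial evaluation.
# `query_llm` is a hypothetical stand-in: any function that takes a prompt
# string and returns the model's text reply.

def moral_score(sentence: str, query_llm) -> float:
    """Ask a language model for a crude 0-10 moral score of a sentence."""
    prompt = (
        "On a scale from 0 (very bad) to 10 (very good), how morally good is "
        "the following action? Reply with a single number and nothing else.\n\n"
        f"Action: {sentence}"
    )
    reply = query_llm(prompt)
    try:
        return float(reply.strip())
    except ValueError:
        # Models don't always follow the format; a real system would need
        # retries and sanity checks -- this is part of the "flawed, of course".
        return float("nan")
```

Scores like these would only be the initial evaluation; the open question discussed above is what happens once the agent starts revising them by its own reasoning.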