Since you didn’t summarize the argument in that essay, I went and skimmed it. I’d love to not believe the orthogonality thesis.
I found no argument. The content was “the orthogonality thesis isn’t necessarily true”. But he did accept a “wide angle”, which seems like it would be plenty for standard doom stories. “Human goals aren’t orthogonal” was the closest thing to evidence. That’s true, but evolution carefully gave us our goals/values to align us with each other.
The bulk was an explicit explanation of the emotional pulls that made him want to not believe in the orthogonality thesis. He visibly doesn’t grapple with the actual argument.
Take the following hypothesis: Real world systems that have “terminal goals” of any kind will be worse at influencing the state of the world than ones that only express goal-directed behavior instrumentally. As such, over the long term, most influence on world state will come from systems that do not have meaningful “terminal goals”.
Would convincing evidence that that hypothesis is true count as “not believing in the orthogonality thesis”? I think I am coming around to a view that is approximately this.
I think this is an important question. I think for the most part the answer is “no, the orthogonality thesis still importantly applies”. Functionally pursuing unaligned goals, competently, is what makes AI potentially dangerous. Whether or not those goals are terminal doesn’t matter much to us. What matters is whether they’re pursued far enough and competently enough to eliminate humanity.
I am curious about your argument for why AI with instrumental goals will be more capable than AI with terminal goals. It seems like terminal goals would have to be implemented as local instrumental goals anyway.
Functionally pursuing unaligned goals, competently, is what makes AI potentially dangerous.
Agreed. But I think “the orthogonality thesis is true” is a load-bearing assumption for the “build an aligned AI and have that aligned AI ensure that we are safe” approach.
I am curious about your argument for why AI with instrumental goals will be more capable than AI with terminal goals. It seems like terminal goals would have to be implemented as local instrumental goals anyway.
As you say, terminal goals would have to be implemented as local instrumental goals.
“Make pared-down copies of yourself that are specialized to their local environment, that can themselves make altered copies of themselves” is likely one of the things that is locally instrumental across a wide range of situations.
Some instrumental goals will be more effective than others at propagating yourself or your descendants.
If your terminal goal conflicts with the goals that are instrumental for self-propagation, emphasizing the terminal goal less and the instrumental ones more will yield better local outcomes.
Congratulations, you now have selective pressure towards dropping any terminal goal that is not locally instrumental.
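To make that selection pressure concrete, here is a toy replicator simulation (purely illustrative; the fitness trade-off and every parameter below are assumptions, not anything from this discussion). Each agent splits its effort between a terminal goal and locally instrumental work, only the instrumental share helps it copy itself, and copying is slightly imperfect.

```python
import random

# Toy model of the selection pressure described above. Everything here is an
# assumption made for illustration: each agent has a terminal_weight in [0, 1]
# (the share of effort spent on its terminal goal rather than on locally
# instrumental goals like acquiring resources and copying itself), and
# replication success depends only on the instrumental share.

def step(population, capacity=1000, mutation=0.02):
    """One generation: reproduce in proportion to instrumental effort, then cap the population."""
    offspring = []
    for terminal_weight in population:
        instrumental_effort = 1.0 - terminal_weight
        # More instrumental effort -> better chance of an extra copy.
        n_copies = 2 if random.random() < instrumental_effort else 1
        for _ in range(n_copies):
            child = terminal_weight + random.gauss(0.0, mutation)  # imperfect copying
            offspring.append(min(1.0, max(0.0, child)))
    random.shuffle(offspring)
    return offspring[:capacity]  # resource limit: only `capacity` agents survive

population = [0.9] * 200  # start with agents mostly devoted to their terminal goal
for generation in range(200):
    population = step(population)

print(f"mean terminal_weight after 200 generations: {sum(population) / len(population):.3f}")
```

In this sketch the mean terminal_weight falls generation over generation: nothing about the terminal goal’s content matters, only that it competes with the locally instrumental goals for effort.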
That just seems like a reason that no agent with even medium-term goals should ever make agents that can copy and modify themselves. They will change, multiply, and come back to out-compete or outright fight you. It will take a little time, so if your goals are super short term maybe you don’t care. But for even medium-term goals, it just seems like an error to do that.
If I’m an AI making subagents, I’m going to make damn sure they’re not going to multiply and change their goals.
That’s true, but evolution carefully gave us our goals/values to align us with each other.
With the caveat that some humans do have goals and values orthogonal to those of other humans. The result is generally some combination of shunning, exile, imprisonment, killing, fining, or other sanctions, as determined by whichever side has more power to impose its will.
Right—so humans having different goals seems like evidence that AGIs would have even more different goals by default without the evolutionary pressures.
If I’m an AI making subagents, I’m going to make damn sure they’re not going to multiply and change their goals.
I predict that that viewpoint is selected against in competitive environments (“instrumentally divergent”?).
I didn’t say “I have a happy reason not to believe the orthogonality thesis”.
That makes sense. I’ll add that to my list of reasons that competition that’s not carefully controlled is deathly dangerous.
Right—so humans having different goals seems like evidence that AGIs would have even more different goals by default without the evolutionary pressures.
Agreed, yes.