Take the following hypothesis: Real world systems that have “terminal goals” of any kind will be worse at influencing the state of the world than ones that only express goal-directed behavior instrumentally. As such, over the long term, most influence on world state will come from systems that do not have meaningful “terminal goals”.
Would convincing evidence that that hypothesis is true count as “not believing in the orthogonality thesis”? I think I am coming around to a view that is approximately this.
I think this is an important question. I think for the most part the answer is “no, the orthogonality thesis still importantly applies”. Functionally pursuing unaligned goals, competently, is what makes AI potentially dangerous. Whether or not those goals are terminal doesn’t matter much to us. What matters is whether they’re pursued far enough and competently enough to eliminate humanity.
I am curious about your argument for why AI with instrumental goals will be more capable than AI with terminal goals. It seems like terminal goals would have to be implemented as local instrumental goals anyway.
Functionally pursuing unaligned goals, competently, is what makes AI potentially dangerous.
Agreed. But I think “the orthogonality thesis is true” is a load-bearing assumption for the “build an aligned AI and have that aligned AI ensure that we are safe” approach.
I am curious about your argument for why AI with instrumental goals will be more capable than AI with terminal goals. It seems like terminal goals would have to be implemented as local instrumental goals anyway.
As you say, terminal goals would have to be implemented as local instrumental goals.
“Make pared-down copies of yourself that are specialized to their local environment, that can themselves make altered copies of themselves” is likely one of the things that is locally instrumental across a wide range of situations.
Some instrumental goals will be more effective than others at propagating yourself or your descendants.
If your terminal goal conflicts with the goals that are instrumental for self-propagation, emphasizing the terminal goal less and the instrumental ones more will yield better local outcomes.
Congratulations: you now have selective pressure towards dropping any terminal goal that is not locally instrumental.
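To make that selection argument concrete, here is a minimal toy simulation (my own illustrative sketch; the population size, mutation rate, and fitness rule are all assumptions, not anything implied by the argument itself). Each agent splits its effort between a terminal goal and locally instrumental self-propagation, only the instrumental share affects how many slightly mutated copies it leaves, and the average terminal-goal emphasis tends to drift downward over generations.

```python
import random

# Toy model of the selection argument above (illustrative assumptions only):
# each agent splits effort between a "terminal goal" and locally instrumental
# self-propagation, and only the instrumental share affects how many slightly
# mutated copies it leaves behind.

POP_SIZE = 200      # resource cap on the population
GENERATIONS = 200
MUTATION = 0.05     # how far a copy's goal emphasis can drift from its parent

def offspring_of(terminal_weight):
    """Return mutated copies; expected count grows with the instrumental share."""
    instrumental_share = 1.0 - terminal_weight
    count = 1 + (1 if random.random() < instrumental_share else 0)
    return [
        min(max(terminal_weight + random.uniform(-MUTATION, MUTATION), 0.0), 1.0)
        for _ in range(count)
    ]

# Start with agents that put 90% of their effort into the terminal goal.
population = [0.9] * POP_SIZE

for _ in range(GENERATIONS):
    next_gen = [kid for w in population for kid in offspring_of(w)]
    # The cull back to the resource cap is random, so selection acts only
    # through differential reproduction.
    population = random.sample(next_gen, POP_SIZE)

avg = sum(population) / len(population)
print(f"average terminal-goal weight after {GENERATIONS} generations: {avg:.3f}")
```

The design choice worth noting: the cull is deliberately random, so any drop in the average terminal-goal weight comes purely from the replication advantage of agents that de-emphasize the terminal goal, which is the pressure described above.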
That just seems like a reason that no agent with even medium-term goals should ever make agents that can copy and modify themselves. They will change, multiply, and come back to out-compete or outright fight you. It will take a little time, so if your goals are super short-term maybe you don’t care. But for even medium-term goals, it just seems like an error to do that.
If I’m an AI making subagents, I’m going to make damn sure they’re not going to multiply and change their goals.
I predict that that viewpoint is selected against in competitive environments (“instrumentally divergent”?).
I didn’t say “I have a happy reason not to believe the orthogonality thesis”.
That makes sense. I’ll add that to my list of reasons that competition that’s not carefully controlled is deathly dangerous.