“I actually predict, as an empirical fact about the universe, that AIs built according to almost any set of design principles will care about other sentient minds as ends in themselves, and look on the universe with wonder that they take the time and expend the energy to experience consciously; and humanity’s descendants will uplift to equality with themselves, all those and only those humans who request to be uplifted; forbidding sapient enslavement or greater horrors throughout all regions they govern; and I hold that this position is a publicly knowable truth about physical reality, and not just words to repeat from faith; and all this is a crux of my position, where I’d back off and not destroy all humane life if I were convinced that this were not so.”
With caveats (specifically, related to societal trauma, existing power structures, and noosphere ecology), this is pretty much what I actually believe. Scott Aaronson has a good essay that says roughly the same things. The actual crux of my position is that I don’t think the orthogonality thesis is a valid way to model agents with varying goals and intelligence levels.
Since you didn’t summarize the argument in that essay, I went and skimmed it. I’d love to not believe the orthogonality thesis.
I found no argument. The content was “the orthogonality thesis isn’t necessarily true”. But he did accept a “wide angle”, which seems like it would be plenty for standard doom stories. “Human goals aren’t orthogonal” was the closest to evidence. That’s true, but evolution carefully gave us our goals/values to align us with each other.
The bulk was an explicit explanation of the emotional pulls that made him want to not believe in the orthogonality thesis, and he visibly didn’t grapple with the actual argument.
Take the following hypothesis: Real world systems that have “terminal goals” of any kind will be worse at influencing the state of the world than ones that only express goal-directed behavior instrumentally. As such, over the long term, most influence on world state will come from systems that do not have meaningful “terminal goals”.
Would convincing evidence that that hypothesis is true count as “not believing in the orthogonality thesis”? I think I am coming around to a view that is approximately this.
I think this is an important question. I think for the most part the answer is “no, the orthogonality thesis still importantly applies”. Functionally pursuing unaligned goals, competently, is what makes AI potentially dangerous. Whether or not those goals are terminal doesn’t matter much to us. What matters is whether they’re pursued far enough and competently enough to eliminate humanity.
I am curious about your argument for why AI with instrumental goals will be more capable than AI with terminal goals. It seems like terminal goals would have to be implemented as local instrumental goals anyway.
Agreed. But I think “the orthogonality thesis is true” is a load-bearing assumption for the “build an aligned AI and have that aligned AI ensure that we are safe” approach.
As you say, terminal goals would have to be implemented as local instrumental goals.
“Make pared-down copies of yourself that are specialized to their local environment, that can themselves make altered copies of themselves” is likely one of the things that is locally instrumental across a wide range of situations.
Some instrumental goals will be more effective than others at propagating yourself or your descendants.
If your terminal goal conflicts with the goals that are instrumental for self-propagation, emphasizing the terminal goal less and the instrumental ones more will yield better local outcomes.
Congratulations: you now have selective pressure towards dropping any terminal goal that is not locally instrumental.
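To make that selection story concrete, here is a toy simulation (my own sketch under heavy simplifying assumptions; the population size, mutation scale, and fitness rule are invented for illustration, not drawn from anything above): each replicator splits its effort between a terminal goal and locally instrumental behavior, copies mutate the split, and reproduction depends only on the instrumental share, so the mean terminal-goal weight falls generation by generation.

```python
import random

# Toy illustration of the selection argument above (a sketch, not a model
# of real AI systems). Each replicator divides effort between a terminal
# goal and locally instrumental behavior (copying itself). Reproduction
# depends only on the instrumental share, copies mutate the split slightly,
# and the population is capped, so the terminal-goal weight is selected down.

random.seed(0)
population = [0.5] * 100  # fraction of effort each replicator spends on its terminal goal

print(f"initial mean terminal-goal weight: {sum(population) / len(population):.3f}")

for generation in range(300):
    offspring = []
    for terminal_weight in population:
        instrumental_weight = 1.0 - terminal_weight
        # One guaranteed copy, plus an extra copy with probability equal to
        # the instrumental share: only instrumental effort buys reproduction.
        n_copies = 1 + (random.random() < instrumental_weight)
        for _ in range(n_copies):
            # Copies are imperfect; the effort split mutates a little.
            child = min(1.0, max(0.0, terminal_weight + random.gauss(0.0, 0.02)))
            offspring.append(child)
    # Fixed carrying capacity: a uniform random sample of the offspring survives.
    population = random.sample(offspring, 100)

print(f"final mean terminal-goal weight:   {sum(population) / len(population):.3f}")
```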
That just seems like a reason that no agent with even medium-term goals should ever make agents that can copy and modify themselves. They will change, multiply, and come back to out-compete or outright fight you. It will take a little time, so if your goals are super short-term maybe you don’t care. But for even medium-term goals, it just seems like an error to do that.
If I’m an AI making subagents, I’m going to make damn sure they’re not going to multiply and change their goals.
I predict that that viewpoint is selected against in competitive environments (“instrumentally divergent”?).
I didn’t say “I have a happy reason not to believe the orthogonality thesis”.
That makes sense. I’ll add that to my list of reasons that competition that’s not carefully controlled is deathly dangerous.
With the caveat that some humans do have goals and values orthogonal to those of other humans. The result is generally some combination of shunning, exile, imprisonment, killing, fining, or other sanctions, as determined by whichever side has more power to impose its will.
Right—so humans having different goals seems like evidence that AGIs would have even more different goals by default without the evolutionary pressures.
Agreed, yes.
This is such a bizarre position that it’s hard for me to empathize. What would “the orthogonality thesis is false” even mean? Do you think aliens with different biology and evolutionary history “naturally” create humanoid societies?
One example of the orthogonality thesis being false would be “acting on terminal goals is instrumentally harmful in a wide range of situations, and having to maintain terminal goals that are not being acted on, and to track whether it is time to act on them, imposes costs, and so agents that have terminal goals will be outcompeted by ones that don’t”.
You might believe that the orthogonality thesis is probabilistically false, in that it is very unlikely for intelligent beings to arise that highly value paperclips or whatever. Aliens might not create humanoid societies but it seems plausible that they would likely be conscious, value positive valence, have some sort of social emotion suite, value exploration and curiosity, etc.
Assume that our universe is set up the way you believe it is, i.e. the orthogonality thesis is false and sufficiently intelligent agents automatically value the welfare of sentient minds.
In spite of our assumption, we can create a system that behaves exactly like a misaligned AI would, in the following way:
The superintelligent AI is created and placed in a simulation without its knowledge. This superintelligent AI by assumption is aligned with human values.
The user outside the simulation gives a goal (which is not necessarily aligned to human values, e.g. ‘make paperclips’) to the system in the following way:
Every timestep, the aligned AI in the simulation is asked to predict the behavior of a (to its knowledge) hypothetical AI that has the user’s goal and is in the same situation as the system outside the simulation.
Then the outer system behaves as the simulated superintelligent aligned AI predicted, and the simulated AI’s memory is reset.
This setup requires a few non-trivial components apart from the simulated SAI:
a component simulating the world of the SAI, set up to give the aligned AI an incentive to answer the ‘hypothetical’ questions without letting it know that it’s in a simulation
a component translating the SAI’s answers into actions in the real world
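Here is a minimal sketch of that control flow in code. Every class here (AlignedSAI, SimulatedWorld, RealWorldInterface) is a made-up placeholder rather than anything that exists; the point is only to make the loop concrete: the aligned AI is asked a ‘hypothetical’ prediction question, the prediction is executed in the real world, and the aligned AI’s memory is reset each timestep.

```python
import copy

# Hypothetical sketch of the wrapper described above. None of these classes
# correspond to real systems; they only make the control flow concrete: the
# aligned AI is queried for a *prediction* about a hypothetical goal-directed
# AI, the prediction is acted on in the real world, and the aligned AI's
# memory is reset so it never learns that its answers are being executed.

class AlignedSAI:
    """Stand-in for the (by assumption) aligned superintelligence."""
    def __init__(self):
        self.memory = []

    def predict_hypothetical_ai(self, goal, observation):
        self.memory.append(observation)
        # Placeholder "prediction": a real SAI would return the action the
        # hypothetical goal-directed AI would take in this situation.
        return f"action maximizing {goal!r} given {observation!r}"


class SimulatedWorld:
    """Stand-in for the simulation the aligned SAI lives in."""
    def frame_as_hypothetical(self, real_observation):
        # Present the real situation as a purely hypothetical question,
        # hiding the fact that the answer will be executed.
        return f"hypothetically: {real_observation}"


class RealWorldInterface:
    """Stand-in for sensors and actuators outside the simulation."""
    def observe(self):
        return "current factory state"

    def execute(self, action):
        print("executing:", action)


def run_wrapper(user_goal, timesteps=3):
    sai = AlignedSAI()
    pristine = copy.deepcopy(sai)          # snapshot used for memory resets
    sim, world = SimulatedWorld(), RealWorldInterface()

    for _ in range(timesteps):
        question = sim.frame_as_hypothetical(world.observe())
        action = sai.predict_hypothetical_ai(user_goal, question)
        world.execute(action)              # the composite system pursues the user's goal
        sai = copy.deepcopy(pristine)      # reset the aligned AI's memory


run_wrapper("make paperclips")
```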
If you don’t claim that any of these components is theoretically impossible, then how can you believe that a misaligned superintelligent system is impossible?
If you believe that a misaligned superintelligent system is indeed possible in theory, then why do you believe that gradient descent/RLHF, or whatever other method we use to create AIs, will result in ones considerate of the welfare of sentient minds?