Using a superintelligence to optimize some explicit goal, i.e. giving it the ‘wrapper structure’, is an obvious and tractable way to direct it
Is it obvious and tractable? MIRI doesn’t currently seem to think so. Given that, it might be worth considering some alternative possibilities, especially those that don’t involve the creation of a superpowered wrapper mind as a failure state.
and some superintelligences kill everyone anyways
The arguments for why superintelligences will kill everyone tend to route through those intelligences being or becoming wrapper minds. So if we had an AI architecture that was not a wrapper mind, nor especially likely to become one, that might defuse some of those arguments’ force.
Is it obvious and tractable? MIRI doesn’t currently seem to think so.
I disagree: MIRI thinks it’s obvious and tractable to predict the “end result” of creating a superintelligence with the wrong value function. It’s just not good.
The arguments for why superintelligences will kill everyone tend to route through those intelligences being or becoming wrapper minds. So if we had an AI architecture that was not a wrapper mind, nor especially likely to become one, that might defuse some of those arguments’ force.
This is the kind of logic I talk about when I say “sound more like attempts to obfuscate the problem than serious proposals designed to verifiably prevent the end of the world”. It’s like the reasoning goes:
Mathematicians keep pointing out the ways superintelligences with explicit goals cause bad outcomes.
If we add complications, such as implicit goals, to the superintelligences, they can’t reason as analytically about them.
Therefore, superintelligences with implicit goals are “safe”.
The analytical reasoning is not the problem. The problem is that humans have a very particular habitat they need for their survival, and if we have a superintelligence running around waving its magic wand, casting random spells at things, we will probably not be able to survive. The problem is also that we need a surefire way of getting this superintelligence to wave its magic wand and prevent the other superintelligences from spawning. “What if we flip random bits in the superintelligence’s code” is not a solution either, for the same reason.
I disagree: MIRI thinks it’s obvious and tractable to predict the “end result” of creating a superintelligence
But they don’t think it’s an obvious and tractable means of “directing” such a superintelligence towards our actual goals, which is what the sentence I was quoting was about.
It’s like the reasoning goes
I didn’t say any of that. I would rather summarize my position as:
Mathematicians keep pointing out the ways superintelligences with explicit goals will lead to bad outcomes. They also claim that any powerful cognitive system will tend to have such goals.
But we seem to have lots of examples of powerful cognitive systems which don’t behave like explicit goal maximizers.
Therefore, perhaps we should try to design superintelligences which are also not explicit goal maximizers. And also re-analyze the conditions under which the mathematicians’ purported theorems hold so we can have a better picture of which cognitive systems will act like explicit goal maximizers under which circumstances.
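To make the terms in this exchange concrete, here is a minimal, purely illustrative Python sketch of the distinction being argued over: an agent whose top-level loop explicitly maximizes a fixed utility function over predicted outcomes (the “wrapper structure”), versus an agent whose behavior comes from a learned policy with no explicit objective anywhere in its decision loop. All names here (WrapperAgent, PolicyAgent, utility, world_model, policy) are hypothetical; this is not anyone’s actual proposal or MIRI’s framing.

```python
from typing import Any, Callable, Iterable


class WrapperAgent:
    """Wrapper structure: a fixed outer loop that maximizes one explicit utility."""

    def __init__(self,
                 utility: Callable[[Any], float],
                 world_model: Callable[[Any, Any], Any]):
        self.utility = utility          # the explicit, static goal
        self.world_model = world_model  # predicts the outcome of taking an action

    def act(self, state: Any, options: Iterable[Any]) -> Any:
        # Every decision is scored against the same explicit objective.
        return max(options, key=lambda a: self.utility(self.world_model(state, a)))


class PolicyAgent:
    """No explicit objective in the loop: behavior is whatever the policy outputs."""

    def __init__(self, policy: Callable[[Any], Any]):
        self.policy = policy  # e.g. a trained network; any goals are implicit in it

    def act(self, state: Any) -> Any:
        # Nothing here consults a utility function over predicted outcomes.
        return self.policy(state)
```

On this framing, the question raised above is under what conditions a system shaped like the second class ends up behaving like the first, which is what re-analyzing the mathematicians’ purported theorems would be trying to pin down.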
I think my real objection is that MIRI kind of agrees with the idea “don’t attempt to make a pure utility maximizer with a static loss function on the first try” and thus has tried to build systems that aren’t pure utility maximizers, like ones that are instead corrigible or have “chill”. They just kinda don’t work so far, and anybody suggesting that they haven’t looked is being a bit silly.
Instead, I wish someone suggesting this would actually concretely describe the properties they hope to gain by removing a value function, as I suspect the real answer is… corrigibility or chill. Saying “oh this pure utility maximizer thing looks really hard, let’s explore the space of all possible agent designs instead” isn’t really helpful: what are you looking to find, and why is it safer?