Content Warning: Do not listen to anything I say about the technical problem of alignment. I am generally incapable of reasoning about these sorts of things and am almost certainly wrong.
For MIRI and people who think like MIRI does, the big question is: “how do we align a superintelligence [which is assumed to have the wrapper structure]?”
I think the big question MIRI asks at this point is “how do we prevent someone from using a superintelligence to kill everyone”. In other words, how not to end up there. Some superintelligences that would kill everyone have the wrapper structure, some superintelligences end up modifying themselves to have ‘the wrapper structure’, and some superintelligences kill everyone in their default state.
The reason people in alignment focus on giving a superintelligence some explicit goal, i.e. giving it the ‘wrapper structure’, is that it’s an obvious and mathematically tractable way to direct it. The reason we need to direct it in the first place is so we can prevent other organizations and people from creating those superintelligences that ruin everything. There are other ways to direct them, of course, but when I hear those other ways explicitly described, they either turn out to also kill us or end up seeming like attempts to obfuscate the problem rather than serious proposals designed to verifiably prevent the end of the world.
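For concreteness, here is a minimal toy sketch, purely my own illustration with made-up names, of what “the wrapper structure” amounts to in this discussion: one fixed, explicit goal sitting on top of whatever world model and search the system has, with every action chosen to maximize that goal.

```python
# Toy illustration of a "wrapper mind": all of the capability (world model,
# search over actions) sits underneath a single fixed, explicit utility
# function that every decision is routed through. Names are hypothetical.

def wrapper_agent(utility, world_model, candidate_actions, observation):
    """Pick the action whose predicted outcome scores highest under `utility`."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        predicted_outcome = world_model(observation, action)
        score = utility(predicted_outcome)  # the "wrapper": one fixed goal
        if score > best_score:
            best_action, best_score = action, score
    return best_action

# Stand-in components, just to make the sketch runnable:
if __name__ == "__main__":
    utility = lambda outcome: outcome            # "make this number as big as possible"
    world_model = lambda obs, act: obs + act     # crude stand-in for prediction
    print(wrapper_agent(utility, world_model, candidate_actions=[-1, 0, 1], observation=10))
```

The “mathematically tractable” part is that claims about the system’s behavior reduce to claims about `utility` and the argmax, which is exactly what makes it easy both to direct and to analyze.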
You say:
Are wrapper-minds inevitable? I can’t imagine that they are.
But someone is eventually going to try to run one unless we do something about it. So what can we do? And how can we be sure the more complicated proposal is not going to kill us as well?
Using a superintelligence to optimize some explicit goal, i.e. giving it the ‘wrapper structure’, is an obvious and tractable way to direct it
Is it obvious and tractable? MIRI doesn’t currently seem to think so. Given that, it might be worth considering some alternative possibilities. Especially those that don’t involve the creation of a superpowered wrapper mind as a failure state.
and some superintelligences kill everyone anyways
The arguments for why superintelligences will kill everyone tend to route through those intelligences being or becoming wrapper minds. So if we had an AI architecture that was not a wrapper mind, nor especially likely to become one, that might defuse some of those arguments’ force.
Is it obvious and tractable? MIRI doesn’t currently seem to think so.
I disagree; MIRI thinks it’s obvious and tractable to predict the “end result” of creating a superintelligence with the wrong value function. It’s just not good.
The arguments for why superintelligences will kill everyone tend to route through those intelligences being or becoming wrapper minds. So if we had an AI architecture that was not a wrapper mind, nor especially likely to become one, that might defuse some of those arguments’ force.
This is the kind of logic I’m talking about when I say these proposals “sound more like attempts to obfuscate the problem than serious proposals designed to verifiably prevent the end of the world”. It’s like the reasoning goes:
Mathematicians keep pointing out the ways superintelligences with explicit goals cause bad outcomes.
If we add complications, such as implicit goals, to the superintelligences, the mathematicians can’t reason as analytically about them.
Therefore superintelligences with implicit goals are “safe”.
The analytical reasoning is not the problem. The problem is that humans have a very particular habitat they need for their survival, and if we have a superintelligence running around waving its magic wand, casting random spells at things, we will probably not be able to survive. The problem is also that we need a surefire way of getting this superintelligence to wave its magic wand and prevent the other superintelligences from spawning. “What if we flip random bits in the superintelligence’s code” is not a solution either, for the same reason.
I disagree, MIRI thinks it’s obvious and tractable to predict the “end result” of creating a superintelligence
But they don’t think it’s an obvious and tractable means of “directing” such a superintelligence towards our actual goals, which is what the sentence I was quoting was about.
It’s like the reasoning goes
I didn’t say any of that. I would rather summarize my position as:
Mathematicians keep pointing out the ways superintelligences with explicit goals will lead to bad outcomes. They also claim that any powerful cognitive system will tend to have such goals.
But we seem to have lots of examples of powerful cognitive systems which don’t behave like explicit goal maximizers (a toy contrast is sketched after this list).
Therefore, perhaps we should try to design superintelligences which are also not explicit goal maximizers. And also re-analyze the conditions under which the mathematicians’ purported theorems hold so we can have a better picture of which cognitive systems will act like explicit goal maximizers under which circumstances.
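To make that second point concrete, here is an equally toy sketch, again my own illustration with made-up names, of a system that is not an explicit goal maximizer: it acts from a bag of context-triggered habits and never consults a single utility function at all.

```python
# Toy contrast: a non-wrapper agent that acts from context-triggered habits
# rather than maximizing one explicit utility function. Purely illustrative;
# the names and the habit list are made up.

def habit_agent(habits, observation):
    """Fire the first habit whose trigger matches; no global objective is consulted."""
    for trigger, behavior in habits:
        if trigger(observation):
            return behavior(observation)
    return None  # nothing triggered, so the agent simply does nothing

# Stand-in habits, just to make the sketch runnable:
if __name__ == "__main__":
    habits = [
        (lambda obs: obs < 0,  lambda obs: "retreat"),
        (lambda obs: obs > 10, lambda obs: "explore"),
    ]
    print(habit_agent(habits, observation=42))  # -> "explore"
```

Whether something built like this stays non-maximizing once it is made very capable, or once it can modify itself, is exactly what the “re-analyze the purported theorems” step is supposed to tell us.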
I think my real objection is that MIRI kind of agrees with the idea “don’t attempt to make a pure utility maximizer with a static loss function on the first try” and thus has tried to build systems that aren’t pure utility maximizers, like ones that are instead corrigible or have “chill”. They just kinda don’t work so far and anybody suggesting that they haven’t looked is being a bit silly.
Instead, I wish someone suggesting this would actually concretely describe the properties they hope to gain by removing a value function, as I suspect the real answer is… corrigibility or chill. Saying “oh this pure utility maximizer thing looks really hard, let’s explore the space of all possible agent designs instead” isn’t really helpful. What are you looking to find, and why is it safer?