I disagree: MIRI thinks it’s obvious and tractable to predict the “end result” of creating a superintelligence.
But they don’t think it’s an obvious and tractable means of “directing” such a superintelligence towards our actual goals, which is what the sentence I was quoting was about.
It’s like the reasoning goes
I didn’t say any of that. I would rather summarize my position as:
Mathematicians keep pointing out ways that superintelligences with explicit goals will lead to bad outcomes. They also claim that any powerful cognitive system will tend to have such goals.
But we seem to have lots of examples of powerful cognitive systems which don’t behave like explicit goal maximizers.
Therefore, perhaps we should try to design superintelligences which are also not explicit goal maximizers. We should also re-analyze the conditions under which the mathematicians’ purported theorems hold, so we can have a better picture of which cognitive systems will act like explicit goal maximizers under which circumstances.
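For concreteness, here is a minimal sketch (not anyone’s actual formalism) of what “explicit goal maximizer” is being taken to mean in this exchange: an agent whose choices are well described by an argmax over a fixed utility function of predicted outcomes. All names in it are illustrative.

```python
# A toy "explicit goal maximizer": choose the action whose predicted outcome
# scores highest on a fixed utility function. The names here (utility,
# predict_outcome, candidate_actions) are illustrative, not anyone's formalism.

from typing import Callable, Iterable, TypeVar

Action = TypeVar("Action")
Outcome = TypeVar("Outcome")


def explicit_maximizer(
    candidate_actions: Iterable[Action],
    predict_outcome: Callable[[Action], Outcome],
    utility: Callable[[Outcome], float],
) -> Action:
    """Pick the action whose predicted outcome has the highest utility."""
    return max(candidate_actions, key=lambda a: utility(predict_outcome(a)))


# Toy usage: a utility that only counts paperclips ignores everything else
# about the outcome, which is the failure mode the "bad outcomes" claim is about.
if __name__ == "__main__":
    actions = ["make_paperclips", "write_poetry", "do_nothing"]
    paperclips = {"make_paperclips": 1000, "write_poetry": 0, "do_nothing": 0}
    print(explicit_maximizer(actions, paperclips.get, float))
```

The disputed question is how broadly real cognitive systems fit this loop, not whether the loop itself is well defined.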
I think my real objection is that MIRI kind of agrees with the idea “don’t attempt to make a pure utility maximizer with a static loss function on the first try”, and has accordingly tried to design systems that aren’t pure utility maximizers, like ones that are instead corrigible or have “chill”. They just kinda don’t work so far, and anybody suggesting that MIRI hasn’t looked is being a bit silly.
Instead, I wish someone suggesting this would concretely describe the properties they hope to gain by removing a value function, as I suspect the real answer is… corrigibility or chill. Saying “oh, this pure utility maximizer thing looks really hard, let’s explore the space of all possible agent designs instead” isn’t really helpful: what are you looking to find, and why is it safer?
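One concrete answer in this space, roughly in the spirit of Jessica Taylor’s quantilizer proposal (“Quantilizers: A Safer Alternative to Maximizers for Limited Optimization”, 2016), is mild optimization: instead of taking the argmax, sample from the top q-fraction of actions drawn from a trusted base distribution, trading some expected utility for less aggressive exploitation of errors in the utility function. A rough finite-sample sketch, with illustrative names:

```python
# A rough finite-sample quantilizer: rather than the argmax, return a random
# action from the top q-fraction (by utility) of actions sampled from a
# trusted "base" policy. Names and the sampling setup are illustrative.

import random
from typing import Callable, Sequence, TypeVar

Action = TypeVar("Action")


def quantilize(
    base_samples: Sequence[Action],      # actions drawn from a normal/base policy
    utility: Callable[[Action], float],  # the (possibly misspecified) value function
    q: float = 0.1,                      # fraction of top-scoring samples to keep
) -> Action:
    """Return a uniformly random action from the top q-fraction by utility."""
    ranked = sorted(base_samples, key=utility, reverse=True)
    cutoff = max(1, int(len(ranked) * q))
    return random.choice(ranked[:cutoff])
```

The hoped-for property (the “chill”) is that a bounded amount of selection pressure exploits flaws in the value function less aggressively than a pure argmax would; whether that buys real safety at superintelligent capability levels is exactly the kind of question the disagreement above is about.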