I think I see how X-and-only-X is a problem if we are using a classifier to furnish a 0⁄1 reward. However, it seems like less of a problem if we’re using a regression model to furnish a floating point reward that attempts to describe all of our values (not just our values as they pertain to the completion of one particular task).
Suppose we are granted a regression model which accurately predicts the value we assign to any event which happens in the world. If this model furnishes the AI’s reward function, it creates pressure to avoid optimizing for hidden Ys we don’t want: Since we don’t want them, the regression model gives them a negative score, and the AI works to avoid them.
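To make the contrast concrete, here is a minimal sketch (all names and numbers are hypothetical illustrations, not anything from the original discussion) of the difference between a 0⁄1 task classifier and a regression model that scores whole outcomes: a hidden Y the task classifier never sees still drags the regression reward down.

```python
# Illustrative sketch only: a 0/1 task-classifier reward vs. a regression-style
# reward over whole outcomes. Everything here is a hypothetical stand-in.

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Outcome:
    task_completed: bool
    # Hypothetical side effects ("hidden Ys"), keyed by name with a magnitude.
    side_effects: Dict[str, float] = field(default_factory=dict)

def classifier_reward(outcome: Outcome) -> float:
    """0/1 reward: only asks whether the task got done, so it is blind to
    any hidden Ys the plan produces along the way."""
    return 1.0 if outcome.task_completed else 0.0

def regression_reward(outcome: Outcome, value_weights: Dict[str, float]) -> float:
    """Floating-point reward from a stand-in value regression: it scores the
    whole outcome, so side effects we disvalue pull the score down."""
    score = 1.0 if outcome.task_completed else 0.0
    for effect, magnitude in outcome.side_effects.items():
        score += value_weights.get(effect, 0.0) * magnitude
    return score

# Example: the task gets done, but with an unwanted hidden side effect.
outcome = Outcome(task_completed=True, side_effects={"deception": 1.0})
weights = {"deception": -10.0}  # we strongly disvalue deception
print(classifier_reward(outcome))           # 1.0  -- hidden Y goes unpenalized
print(regression_reward(outcome, weights))  # -9.0 -- hidden Y is penalized
```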
A regression model which accurately predicts our values is a huge ask. But I’m not sure getting from here to there would require solving new basic problems. Instead, it seems to me like we’d need to get much better at an existing problem: Building models with high predictive accuracy in complex domains.
Maybe you don’t think we will get to the necessary level of accuracy by hill-climbing our existing predictive model tech, and this is what will create the new basic problems?
An AI that learns to exactly imitate humans, not just passing the Turing Test to the limits of human discrimination on human inspection, but perfect imitation with all added bad subtle properties thereby excluded, must be so cognitively powerful that its learnable hypothesis space includes systems equivalent to entire human brains. I see no way that we’re not talking about a superintelligence here.
“Superintelligence” is a word which, to me, suggests a qualitative shift relative to existing hypothesis learning systems. Existing hypothesis learning systems don’t attempt to maximize paperclips or anything like that—they’re procedures that search for hypotheses which fit data.
There are many quantitative axes along which such procedures can be compared: How much time does the procedure take? How much data does the procedure require? How complex can the data be? How well do the resulting hypotheses generalize? Etc.
I don’t see any reason to think we will see sudden qualitative shifts as our learning procedures improve along these quantitative axes. Therefore, I suspect the word “superintelligence” has connotations that aren’t actually necessary for the operation of an extremely advanced hypothesis search system. We already have hypothesis learning systems that are superhuman at e.g. predicting stock prices, but these systems don’t seem to be trying to break out of their boxes or anything like that. I’m not sure why a hypothesis learning system which is a superhuman neuroscientist would be different.
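As a toy illustration of what I mean by quantitative axes (an assumed setup using scikit-learn, not anything from the original discussion), here is a sketch that compares two ordinary hypothesis-learning procedures on wall-clock time, data used, and held-out generalization, with no qualitative shift anywhere in the picture.

```python
# Toy illustration (assumed setup): comparing two hypothesis-learning
# procedures along purely quantitative axes -- time taken, data used,
# and how well the learned hypotheses generalize.

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=0))]:
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    acc = model.score(X_test, y_test)
    # Each line reports one procedure's position on the quantitative axes:
    # training time, number of examples consumed, held-out generalization.
    print(f"{name}: {elapsed:.2f}s on {len(X_train)} examples, "
          f"held-out accuracy {acc:.3f}")
```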
We have no guarantee of non-Y for any Y a human can’t detect, which covers an enormous amount of lethal territory
I think there are two things that might be worth separating here: malign plans that are disguised as benign plans, and undetectable imperfections broadly speaking. The key difference is whether the undetectable imperfection results from deliberate deception on the AI’s part, or is just an instance of the broad phenomenon of systems having very high (but not perfect) fidelity.
Suppose our emulation of Paul has very high (but not perfect) fidelity, and Paul is not the sort of person who will disguise a malign plan as a benign plan. In this case, we’re likely to see the second phenomenon, but not the first—the first phenomenon would require a gross error in our emulation of Paul, and by assumption our emulation of Paul is very high fidelity.
I think a good case has been made that we need to be very worried about malign plans disguised as benign plans. I’m not personally convinced we need to be very worried about undetectable imperfections more broadly.
So we cannot for example let an untrusted superintelligence originate queries that it can use to learn human behavior; it has to be strictly unsupervised example-based learning rather than a query model.
I found the use of “unsupervised” confusing in this context (“example-based” sounds like supervised learning, where the system gets labeled data). I think maybe passive vs. active learning is the distinction you are looking for?
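Here is a rough sketch of the distinction I have in mind (the interfaces are hypothetical stand-ins, not real APIs): a passive learner only fits examples it is handed, while an active learner originates its own queries, which is exactly what the quoted passage wants to forbid.

```python
# Rough sketch with hypothetical stand-in interfaces: passive vs. active learning.

import random

def passive_learning(examples):
    """Passive: the learner is handed a fixed set of (input, label) pairs it did
    not choose. This can still be supervised; the point is it originates no queries."""
    return fit(examples)

def active_learning(unlabeled_pool, label_oracle, budget=10):
    """Active (a query model): the learner picks which inputs get labeled,
    i.e. it originates its own queries about the target behavior."""
    examples = []
    for _ in range(budget):
        query = choose_most_informative(unlabeled_pool)  # the learner's own choice
        examples.append((query, label_oracle(query)))
    return fit(examples)

# Trivial stand-ins so the sketch runs end to end.
def fit(examples):
    return {"n_examples": len(examples)}

def choose_most_informative(pool):
    return random.choice(pool)

pool = list(range(100))
print(passive_learning([(x, x % 2) for x in pool[:20]]))        # fixed examples
print(active_learning(pool, label_oracle=lambda x: x % 2, budget=5))  # chosen queries
```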