Counterfactual do-what-I-mean
A putative new idea for AI control; index here.
The counterfactual approach could make it possible to give AIs natural language goals.
The basic idea is that when the AI is given a natural language goal like “increase human happiness” or “implement CEV”, it is not to work out for itself what these goals mean, but to follow whatever meaning a pure learning algorithm would establish for them.
This would be safer than a simple figure-out-the-utility-you’re-currently-maximising approach, but it still has a few drawbacks. Firstly, the learning algorithm itself has to be effective: in particular, modifying human understanding of the words must be ruled out, and the learning process must avoid concluding that simpler interpretations are always better. Secondly, humans don’t yet know what these words mean outside our usual comfort zone, so the “learning” task also involves the AI extrapolating beyond what we know.
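To make the intended separation a little more concrete, here is a minimal, purely illustrative Python sketch. Every name in it (GoalSpec, learn_goal_meaning, choose_action, the keyword scoring, and the corpus) is a hypothetical placeholder, not part of any existing proposal; the point it illustrates is that the interpretation of the goal is produced by a fixed learning algorithm on data the agent did not influence, and the agent merely optimises that interpretation.

```python
# Minimal sketch of the counterfactual do-what-I-mean setup.
# All names and the trivial keyword scorer are illustrative placeholders.

from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class GoalSpec:
    """A frozen interpretation of a natural language goal, produced by
    the learning algorithm, mapping predicted outcomes to values."""
    value_of: Callable[[str], float]


def learn_goal_meaning(goal_text: str, human_corpus: List[str]) -> GoalSpec:
    """Stand-in for the 'pure learning algorithm': it infers what the goal
    text means from human data that the agent did NOT influence.
    Here it is a trivial keyword scorer, purely for illustration."""
    keywords = set(goal_text.lower().split()) & set(" ".join(human_corpus).lower().split())
    return GoalSpec(value_of=lambda outcome: float(sum(k in outcome.lower() for k in keywords)))


def choose_action(goal_text: str,
                  counterfactual_corpus: List[str],
                  candidate_actions: List[str],
                  predict_outcome: Callable[[str], str]) -> str:
    """The agent does not interpret the goal itself; it optimises the
    interpretation the learning algorithm would produce on the fixed,
    counterfactual corpus."""
    goal = learn_goal_meaning(goal_text, counterfactual_corpus)
    return max(candidate_actions, key=lambda a: goal.value_of(predict_outcome(a)))
```

The property doing the work here is that the corpus is held fixed independently of the agent’s actions, so the agent cannot score higher by changing what humans (appear to) mean; the extrapolation problem mentioned above is not addressed by this sketch.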
So I think there are basically two issues with specifying natural language goals that should be separated:
1. Defining “what we mean” precisely, e.g. as a mathematical expression. Here is an example of such a proposal.
2. Designing an AI system to make its best effort at estimating the value of this mathematical expression (with its limited computing resources). Here is a statement of this problem and some initial thoughts on it. If we are able to program an AI system to efficiently estimate this value, then it does not seem much harder to program it to efficiently act on the basis of that estimate. (A toy sketch of this factoring follows the list.)
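As a toy illustration of this two-part factoring (entirely hypothetical names, and a placeholder standing in for the mathematical expression), the sketch below separates the definition U from a resource-limited estimator and an agent that acts on the estimate:

```python
# Toy sketch of the two-part factoring; every name is hypothetical and
# the 'definition' of U is a stub, not anyone's actual proposal.

from typing import Callable, List


def U(world_history: str) -> float:
    """(1) Stand-in for a precise (if intractable) definition of 'what we
    mean', expressed over complete world histories."""
    return float(world_history.count("flourishing"))


def estimate_U(action: str,
               sample_history: Callable[[str], str],
               n_samples: int = 100) -> float:
    """(2a) Resource-limited estimation: approximate the expected value of
    U given the action by sampling from the agent's world model."""
    return sum(U(sample_history(action)) for _ in range(n_samples)) / n_samples


def act(candidate_actions: List[str],
        sample_history: Callable[[str], str]) -> str:
    """(2b) Acting on the estimate: pick the action whose estimated value
    of U is highest."""
    return max(candidate_actions, key=lambda a: estimate_U(a, sample_history))
```

If (1) can be pinned down, then (2) reduces to building a good enough estimator plus an optimiser over its output, which is the sense in which acting on the estimate seems not much harder than producing it.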
Both of these problems seem pretty hard. I don’t know whether you factor the problem in a similar way. I also don’t see where stratification fits in; I’m not sure what it is meant to do that Paul’s proposal for (1) doesn’t already do. My guess is that you want “what we mean” to be defined in terms of the interaction of a human/AI system rather than just a human thinking for a really long time, but I don’t see the motivation for this.
This idea is actually very similar to Paul’s proposal, but doesn’t require such an ideal setup.
I still don’t understand the motivation. Is the hope that “what <X value learning algorithm> would infer from observing humans in some hypothetical that doesn’t actually happen” is easier to make inferences about than “what humans would do if they thought for a very long time”?