So I think there are basically two issues with specifying natural language goals that should be separated:
1. Defining “what we mean” precisely, e.g. as a mathematical expression. Here is an example of such a proposal.
2. Designing an AI system to make its best effort at estimating the value of this mathematical expression (with its limited computing resources). Here is a statement of this problem and some initial thoughts on it. If we can program an AI system to efficiently estimate this value, then it does not seem much harder to program it to efficiently act on the basis of that estimate (see the sketch below).
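For concreteness, here is a minimal sketch of the estimate-then-act split in (2), in Python. The utility expression, outcome distribution, candidate actions, and sample budget below are all invented placeholders rather than anything from Paul's actual proposal; the only point being illustrated is that an agent which can estimate the value of the expression under a limited compute budget can act by maximizing that estimate.

```python
import random

# Hypothetical stand-in for the "mathematical expression" in (1):
# the value of taking `action`, defined as an expectation over outcomes
# we cannot evaluate exactly. Everything here is illustrative only.
def true_value(action, outcome):
    return -(action - outcome) ** 2   # toy utility

def sample_outcome():
    return random.gauss(0.0, 1.0)     # toy outcome distribution

def estimate_value(action, budget):
    """Best-effort Monte Carlo estimate of the expression under a compute budget."""
    samples = [true_value(action, sample_outcome()) for _ in range(budget)]
    return sum(samples) / budget

def act(candidate_actions, budget_per_action):
    """Step (2): given the ability to estimate the value, acting on the
    estimate is just picking the candidate with the highest estimate."""
    return max(candidate_actions,
               key=lambda a: estimate_value(a, budget_per_action))

if __name__ == "__main__":
    print(act(candidate_actions=[-1.0, 0.0, 1.0], budget_per_action=1000))
```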
Both of these problems seem pretty hard. I don’t know whether you factor the problem in a similar way. I also don’t see where stratification fits in; I’m not sure what it is meant to do that Paul’s proposal for (1) doesn’t already do. My guess is that you want “what we mean” to be defined in terms of the interaction of a human/AI system rather than just a human thinking for a really long time, but I don’t see the motivation for this.
This idea is actually very similar to Paul’s idea, but doesn’t require such an idealized setup.
I still don’t understand the motivation. Is the hope that “what <X value learning algorithm> would infer from observing humans in some hypothetical that doesn’t actually happen” is easier to make inferences about than “what humans would do if they thought for a very long time”?