I don’t really understand when this would be useful. Is it for an oracle that you don’t trust, or that otherwise has no incentive to explain things to you? Because in that case, nothing here constrains the explanation (or the questions) to be related to reality. It could just choose to explain something that is easy for the human to understand, and then the questions will all be answered correctly.
If you do trust the oracle, the second AI is unnecessary—the first one could just ask questions of the human to confirm understanding, like any student-teacher dynamic.
What am I missing here?
See here for a longer answer: http://lesswrong.com/r/discussion/lw/ol7/true_understanding_comes_from_passing_exams/
It’s an attempt to define what trustworthiness means. If we say “AI 1 gave a truthful or at least correctly informative answer”, then what does that mean? How does that cash out? I’m trying to cash it out as “the human can then answer questions about the subject with some decent degree of accuracy.”
The boolean questions that the second AI asks are fixed; neither AI gets to choose them. Each AI is rewarded when the human’s yes/no answers are accurate, or at least accurate according to the first AI’s estimate. The first AI must therefore try to give the human an accurate enough impression of the situation to answer the second AI’s questions correctly.
Note that if the boolean questions are easy for a human to parse and understand, then the second AI isn’t needed. But I’d expect that, in general, this won’t be the case.
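For concreteness, here is a minimal Python sketch of that reward structure. Everything in it is hypothetical (the names, the example questions, the toy “human” who answers purely from the impression the explanation left); it only illustrates the idea that the first AI is scored on how accurately the human answers a fixed set of boolean questions afterwards.

```python
# Hypothetical sketch of the reward setup described above; not an
# implementation of any existing proposal or library.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BooleanQuestion:
    text: str
    true_answer: bool  # ground truth, or the first AI's best estimate of it


def score_explanation(
    questions: List[BooleanQuestion],
    human_answer: Callable[[str], bool],
) -> float:
    """Reward for AI 1: the fraction of the fixed boolean questions that the
    human answers correctly after reading AI 1's explanation.

    The question list is fixed in advance; neither AI chooses it. AI 2's role
    (not modelled here) is only to put each question to the human in a form
    the human can parse.
    """
    correct = sum(1 for q in questions if human_answer(q.text) == q.true_answer)
    return correct / len(questions)


if __name__ == "__main__":
    # Fixed question list, chosen by neither AI.
    questions = [
        BooleanQuestion("Will the proposed bridge design hold 100 tonnes?", True),
        BooleanQuestion("Does the design rely on untested materials?", False),
    ]

    # Toy "human": answers from whatever impression AI 1's explanation gave.
    impression_after_explanation = {
        "Will the proposed bridge design hold 100 tonnes?": True,
        "Does the design rely on untested materials?": False,
    }

    reward = score_explanation(
        questions, lambda text: impression_after_explanation[text]
    )
    print(f"AI 1's reward: {reward:.2f}")  # 1.00 only if the impression is accurate
```

The constraint doing the work is that `questions` is fixed in advance. If the first AI could choose them, it could steer towards whatever is easiest to explain, which is exactly the worry raised in the parent comment.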