It’s an attempt to define what trustworthiness means. If we say “AI 1 gave a truthful or at least correctly informative answer”, then what does that mean? How does that cash out? I’m trying to cash it out as “the human can then answer questions about the subject to some decent degree of efficacy.”
The boolean questions that the second AI asks are fixed; neither AI gets to choose them. Both AIs are rewarded when the human's yes/no answers are accurate, or at least accurate according to the first AI's estimate. The first AI must therefore try to give the human an accurate enough impression of the situation to answer the second AI's questions correctly.
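To make the setup concrete, here's a minimal sketch in Python. Everything here is hypothetical illustration, not a real API: the names (`run_protocol`, `human_answer`, etc.) are mine, and I'm assuming the reward is simply the fraction of the human's yes/no answers that match the first AI's estimates.

```python
from typing import Callable

def run_protocol(
    explain: Callable[[str], str],            # AI 1: topic -> explanation given to the human
    fixed_questions: list[str],               # boolean questions; neither AI chooses these
    human_answer: Callable[[str, str], bool], # human's yes/no answer, given the explanation
    estimated_truth: Callable[[str], bool],   # AI 1's own estimate of each true answer
    topic: str,
) -> float:
    """Shared reward: the fraction of the human's yes/no answers that
    match AI 1's estimate of the true answer."""
    explanation = explain(topic)
    correct = sum(
        human_answer(explanation, q) == estimated_truth(q)
        for q in fixed_questions
    )
    return correct / len(fixed_questions)
```

Since the reward depends only on how well the human answers after reading the explanation, the first AI can only score well by actually leaving the human with an accurate picture.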
Note that if the boolean questions are easy for a human to parse and understand, then the second AI isn't needed. But I'd expect that, in general, they won't be.