The counterfactual oracle can answer questions whose answers you can evaluate automatically. It might be safe because it doesn’t care about being right in the case where you read the prediction, so it has no reason to manipulate you. The low-bandwidth oracle can answer multiple-choice questions, and might be safe because none of the multiple-choice options are unsafe.
My first thought for this is to ask the counterfactual oracle for an essay on the importance of coffee. In the case where you don’t see its answer, you get an expert to write the best possible essay on coffee, and you score the oracle by how similar its essay is to the expert’s. This only gives you human-level performance, though, since the oracle is rewarded for matching the expert rather than for surpassing them.
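To make the scoring scheme concrete, here is a rough Python sketch of a single episode. Everything in it is an assumption for illustration: the names `counterfactual_episode`, `similarity`, and `erasure_prob` are made up, the oracle and expert are just placeholder callables, and `difflib.SequenceMatcher` is a crude stand-in for whatever essay-similarity measure you would actually use.

```python
import random
from difflib import SequenceMatcher
from typing import Callable, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Crude textual similarity in [0, 1]; a stand-in for a real essay metric."""
    return SequenceMatcher(None, a, b).ratio()


def counterfactual_episode(
    oracle: Callable[[str], str],
    expert: Callable[[str], str],
    question: str,
    erasure_prob: float,
    rng: random.Random,
) -> Tuple[Optional[str], Optional[float]]:
    """Run one episode; return (answer shown to the user, reward for the oracle)."""
    answer = oracle(question)
    if rng.random() < erasure_prob:
        # Hidden case: nobody reads the answer; it is scored against the expert's essay.
        reward = similarity(answer, expert(question))
        return None, reward
    # Read case: the user sees the answer, and the oracle receives no reward at all.
    return answer, None


if __name__ == "__main__":
    # Hypothetical placeholder oracle and expert, for illustration only.
    oracle = lambda q: "Coffee matters because it keeps people alert."
    expert = lambda q: "Coffee matters because it helps people stay alert."
    shown, reward = counterfactual_episode(
        oracle, expert, "Why is coffee important?", 0.1, random.Random(0)
    )
    print("shown to user:", shown)
    print("reward to oracle:", reward)
```

Note that the reward is highest when the oracle's essay matches the expert's exactly, which is the sense in which this setup tops out at human-level performance.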
Thank you. This makes much more sense.