I don’t think you actually want to use supervised training to train f; you want to use feedback of the form “Is this answer much wronger than that answer?” and then train the model not to produce definitely-wrong answers.
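Concretely, I have in mind something like the following sketch, which is purely illustrative (the function name and margin are made up, and the scores stand in for however the model rates an answer, e.g. a log-probability): a pairwise loss over comparisons where one answer was judged much wronger than another, which only pushes the definitely-worse answer down rather than fitting a supervised target.

```python
import torch
import torch.nn.functional as F

def pairwise_wrongness_loss(score_ok: torch.Tensor,
                            score_bad: torch.Tensor,
                            margin: float = 1.0) -> torch.Tensor:
    # score_ok / score_bad: the model's scores for the better answer and for
    # the answer judged *much* wronger. We only penalize the model when it
    # fails to prefer the better answer by at least `margin`; nothing here
    # says what the "right" answer is.
    return F.relu(margin + score_bad - score_ok).mean()
```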
Likewise, the f+ = f− constraint would really want to be something softer (e.g. forcing f+ to give plausible-looking answers to questions as evaluated by f−).
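For the softer version of that constraint, a rough sketch (again with made-up names; `fminus_plausibility` is a stand-in for f− evaluating how plausible an answer looks) would be a penalty term rather than literal weight-tying:

```python
import torch

def soft_consistency_penalty(fplus_answers, fminus_plausibility, weight: float = 0.1):
    # fminus_plausibility maps f+'s answers to probabilities in (0, 1) for how
    # plausible f− finds each one.
    plausibility = fminus_plausibility(fplus_answers)
    # Push f+ toward answers that look plausible to f−, without forcing the
    # two heads to agree exactly the way a hard f+ = f− constraint would.
    return -weight * torch.log(plausibility + 1e-8).mean()
```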
I think that most questions about what is useful / tacitly assumed / etc. can be easily handled on top of the “raw” ability to elicit the model’s knowledge (if you like, you could imagine having a debate about which answer is better all things considered, using f+ to assess the model’s beliefs about closed questions).
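To gesture at the shape of that, a very rough sketch (everything here is hypothetical: `debater`, `judge`, and `f_plus` are stand-ins, not a real protocol):

```python
def debate_over_answers(question, answer_a, answer_b, debater, judge, f_plus, rounds=3):
    # Two debaters argue about which answer is better all things considered;
    # the judge may ask f_plus closed questions about the model's beliefs
    # raised by each argument.
    transcript = [("question", question), ("A", answer_a), ("B", answer_b)]
    for _ in range(rounds):
        for side in ("A", "B"):
            argument = debater(side, transcript)
            transcript.append((side, argument))
            for closed_q in judge.closed_questions(argument):
                transcript.append(("f+", f_plus(closed_q)))
    return judge.verdict(transcript)  # "A" or "B"
```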
I do think there are a lot of problems along these lines that you’d want to think about a bunch in theory, and then later need to do a bunch of empirical work on. But unfortunately I also think there are a lot of “bigger fish to fry” that are very likely to sink this entire family of approaches. So the first order of business is understanding those and wandering our way to a general category of solution that might actually work.
Ok, the softer constraints make sense to me, thanks.
Using a debate with f+ assessing simple closed questions makes sense, but it seems to me that this only moves much of the problem rather than solving it. We start with “answering honestly vs predicting human answers” and end up with “judging honestly vs predicting human judgments”.
While “Which answer is better, Alice’s or Bob’s?” is a closed question, learning to answer it in the general case still requires applying a full model of human values, so it seems a judge-model is likely to be instrumental (or essentially equivalent; again, I’m not really sure what we’d mean by an intended model for the judge).
But perhaps I’m missing something here: is predicting-the-judge less of a problem than the original? Are there better approaches than debate that wouldn’t have analogous issues?