Rohin Shah comments on Some criteria for sandwiching projects

Rohin Shah 23 Aug 2021 13:39 UTC
LW: 4 AF: 3
AF
Planned summary for the Alignment Newsletter:
This post outlines the pieces needed in order to execute a “sandwiching” project on <@aligning narrowly superhuman models@>(@The case for aligning narrowly superhuman models@), with the example of answering questions about a text when humans have limited access to that text. (Imagine answering questions about a paper, where the model can read the full paper but human labelers can only read the abstract.) The required pieces are:
1. **Aligned metric:** There needs to be some way of telling whether the project succeeded, i.e. the technique made the narrowly superhuman model more aligned. In the Q&A case, we get the aligned metric by seeing how humans answer when they can read the entire text.
2. **A narrowly superhuman model:** The model must have the capability to outperform the labelers on the task. In the Q&A case, we get this by artificially restricting the input that the labelers get (relative to what the model gets). In other cases we could use labelers who lack the relevant domain expertise that the model instead knows.
3. **Headroom on the aligned metric:** Baseline methods (such as training from labeler feedback) should not perform very well, so that there is room for a better technique to improve performance. It would be especially nice if making the model larger led to no improvement in the aligned metric; this would mean that we are working in a situation that is primarily an alignment failure.
4. **A natural plan of attack:** We have some approach for doing better than the baseline. For the Q&A example, we could train one model that selects the most relevant piece of text (by training on labelers’ ratings of relevance) and another model that answers the question given that relevant piece.
Planned opinion:
This seems like a good way to generate good concrete empirical projects to work on. It does differ from the original post in placing less of an emphasis on “fuzzy” tasks, where aligned metrics are hard to come by, though it isn’t incompatible with it (in a “fuzzy” task, you probably still want as aligned a metric as you can get in order to measure progress).