I think that we should keep these stronger Lookup Table and Overseer’s Manual scenarios in mind when considering whether HCH might be safe.
These scenarios aren’t just strictly stronger; they also have downsides, right? In particular, they seem to give up some of the properties that made people optimistic about Paul’s approach in the first place. See, for example, this post by Daniel Dewey:
It seems to me that this kind of approach is also much more likely to be robust to unanticipated problems than a formal, HRAD-style approach would be, since it explicitly aims to learn how to reason in human-endorsed ways instead of relying on researchers to notice and formally solve all critical problems of reasoning before the system is built.
In the Lookup Table case, and to the extent that humans in Overseer’s Manual are just acting as lookup table decompressors, aren’t we back to “relying on researchers to notice and formally solve all critical problems of reasoning before the system is built”?
Yeah, to some extent. In the Lookup Table case, you need a (potentially quite expensive) way of resolving all mistakes. In the Overseer’s Manual case, you can also leverage humans to do somewhat more robust reasoning (for example, they can notice a typo in a question and still respond correctly, where the Lookup Table would fail). Though in low-bandwidth oversight, the space of things participants can notice and correct is fairly limited.
Though I think this still differs from HRAD, in that the output of HRAD would be a much smaller thing in terms of description length than the Lookup Table, and you can buy extra robustness by adding many more human-reasoned entries to the Lookup Table (e.g., automatically add versions of every question with typos that don’t change its meaning, or add 1,000 different sanity-check questions to flag when things go wrong).
So I think there are additional ways the system could correct mistaken reasoning relative to what I would think the output of HRAD would look like, but you do need to have processes that you think can correct any way that reasoning goes wrong. So the problem could be less difficult than HRAD, but still tricky to get right.
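To make the typo-augmentation idea concrete, here is a minimal hypothetical sketch (the names `typo_variants` and `build_augmented_table` are mine, not from any actual proposal): each canonical question gets simple single-character typo variants pre-generated and mapped to the same answer, so the table tolerates small input errors without a human in the loop.

```python
def typo_variants(question):
    """Yield single-character deletions and adjacent transpositions."""
    for i in range(len(question)):
        yield question[:i] + question[i + 1:]          # deletion
        if i + 1 < len(question):
            yield (question[:i] + question[i + 1]
                   + question[i] + question[i + 2:])   # adjacent swap

def build_augmented_table(base_table):
    """Copy the table, adding typo variants that share the original answer."""
    table = dict(base_table)
    for question, answer in base_table.items():
        for variant in typo_variants(question):
            table.setdefault(variant, answer)  # never overwrite a real entry
    return table

base = {"What is 2+2?": "4"}
table = build_augmented_table(base)
print(table.get("Wht is 2+2?"))  # a deletion variant still resolves to "4"
```

This only catches mechanical errors the table-builders anticipated in advance, which is exactly the limitation being discussed: it buys robustness to a pre-specified class of mistakes, not the open-ended error correction a human overseer could provide.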