This is called “sandwiching,” and a number of people (including Ajeya, judging by comments in this comment section and her previous post on the general idea) do seem somewhat optimistic about it. (Though the optimism could come from “applying this approach to present-day models seems useful for better understanding alignment” rather than “this approach is promising as a final alignment approach.”)
There’s some discussion of sandwiching in the section “Using higher-quality feedback and extrapolating feedback quality” of the OP, which explains why sandwiching probably won’t be enough on its own, so I assume it would have to be complemented with other safety methods.