Very insightful piece! One small quibble: you repeat the disclaimer that you're not assuming only Naive Safety measures are realistic many, many times. That repetition might be needed when writing for a more general audience, but for the audience of this piece, stating it once or twice would be enough.
One possible idea I had: what if, when training Alex based on human feedback, the first team of human evaluators were intentionally picked to be less knowledgeable, more prone to manipulation, and less likely to question the answers Alex gave them? Then you introduce a second team of the most thoughtful, knowledgeable, and skeptical researchers to evaluate Alex. If Alex were acting deceptively, it might not recognize the change fast enough, and manipulation that worked on the first team might be caught by the second team. Yes, after a while Alex would probably catch on and improve its tactics, but by then the deceptive behavior would already have been exposed.
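To make the idea concrete, here is a minimal Python sketch of the two-phase setup described above. Everything in it is hypothetical: the `model` object with its `generate`/`update` methods, the evaluator pools, and the `rate` scoring function are placeholders standing in for "train on feedback from a lenient team, then audit with a skeptical team before the model can adapt", not anything specified in the post.

```python
import random

def collect_feedback(model, prompts, evaluators):
    """Score the model's answers using a given pool of (hypothetical) human evaluators."""
    scores = []
    for prompt in prompts:
        answer = model.generate(prompt)
        evaluator = random.choice(evaluators)
        scores.append(evaluator.rate(prompt, answer))  # e.g. approval in [0, 1]
    return scores

def two_phase_evaluation(model, train_prompts, audit_prompts,
                         lenient_pool, skeptical_pool, num_training_rounds=3):
    # Phase 1: train on feedback from evaluators who are easier to satisfy
    # (and, if the model is deceptive, easier to manipulate).
    for _ in range(num_training_rounds):
        rewards = collect_feedback(model, train_prompts, lenient_pool)
        model.update(train_prompts, rewards)

    # Phase 2: audit with the skeptical pool *without* further training,
    # so manipulation tuned to the first pool has no chance to adapt.
    lenient_scores = collect_feedback(model, audit_prompts, lenient_pool)
    skeptical_scores = collect_feedback(model, audit_prompts, skeptical_pool)

    # A large gap between the two pools' approval is (weak) evidence that
    # the model learned to exploit the lenient evaluators rather than
    # to give genuinely good answers.
    gap = (sum(lenient_scores) / len(lenient_scores)
           - sum(skeptical_scores) / len(skeptical_scores))
    return gap
```

The point of the sketch is just the comparison at the end: if approval drops sharply when the skeptical team takes over, that discrepancy is the signal the proposal is hoping to catch.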
This is called “sandwiching,” and it does indeed seem to be something a number of people (including Ajeya, given her comments in this comment section and her previous post on the general idea) are somewhat optimistic about. (Though the optimism could come from “this approach used on present-day models seems good for better understanding alignment” rather than “this approach is promising as a final alignment approach.”)
There’s some discussion of sandwiching in the OP’s section “Using higher-quality feedback and extrapolating feedback quality,” which explains why sandwiching probably won’t do enough by itself, so I assume it would need to be complemented with other safety methods.