Some personal takes in response:
Yeah, largely the letter of the law isn’t sufficient.
Some evals are hard to goodhart. E.g. “can red-teamers demonstrate problems (given our mitigations)” is pretty robust — if red-teamers can’t demonstrate problems, that’s good evidence of safety (for deployment with those mitigations), even if the mitigations feel jury-rigged.
Yeah, this is intended to be complemented by superalignment.