My two bullet point summary / example, to test my understanding:
- We ask labs to implement some sort of filter (likely AI-based) to monitor their AI’s outputs.
- Then we have a human team “play GPT-5” and try to submit dangerous outputs that the filter does not detect (with AI assistance etc., of course).
Is this (an example of) a control measure, and a control evaluation?
Yes, this seems like an example of a control measure and a control evaluation.
However, the situation looks less like “human team plays GPT-5” and more like “human team trains GPT-5”.
(And there are additional complications: once the blue team starts training the weak monitor on GPT-5 outputs, the red team needs to be able to demonstrate exploration hacking approaches.)