StefanHex comments on The case for ensuring that powerful AIs are controlled

StefanHex 31 Jan 2024 21:42 UTC
1 point
0
My two bullet point summary / example, to test my understanding:
- We ask labs to implement some sort of filter to monitor their AI’s outputs (likely AI based).
- Then we have a human team “play GPT-5” and try to submit dangerous outputs that the filter does not detect (w/ AI assistance etc. of course).
Is this (an example of) a control measure, and a control evaluation?
- ryan_greenblatt 1 Feb 2024 0:38 UTC
  2 points
  0
  Parent
  Yes, this seems like an example of a control measure and a control evaluation.
  
  However, the situation looks less like “human team plays gpt-5” more like “human team trains gpt-5″.
  
  (And there are additional complications once the blue team starts training the weak monitor based on gpt-5 outputs, then the red team needs to be able to demonstrate exploration hacking approaches.)