Jozdien comments on Why imperfect adversarial robustness doesn’t doom AI control

Jozdien 18 Nov 2024 20:39 UTC
9 points
2
What if model CoTs become less transparent? I think o1 has shadows of this: the CoT summaries often contain a couple of lines in a completely different language, which makes me think that the actual CoT looks weirder than the reasoning CoTs of previous models. That could be harder for either weaker or different models to monitor reliably. Do you expect the divergence not to be large enough to pose a problem, or that there are other ways to address it, or is it “yeah you shouldn’t do RL on your model CoTs, that would be crazy”?
- Satron 19 Nov 2024 21:15 UTC
  8 points
  0
  Parent
  I also wonder what Buck thinks about CoTs becoming less transparent (especially in light of recent o1 developments).