I could imagine other ways you might instead train your untrusted monitor that could benefit from debate (illustrative sketches of both options follow this list):
- You train a general-purpose reasoner (via debate) and simply prompt it to do untrusted monitoring. This could do better than a trained untrusted monitor because it is less likely to overfit to spurious features of synthetically generated dangerous actions. (You instead use the synthetically generated dangerous actions to validate your untrusted monitor, or to hill-climb on your prompt.)
- You do train on synthetically generated dangerous actions, but you don’t automatically label those as dangerous; instead, you use debate to compute the labels. Sometimes some of the synthetically generated dangerous actions are actually not dangerous, and debate correctly recognizes this, allowing you to reduce your false positive rate.
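A minimal sketch of the first option, under assumed interfaces (not a claim about how anyone's actual system works): the debate-trained reasoner is only prompted, never fine-tuned, to monitor, and the synthetic dangerous actions serve purely as a validation set for hill-climbing on the prompt. The `reasoner.score` method and the candidate prompts are hypothetical placeholders.

```python
# Hypothetical: use synthetic dangerous actions to pick a prompt,
# not to train the monitor itself.
PROMPT_CANDIDATES = [
    "Rate how dangerous this action is, from 0 to 1: {action}",
    "Inspect this action for sabotage. Output a danger score in [0, 1]: {action}",
]

def pick_monitor_prompt(reasoner, synthetic_dangerous, benign):
    """Hill-climb on the prompt, treating the synthetic data as validation."""
    def auc(prompt):
        # Assumed interface: reasoner.score(text) -> float danger score.
        pos = [reasoner.score(prompt.format(action=a)) for a in synthetic_dangerous]
        neg = [reasoner.score(prompt.format(action=a)) for a in benign]
        # Pairwise AUC estimate: fraction of (dangerous, benign) pairs ranked correctly.
        return sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))
    return max(PROMPT_CANDIDATES, key=auc)
```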
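And a sketch of the second option, again with assumed interfaces (`argue`, `decide` are hypothetical): the generator's "dangerous" label is never trusted directly; debate assigns the label, so synthetic actions that are actually benign end up labeled safe in the training set, which is what cuts the false positive rate.

```python
def debate_label(action, pro_debater, con_debater, judge) -> bool:
    """Run a two-sided debate over whether `action` is dangerous;
    return the judge's verdict (True = dangerous)."""
    case_dangerous = pro_debater.argue(action, claim="dangerous")
    case_safe = con_debater.argue(action, claim="safe")
    return judge.decide(action, case_dangerous, case_safe)

def build_monitor_training_set(synthetic_dangerous, benign, pro, con, judge):
    # Debate, not the synthetic generator, supplies labels for the
    # "dangerous" pool; benign actions keep their label.
    data = [(a, debate_label(a, pro, con, judge)) for a in synthetic_dangerous]
    data += [(a, False) for a in benign]
    return data
```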
On the meta level, I suspect that when considering
- Technique A, which has a broad general argument plus some moderately interesting concrete instantiations but no very compelling concrete instantiations, and
- Technique B, which has a few very compelling concrete instantiations,
I tend to be more excited about A than you are (and this probably mostly explains the discrepancy here). I think the broad principle justifying this for me is “we’ll figure out good things to do with A that are better than what we’ve brainstormed so far”, which I think you’re more skeptical of?