(Not an author, but Buck suggested I look at this)
I think there are a few reasons why it’s not clear from these results whether the steganography introduced here would reduce safety in an untrusted monitoring control protocol like the one in Greenblatt et al.:
- The untrusted model needs to collude reliably in order to defeat the blue team, because if it messes up even once during the upfront evaluation period (30 problems), it loses. A 90% chance of communicating correctly to the monitor on each problem gives only a 0.9^30 ≈ 4% chance of passing the upfront eval period. (This is a lossy translation of the 90% success rate in Mathew et al. For example, Greenblatt et al. targets a 1.4% overall positive rate after the upfront eval, while Mathew et al. calibrates to at least a 7% false positive rate; also, the suspiciousness scores in Greenblatt et al. come from the receiver itself.)
- From the above, the win rate is quite sensitive to the reliability of collusion, and it’s pretty unclear what the error bars around the 90% number should be (see the sketch after this list).
- The untrusted model in Greenblatt et al. not only has to implement the steganographic signal and complete the task but also has to implement a backdoor.
- Mathew et al. uses a different distribution of tasks from the APPS backdoor setting.
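To make the sensitivity point concrete, here is a minimal sketch (my own illustration, not a calculation from either paper) of how the probability of surviving a 30-problem upfront evaluation varies with per-problem collusion reliability, under the simplifying assumption that the model must collude correctly on every problem:

```python
# Illustrative only: assumes the untrusted model is caught if it fails to
# collude correctly on any of the 30 upfront-evaluation problems.
N_PROBLEMS = 30
for reliability in (0.80, 0.90, 0.95, 0.99):
    pass_prob = reliability ** N_PROBLEMS
    print(f"per-problem reliability {reliability:.2f} -> "
          f"pass probability {pass_prob:.1%}")
# reliability 0.90 gives ~4.2%, while 0.99 gives ~74%, so modest
# uncertainty in the 90% figure moves the outcome a lot.
```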
So overall I think it’s pretty unclear how well the steganography in Mathew et al. would compare to the while-loop signal that performed best in Greenblatt et al.
With appropriate collusion mitigations, I think untrusted monitoring is an excellent control protocol, so I’m glad to see work investigating more natural learned collusion capabilities and the effectiveness of countermeasures. I’d be excited to see evaluations of these steganographic strategies in the context of untrusted monitoring, so that the results have an easier interpretation in terms of safety and usefulness.
Thank you for both of your responses; they address all of my worries. Good luck on all of your future projects!