Buck comments on Zach Stein-Perlman’s Shortform

Buck 17 Dec 2024 1:29 UTC
LW: 4 AF: 4
> You could also use the resulting debaters at inference time to audit especially suspicious actions (run the debate again, ideally it points out why the given action is good / bad)
In cases where you’re worried about the model taking a small number of catastrophic actions (i.e. concentrated failures, a.k.a. high-stakes failures), this is basically equivalent to what I usually call untrusted monitoring, which means you have to worry about collusion.