Monitoring lets you do online training when you find a catastrophe, but, e.g., there might be no safe fallback action that allows you to do monitoring safely.
Good point. I agree that if you don’t have time or resources to train the policy against the monitor on-the-fly, doing the catastrophe-suppression upfront is very nice.
It is worth noting that if “training against the monitor” didn’t work because the monitor was not robust enough, then my understanding is that the same issue would plague the tail-risk estimation approach: for optimization against the tail-risk estimate not to get gamed, you not only need the tail-risk estimation to be robust, but you also need the catastrophe detector to be robust. Is that right?
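To make the worry concrete, here is a minimal toy sketch (my own illustration, not anything from the post; the detector, losses, and numbers are all made up) of how optimizing against a tail-risk penalty that bottoms out in a non-robust catastrophe detector just steers the policy into the detector’s blind spot:

```python
import torch

# Hedged toy (my illustration, not the post's method): a 1-D "policy" parameter x
# is trained against a task loss plus a penalty from a catastrophe detector. The
# detector only fires for roughly 1 < x < 3, while the (hypothetical) true
# catastrophe region is x > 4, so the detector is not robust. Optimization pushes
# x past the detector's coverage and into the true catastrophe region, and any
# tail-risk estimate built on this detector reports that everything is fine.

def detector(x):
    # Non-robust learned detector: a bump that covers only part of the bad region.
    return torch.sigmoid(5 * (x - 1)) * torch.sigmoid(5 * (3 - x))

def task_loss(x):
    # Task reward happens to pull x toward 5, inside the true catastrophe region.
    return (x - 5.0) ** 2

x = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.SGD([x], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    loss = task_loss(x) + 2.0 * detector(x)  # "optimize against estimated tail risk"
    loss.backward()
    opt.step()

print(x.item(), detector(x).item())  # x ends up near 5 (catastrophic), detector ≈ 0
```

Nothing in the optimization “sees” that x > 4 is bad; the only thing that could have stopped it is the detector’s coverage, which is why the detector’s robustness seems load-bearing.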
all but “doing experiments” can be done with a neural net doing chain of thought
I would have thought there is a big gap between “doing tail-risk estimation for an NN forward pass” and “doing tail-risk estimation for an NN CoT”. Argmax (the one in T=0 sampling) is not a nice non-linearity, and my guess is that none of the hopes for NN tail-risk estimation presented in the post allow you to make claims about weak-model CoT. If you do layer-by-layer modeling (going through n_layers x n_sampled_tokens steps), I would be surprised if you ended up being able to show something like “after careful consideration, and despite the naive warning signs (which you need to have, otherwise the CoT is not very useful), the CoT always converges to saying that there is no catastrophe”. It’s not like in single-forward-pass land, where there is a clear sense in which the activations should not go into certain regions. I haven’t thought about it much, so maybe I am too pessimistic about dealing with the argmax?
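As a concrete picture of where that argmax sits (a hedged sketch with a made-up `TinyLM` stand-in, not a construction from the post):

```python
import torch

class TinyLM(torch.nn.Module):
    # Stand-in for any autoregressive LM: maps a token prefix to next-token logits.
    def __init__(self, vocab=32, dim=16):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, vocab)
        )

    def forward(self, ids):                      # ids: (seq_len,) int64
        return self.mlp(self.emb(ids).mean(0))   # (vocab,) logits: the "nice" part

def greedy_cot_rollout(model, prompt_ids, n_sampled_tokens):
    ids = prompt_ids
    for _ in range(n_sampled_tokens):
        logits = model(ids)                      # one full forward pass (n_layers of "nice" nonlinearities)
        next_id = torch.argmax(logits)           # the hard, discontinuous step: T=0 sampling
        ids = torch.cat([ids, next_id.view(1)])  # the branch taken feeds every later pass
    return ids

print(greedy_cot_rollout(TinyLM(), torch.tensor([1, 2, 3]), n_sampled_tokens=8))
```

Any layer-by-layer estimate has to be pushed through one of these hard branches per sampled token (n_layers x n_sampled_tokens sub-steps in total), and a small change in activations can flip the argmax and therefore everything the later forward passes even see, which is the sense in which there is no smooth “forbidden region” of activations to reason about.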
Yes, you would need the catastrophe detector to be reasonably robust. Although I think it’s fine if, e.g., you have at least a 1/million chance of catching any particular catastrophe.
I think there is a gap, but that the gap is probably not that bad (for “worst case” tail-risk estimation). That is maybe because I think being able to do estimation through a single forward pass is likely already very hard, and likely to require being able to do “abstractions” over the concepts being manipulated by the forward pass. CoT seems like it will require abstractions of a qualitatively similar kind.
Thanks for your explanations!