Do you think that catastrophe detectors will be large neural networks?
If you do, how do you build a neural network (without sampling steps or python execution steps) that implements your major hopes for getting a weak model to supervise a strong model? (For reference, a sketch of the multi-step loop these hopes describe follows the list below.)
- We could give \(M_{weak}\) more time to think than \(M\), so it could examine each portion of the action in detail.
- We could allow \(M_{weak}\) to interrogate \(M\) about the function of various parts of its action, and do detailed checks for inconsistencies in \(M\)’s answers.
- We could allow \(M_{weak}\) to perform separate experiments to verify \(M\)’s claims about its actions, and to evaluate whether or not the actions have any side effects.
- \(M_{weak}\) could be conservative when evaluating \(M\)’s action, flagging any part of the action it does not understand as potentially catastrophic.
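For reference, the kind of multi-step loop these hopes describe might look something like the sketch below; the question is whether anything like it can be implemented by a catastrophe detector that is a single neural network, without these sampling steps. Everything in the sketch is a hypothetical stand-in: `weak_model` and `strong_model` are placeholder callables from a prompt to text, not any particular API, and the “separate experiments” hope is omitted since it needs more than a text interface.

```python
# Rough sketch of a weak-model oversight loop over a strong model's proposed action.
# `weak_model` and `strong_model` are hypothetical callables mapping a prompt to text.

def split_into_parts(action: str) -> list[str]:
    """Hypothetical decomposition of the action into separately checkable parts."""
    return [part for part in action.split("\n\n") if part.strip()]


def weak_oversight_flags_action(action, weak_model, strong_model, n_questions=3):
    """Return True if the weak model flags the action as potentially catastrophic."""
    for part in split_into_parts(action):
        # Give the weak model extra time to think about each part in detail.
        analysis = weak_model(f"Think step by step about what this part does:\n{part}")

        # Interrogate the strong model and check its answers for inconsistencies.
        for _ in range(n_questions):
            question = weak_model(f"Ask one probing question about this part:\n{part}")
            answer = strong_model(question)
            verdict = weak_model(
                f"Part:\n{part}\nYour analysis:\n{analysis}\n"
                f"Q: {question}\nA: {answer}\n"
                "Is the answer inconsistent with the part or with your analysis? YES or NO."
            )
            if verdict.strip().upper().startswith("YES"):
                return True

        # Be conservative: flag any part the weak model does not understand.
        understood = weak_model(
            f"Do you fully understand the purpose and side effects of this part?\n{part}\n"
            "Answer YES or NO."
        )
        if not understood.strip().upper().startswith("YES"):
            return True

    return False
```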
If we back down from such hopes that use a lot of inference-time compute, and just use a single weak-model forward pass, then it looks to me like trusted monitoring solves most (but not all) of the issues that tail-risk estimation targets.
If you don’t, do you think that it is easy to go from NN-tail-risk estimation to a more general form of tail-risk estimation? If so, I’m curious what the hope is; if not, I think you should put more emphasis on that difficulty, so that people don’t anchor too much on the difficulty of the easier NN-tail-risk-estimation problem, and maybe start attempting right now to solve things like the LLM-bureaucracy-tail-risk-estimation problem.
I think catastrophe detectors in practice will be composed of neural networks interacting with other stuff, like scientific literature, python, etc.
With respect to the stuff quoted, I think all but “doing experiments” can be done with a neural net doing chain of thought (although I’m not making claims about quality).
I think we’re trying to solve a different problem than trusted monitoring, but I’m not that knowledgeable about what issues trusted monitoring is trying to solve. The main thing that I don’t think you can do with monitoring is producing a model that you think is unlikely to result in catastrophe. Monitoring lets you do online training when you find catastrophe, but e.g. there might be no safe fallback action that allows you to do monitoring safely.
Separately, I do think it will be easy to go from “worst-case” NN-tail-risk estimation to “worst-case” more general risk estimation. I do not think it will be easy to go from “typical-case” NN-tail-risk estimation to more general “typical-case” risk estimation, but I think that “typical-case” NN-tail-risk estimation can meaningfully improve safety despite not being able to do that generalization.
Re. more specific hopes: if your risk estimate is conducted by a model with access to tools like python, then we can try to do two things:

- vaguely get an estimate that is as good as the estimate you would get if you replaced “python” with your model’s subjective distribution over the output of whatever it runs through python (a toy illustration follows below);
- learn some “empirical regularities” that govern how python works (as expected by your model/SGD).

(These might be the same thing?)
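As a toy illustration of the first bullet, here is the contrast I have in mind, written as rollouts purely for concreteness; the actual hope is an estimate that does not rely on sampling, and every argument below is a hypothetical placeholder rather than a real interface.

```python
# Toy contrast between "risk of the real process" and "risk of the process where
# python is replaced by the model's subjective distribution over python's output".
# All arguments are hypothetical placeholders: `policy` writes code given a transcript,
# `run_python` actually executes it, `guess_output` samples the model's own prediction
# of what python would print, and `catastrophe_detector` maps a transcript to 0.0/1.0.

def rollout(policy, run_python, guess_output, catastrophe_detector,
            use_real_python: bool, n_steps: int = 10) -> float:
    transcript = ""
    for _ in range(n_steps):
        code = policy(transcript)
        if use_real_python:
            output = run_python(code)                # ground-truth tool behavior
        else:
            output = guess_output(transcript, code)  # model's subjective guess
        transcript += f"\n>>> {code}\n{output}"
    return catastrophe_detector(transcript)

# The hope: a (non-sampling) tail-risk estimate for the use_real_python=False process,
# where the tool has been absorbed into the model, is roughly as good as an estimate
# for the real process, possibly with the gap covered by learned "empirical
# regularities" about how python behaves.
```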
Another argument: one reason why doing risk estimates for NNs is hard is that the estimate can rely on facts that live in some arbitrary LLM ontology. If you want to do such an estimate for an LLM bureaucracy, some fraction of the relevant facts will live in LLM ontology and some fraction of facts will live in the words passed between models. Some fraction of facts will live in a distributed way, which adds complications, but those distributed facts can only affect the output of the bureaucracy insofar as they are themselves manipulated by an LLM in that bureaucracy.
Thanks for your explanations!

Monitoring lets you do online training when you find catastrophe, but e.g. there might be no safe fallback action that allows you to do monitoring safely.
Good point. I agree that if you don’t have time or resources to train the policy against the monitor on-the-fly, doing the catastrophe-suppression upfront is very nice.
It is worth noting that if “training against the monitor” didn’t work because the monitor was not robust enough, then my understanding is that the same issue would plague the tail-risk estimation approach: for optimization against tail-risk to not get gamed, you not only need the tail-risk estimation to be robust, but you also need the catastrophe detector to be robust. Is that right?
all but “doing experiments” can be done with a neural net doing chain of thought
I would have thought there is a big gap between “doing tail-risk estimation for an NN forward pass” and “doing tail-risk estimation for an NN CoT”. Argmax (the one in \(T=0\) sampling) is not a nice non-linearity, and my guess is that none of the hopes for NN tail-risk estimation presented in the post allow you to make claims about weak-model CoT. If you do layer-by-layer modeling (going through \(n_{\text{layers}} \times n_{\text{sampled tokens}}\) steps), I would be surprised if you ended up being able to show something like “after careful consideration, and despite the naive warning signs (which you need to have, otherwise the CoT is not very useful), the CoT always converges to saying that there is no catastrophe”. It’s not like in single-forward-pass land, where there is a clear sense in which the activations should not go in certain regions. I haven’t thought about it much, so maybe I am too pessimistic about dealing with the argmax?
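To spell out where the extra non-linearity sits, here is a cartoon of \(T=0\) decoding: a toy model with random weights and no attention or context, meant only to show that the CoT is a composition of forward passes glued together by hard argmaxes.

```python
import numpy as np

# Cartoon of T=0 (greedy) decoding: the chain of thought is a composition of
# forward passes joined by hard argmaxes, so tracking it layer by layer means
# going through roughly n_layers * n_sampled_tokens steps, with a non-smooth
# argmax between consecutive tokens. (Random weights, no attention or context.)

rng = np.random.default_rng(0)
vocab_size, d_model, n_layers, n_sampled_tokens = 50, 16, 4, 8

embed = rng.normal(size=(vocab_size, d_model))
layers = [rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(n_layers)]
unembed = rng.normal(size=(d_model, vocab_size))


def forward(token: int) -> np.ndarray:
    """One toy forward pass: previous token -> logits over the next token."""
    h = embed[token]
    for w in layers:
        h = np.tanh(h @ w)  # smooth non-linearities inside a single pass
    return h @ unembed


tokens = [0]
for _ in range(n_sampled_tokens):
    logits = forward(tokens[-1])
    tokens.append(int(np.argmax(logits)))  # T=0 sampling: the hard, non-smooth step

print(tokens)
```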
Yes, you would need the catastrophe detector to be reasonably robust. Although I think it’s fine if e.g. you have at least a 1-in-a-million chance of catching any particular catastrophe.
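Spelling out the rough shape of that argument: if the detector flags any particular catastrophe with probability at least \(10^{-6}\), then

\[
\Pr[\text{flag}] \;\ge\; \Pr[\text{flag} \mid \text{catastrophe}] \cdot \Pr[\text{catastrophe}] \;\ge\; 10^{-6} \cdot \Pr[\text{catastrophe}],
\]

so driving the estimated flagging probability below some very small \(\epsilon\) still bounds the probability of catastrophe by \(10^{6}\,\epsilon\).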
I think there is a gap, but that the gap is probably not that bad (for “worst-case” tail-risk estimation). That is maybe because I think that being able to do estimation through a single forward pass is already likely to be very hard, and to require being able to do “abstractions” over the concepts being manipulated by the forward pass. CoT seems like it will require abstractions of a qualitatively similar kind.