IMO it’s good to separate out reasons to want good reward signals like so:
Maybe bad reward signals cause your model to “generalize in misaligned ways”, e.g. scheming or some kinds of non-myopic reward hacking
I agree that bad reward signals increase the chance your AI is importantly misaligned, though I don’t think that effect would be overwhelmingly strong.
Maybe bad reward signals cause your model to exploit those reward signals even on-distribution. This causes problems in a few ways:
Maybe you think that optimizing against a flawed reward signal will produce catastrophically dangerous results. X-risk-concerned people have talked about this for a long time, but I’m not sold that it’s that important a factor. In particular, I expect that before a model produced catastrophically dangerous actions because it was exploiting a flawed reward signal, you would have noticed bad (but non-catastrophic) outcomes from earlier AIs exploiting flawed reward signals. So it’s hard to see how this failure mode would strike you by surprise.
Optimizing against a flawed reward signal will mean your AI is less useful than it would otherwise be, which is bad because you presumably were training the model because you wanted it to do something useful for you.
I think this last one is the most important theory of change for research on scalable oversight.
I am curious whether @Rohin Shah disagrees with me, or whether he agrees but just phrased it (from my perspective) weirdly.
I agree that this distinction is important—I was trying to make this distinction by talking about p(reward hacking) vs p(scheming).
I’m not in full agreement on your comments on the theories of change:
I’m pretty uncertain about the effects of bad reward signals on propensity for scheming / non-myopic reward hacking, and in particular I think the effects could be large.
I’m less worried about purely optimizing against a flawed reward signal, though not unworried. I agree it doesn’t strike us by surprise, but I also don’t expect scheming to strike us by surprise? (I agree this is somewhat more likely for scheming.)
I do also generally feel good about making more useful AIs out of smaller models; I generally like having base models be smaller for a fixed level of competence (imo it reduces p(scheming)). Also if you’re using your AIs for untrusted monitoring then they will probably be better at it than they otherwise would be.
I don’t understand your last sentence, can you rephrase?
You’d hope that, fixing a base model, debate would make AIs better at tasks than they otherwise would be (at least if the task doesn’t have a ground truth reward). Untrusted monitoring is such a task. So hopefully debate training makes AIs better at the task of untrusted monitoring.
I don’t think this is right. To train an untrusted monitor, you need to synthetically generate dangerous actions (because by assumption these actions are rare). So during training, you’ll always know the correct label for an action before you show it to the model.
You might want to use something debate-like in the synthetic input generation process, but that’s structurally different.
I could imagine other ways you might choose to instead train your untrusted monitor, which could benefit from debate:
You train a general-purpose reasoner (via debate), and simply prompt it to do untrusted monitoring. This could do better than a trained untrusted monitor because it is less likely to overfit to spurious features of synthetically generated dangerous actions. (You instead use the synthetically generated dangerous actions to validate your untrusted monitor, or to hill climb on your prompt.)
You do train on synthetically generated dangerous actions, but you don’t automatically label those as dangerous; instead, you use debate to compute the labels. Sometimes some of the synthetically generated dangerous actions are actually not dangerous, and debate correctly recognizes this, allowing you to reduce your false positive rate.
On the meta level, I suspect that when considering
Technique A, that has a broad general argument plus some moderately-interesting concrete instantiations but no very-compelling concrete instantiations, and
Technique B, that has a few very compelling concrete instantiations
I tend to be relatively more excited about A compared to you (and probably this mostly explains the discrepancy here). I think the broad principle justifying this for me is “we’ll figure out good things to do with A that are better than what we’ve brainstormed so far”, which I think you’re more skeptical of?