I think it’s interesting to think about what we get from relaxing the requirement that training/validation/deployment all be i.i.d. to just requiring that validation/deployment be i.i.d., though. First, it’s an argument that we shouldn’t be that worried about whether the training data is i.i.d. relative to the validation/deployment data.
Fair enough. In practice you still want training to also be from the same distribution because that’s what causes your validation performance to be high. (Or put differently, training/validation i.i.d. is about capabilities, and validation/deployment i.i.d. is about safety.)
That is, if the model can provide a justification which convinces you that your ground truth generator would have produced the same output, that’s just as good as actually checking against the ground truth.
This seems to rely on an assumption that “human is convinced of X” implies “X”? Which might be fine, but I’m surprised you want to rely on it.
I’m curious what an algorithm that leverages this relaxation might look like.
Yep—agreed that training/validation i.i.d. is about capabilities and validation/deployment i.i.d. is about safety.
Well, I’m certainly concerned about relying on assumptions like that, but that doesn’t mean there aren’t ways to make it work. Approaches like debate and approval-based amplification already rely on very similar assumptions—for example, for debate to work it needs to be the case that H being convinced of X at the end of the debate implies X. Thus, one way to leverage this relaxation is just to port those approaches over to this setting. For example, you could train f(x|Z) via debate over what H(x|Z) would do if H could access the entirety of Z, then randomly do full debate rollouts during deployment. Like I mention in the post, this still just gives you average-case guarantees, not worst-case guarantees, though average-case guarantees are still pretty good and you can do a lot with them if you can actually get them.
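To make that proposal a bit more concrete, here is a minimal sketch of what the training/deployment loop could look like. This is not the post's algorithm, just an illustration under stated assumptions: `Model` stands in for f(x | Z), `debate_judgment` stands in for the verdict H would reach after a full debate with access to all of Z, and `spot_check_rate` is a hypothetical knob for how often deployment inputs get escalated to full debate rollouts. Because the spot-checked inputs are sampled uniformly at random from deployment queries, the checked and unchecked inputs come from the same distribution, which is the validation/deployment i.i.d. property that yields the average-case (not worst-case) guarantee.

```python
import random

def debate_judgment(x, Z):
    """Stand-in for the verdict H would reach after a full debate about x,
    with the debaters able to draw on all of Z. Toy implementation only."""
    return sum(Z.get(feature, 0) for feature in x)

class Model:
    """Toy stand-in for f(x | Z): just memorizes the debate verdicts it saw."""
    def __init__(self):
        self._table = {}
    def fit(self, inputs, targets):
        self._table = dict(zip(inputs, targets))
    def __call__(self, x):
        return self._table.get(x, 0)

def train_via_debate(model, train_inputs, Z):
    # Training signal: whatever H would conclude at the end of a debate about x given Z.
    targets = [debate_judgment(x, Z) for x in train_inputs]
    model.fit(train_inputs, targets)

def deploy(model, x, Z, spot_check_rate=0.05):
    """Answer with f(x | Z), but with probability spot_check_rate escalate to a
    full debate rollout and compare. Since spot-checked inputs are drawn
    uniformly from deployment inputs, checked and unchecked queries are i.i.d."""
    answer = model(x)
    if random.random() < spot_check_rate:
        verdict = debate_judgment(x, Z)
        if answer != verdict:
            raise RuntimeError(f"model disagreed with debate verdict on {x!r}")
    return answer

# Usage sketch
Z = {"a": 1, "b": 2}
model = Model()
train_via_debate(model, [("a",), ("b",), ("a", "b")], Z)
print(deploy(model, ("a", "b"), Z, spot_check_rate=1.0))  # always spot-check here
```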