Fair enough. In practice you still want training to also be from the same distribution because that’s what causes your validation performance to be high. (Or put differently, training/validation i.i.d. is about capabilities, and validation/deployment i.i.d. is about safety.)
Yep—agreed.
This seems to rely on an assumption that “human is convinced of X” implies “X”? Which might be fine, but I’m surprised you want to rely on it.
I’m curious what an algorithm might be that leverages this relaxation.
Well, I’m certainly concerned about relying on assumptions like that, but that doesn’t mean there aren’t ways to make it work. Approaches like debate and approval-based amplification already rely on very similar assumptions: for debate to work, for example, it needs to be the case that H being convinced of X at the end of the debate implies X. Thus, one way to leverage this relaxation is just to port those approaches over to this setting. For example, you could train f(x|Z) via debate over what H(x|Z) would do if H could access the entirety of Z, then randomly do full debate rollouts during deployment. As I mention in the post, this still gives you only average-case guarantees, not worst-case guarantees, though average-case guarantees are still pretty good and you can do a lot with them if you can actually get them.
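To make the "randomly do full debate rollouts during deployment" step concrete, here is a rough sketch of what the spot-checking loop could look like. Everything named here is a hypothetical stand-in, not something from the post: `f` plays the role of the trained model f(x|Z), `debate_rollout` is whatever expensive procedure produces the debate-endorsed answer, and `check_prob` is an assumed sampling rate.

```python
import random


def deploy_with_spot_checks(inputs, f, debate_rollout, check_prob, rng):
    """Answer every input with f; with probability check_prob, also run
    the (expensive, hypothetical) debate_rollout and compare answers."""
    outputs = []
    checked = 0
    failures = 0
    for x in inputs:
        y = f(x)
        outputs.append(y)
        # Checks are sampled independently of x, so the checked subset is
        # a uniform random sample of the deployment inputs.
        if rng.random() < check_prob:
            checked += 1
            if debate_rollout(x) != y:
                failures += 1
    return outputs, checked, failures


def estimated_failure_rate(checked, failures):
    """Fraction of spot checks on which f disagreed with the rollout;
    None if no checks were run."""
    if checked == 0:
        return None
    return failures / checked


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins: f is wrong on odd inputs, the rollout always says 0.
    outputs, checked, failures = deploy_with_spot_checks(
        range(1000), lambda x: x % 2, lambda x: 0, 0.1, rng
    )
    print(checked, failures, estimated_failure_rate(checked, failures))
```

Because the checks are sampled at random, the measured disagreement rate only estimates how often f is wrong on average over deployment inputs. It says nothing about any particular unchecked input, which is why this gives average-case rather than worst-case guarantees, and it still leans on the assumption that the debate outcome is itself correct.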