Fwiw, it took me a few re-reads to realize you were just arguing for the no-free-lunch theorem—I initially thought you were arguing “since there is no ‘true’ distribution for a dataset, datasets can never be i.i.d., and so the theorems never apply in practice”.
Hmmm… I’ll try to edit the post to be more clear there.
How does verifiability help with this problem?
Because rather than just relying on doing ML in an i.i.d. setting giving us the guarantees that we want, we’re forcing the guarantees to hold by actually randomly checking the model’s predictions. From the perspective of a deceptive model, knowing that its predictions will just be trusted because the human thinks the data is i.i.d. gives it a lot more freedom than knowing that its predictions will actually be checked at random.
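To make the random-checking idea concrete, here is a minimal sketch (the function names and the `ground_truth` oracle are hypothetical stand-ins for whatever expensive verification procedure is available, not anything specified in the post):

```python
import random

def deploy_with_spot_checks(model, inputs, ground_truth, check_prob=0.1, seed=0):
    """Serve the model's predictions, but verify a random fraction of them
    against ground truth. Because the audit decision is random, the model
    cannot tell in advance which outputs will be checked."""
    rng = random.Random(seed)
    outputs, checked, disagreements = [], 0, 0
    for x in inputs:
        y = model(x)
        outputs.append(y)
        if rng.random() < check_prob:  # audit decision is unpredictable to the model
            checked += 1
            if y != ground_truth(x):
                disagreements += 1  # evidence the model can't be trusted
    return outputs, checked, disagreements
```

Any disagreement is strong evidence against the model: a deceptive model that defects on even a small fraction of inputs risks getting caught on each defection independently.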
Perhaps you’d say, “with verifiability, the model would ‘show its work’, thus allowing the human to notice that the output depends on RSA-2048, and so we’d see that we have a bad model”. But this seems to rest on having some sort of interpretability mechanism.
There’s no need to invoke interpretability here—we can train the model to give answers + justifications via any number of different mechanisms including amplification, debate, etc. The point is just to have some way to independently check the model’s answers to induce i.i.d.-like guarantees.
we’re forcing the guarantees to hold by actually randomly checking the model’s predictions.
How is this different from evaluating the model on a validation set?
I certainly agree that we shouldn’t just train a model and assume it is good; we should be checking its performance on a validation set. This is standard ML practice and is necessary for the i.i.d. guarantees to hold (otherwise you can’t guarantee that the model didn’t overfit to the training set).
Sure, but at the point where you’re randomly deciding whether to collect ground truth for a data point and check the model on it (that is, put it in the validation data) or collect new data using the model to make predictions, you have verifiability. Importantly, though, you can get verifiability without doing that—including if the data isn’t actually i.i.d. (e.g. there are detectable differences between training and deployment data) but you still decide randomly whether to check the model’s output against some ground truth. In either situation, though, part of the point that I’m making is that the safety benefits are coming from the verifiability part, not the i.i.d. part—even in the simple example of i.i.d.-ness giving you validation data, what matters is that the validation and deployment data are i.i.d. (because that’s what gives you verifiability), not whether the training and validation/deployment data are i.i.d.
Importantly, though, you can get verifiability without doing that—including if the data isn’t actually i.i.d. (e.g. there are detectable differences between training and deployment data) but you still decide randomly whether to check the model’s output against some ground truth.
This is taking the deployment data, and splitting it up into validation vs. prediction sets that are i.i.d. (via random sampling), and then applying the i.i.d. theorem on results from the validation set to make guarantees on the prediction set. I agree the guarantees apply even if the training set is not from the same distribution, but the operation you’re doing is “make i.i.d. samples and apply the i.i.d. theorem”.
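That “make i.i.d. samples and apply the i.i.d. theorem” operation can be spelled out concretely. Here is a rough sketch, using a Hoeffding bound as a standard choice for the i.i.d. concentration result (everything else—names, the error predicate—is a hypothetical illustration):

```python
import math
import random

def audit_and_bound(deployment_points, is_error, n_check=200, delta=0.05, seed=0):
    """Randomly split deployment points into a checked (validation) set and
    an unchecked (prediction) set; the random split makes the two sets i.i.d.
    draws from the same pool, so the checked error rate bounds the rest."""
    rng = random.Random(seed)
    points = list(deployment_points)
    rng.shuffle(points)
    checked, unchecked = points[:n_check], points[n_check:]
    empirical_error = sum(map(is_error, checked)) / n_check
    # Hoeffding's inequality: with probability >= 1 - delta, the underlying
    # error rate is within eps of the empirical estimate on the checked set.
    eps = math.sqrt(math.log(2 / delta) / (2 * n_check))
    return empirical_error, empirical_error + eps
```

Note that the training set never appears here: the bound is driven entirely by the random checked/unchecked split of the deployment data.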
At this point we may just be debating semantics (though I do actually care about it in that I’m pretty opposed to new jargon when there’s perfectly good ML jargon to use instead).
Alright, I think we’re getting closer to being on the same page now. I think it’s interesting to think about what the relaxation from requiring training/validation/deployment to be i.i.d. to just requiring that validation/deployment be i.i.d. gets us, though. First, it’s an argument that we shouldn’t be that worried about whether the training data is i.i.d. relative to the validation/deployment data. Second, it opens the door to an even further relaxation, which is that you can do the validation while looking at the model’s output. That is, if the model can provide a justification which convinces you that your ground truth generator would have produced the same output, that’s just as good as actually checking against the ground truth. At that point, though, it really stops looking anything like the standard i.i.d. setup, which is why I’m hesitant to just call it “validation/deployment i.i.d.” or something.
I think it’s interesting to think about what the relaxation from requiring training/validation/deployment to be i.i.d. to just requiring that validation/deployment be i.i.d. gets us, though. First, it’s an argument that we shouldn’t be that worried about whether the training data is i.i.d. relative to the validation/deployment data.
Fair enough. In practice you still want training to also be from the same distribution because that’s what causes your validation performance to be high. (Or put differently, training/validation i.i.d. is about capabilities, and validation/deployment i.i.d. is about safety.)
That is, if the model can provide a justification which convinces you that your ground truth generator would have produced the same output, that’s just as good as actually checking against the ground truth.
This seems to rely on an assumption that “human is convinced of X” implies “X”? Which might be fine, but I’m surprised you want to rely on it.
I’m curious what an algorithm might be that leverages this relaxation.
Fair enough. In practice you still want training to also be from the same distribution because that’s what causes your validation performance to be high. (Or put differently, training/validation i.i.d. is about capabilities, and validation/deployment i.i.d. is about safety.)
Yep—agreed.
This seems to rely on an assumption that “human is convinced of X” implies “X”? Which might be fine, but I’m surprised you want to rely on it.
I’m curious what an algorithm might be that leverages this relaxation.
Well, I’m certainly concerned about relying on assumptions like that, but that doesn’t mean there aren’t ways to make it work. Approaches like debate and approval-based amplification already rely on very similar assumptions—for example, for debate to work it needs to be the case that H being convinced of X at the end of the debate implies X. Thus, one way to leverage this relaxation is just to port those approaches over to this setting. For example, you could train f(x|Z) via debate over what H(x|Z) would do if H could access the entirety of Z, then randomly do full debate rollouts during deployment. Like I mention in the post, this still just gives you average-case guarantees, not worst-case guarantees, though average-case guarantees are still pretty good and you can do a lot with them if you can actually get them.
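As a sketch of what porting that over might look like, with `run_debate` as a stub for a full debate rollout over what H(x|Z) would do (none of this is a real debate implementation; the names and parameters are hypothetical):

```python
import random

def deploy_with_debate_audits(model, inputs, run_debate, audit_prob=0.05, seed=0):
    """Serve fast answers from the trained model f(x|Z); with small random
    probability, pay for a full debate rollout and compare its verdict to
    the model's answer. Mismatches flag a potentially deceptive model."""
    rng = random.Random(seed)
    outputs, flagged = [], []
    for x in inputs:
        y = model(x)
        outputs.append(y)
        if rng.random() < audit_prob:
            if run_debate(x) != y:  # expensive full rollout as ground truth
                flagged.append(x)
    return outputs, flagged
```

Because the audits are random, this only constrains the model's average behavior: it can catch a model that defects with nontrivial frequency, but cannot rule out a single catastrophic defection on an unaudited input—matching the average-case-not-worst-case caveat above.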