First off, I want to note that it is important that datasets and data points do not come labeled with “true distributions” and you can’t rationalize one for them after the fact. But I don’t think that’s an important point in the case of i.i.d. data.
Thus, to the extent that we’re able to train models that can do a good job for i.i.d. x′ [...] it’s because there’s an implicit prior there that’s assigning a fairly high probability to the actual distribution you used to sample the data from rather than any other of the infinitely many possible distributions. Even in the i.i.d. case, therefore, there’s still a real and meaningful sense in which your performance is coming from the machine learning prior.
Why not just apply the no-free-lunch theorem? It says the same thing. Also, why do we care about this? Empirically the no-free-lunch theorem doesn’t matter, and even if it did, I struggle to see how it has any safety implications—we’d just find that our ML model is completely unable to get any validation performance and so we wouldn’t deploy.
However, at least in the context of mesa-optimization, you can never really get i.i.d. data thanks to fundamental distributional shifts such as the the very fact that one set of data points is used in training and one set of data points is used in deployment.
Want to note I agree that you never actually get i.i.d. data and so you can’t fully solve inner alignment by making everything i.i.d. (also even if you did it wouldn’t necessarily solve inner alignment).
In any event, I think that striving for verifiability is a pretty good goal that I expect to have real benefits if it can be achieved—and I think it’s a much more well-specified goal than i.i.d.-ness.
I don’t really see this. My read of this post is that you introduced “verifiability”, argued that it has exactly the same properties as i.i.d. (since i.i.d. also gives you average-case guarantees), and then claimed it’s better specified than i.i.d. because… actually I’m not sure why, but possibly because we can never actually get i.i.d. in practice?
If that’s right, then I disagree. The way in which we lose i.i.d. in practice is stuff like “the system could predict the pseudorandom number generator” or “the system could notice how much time has passed” (e.g. via RSA-2048). But verifiability has the same obstacles and more, e.g. you can’t verify your system if it can predict which outputs you will verify, you can’t verify your system if it varies its answer based on how much time has passed, you can’t verify your system if humans will give different answers depending on random background variables like how hungry they are, etc. So I don’t see why verifiability does any better.
First off, I want to note that it is important that datasets and data points do not come labeled with “true distributions” and you can’t rationalize one for them after the fact. But I don’t think that’s an important point in the case of i.i.d. data.
I agree—I pretty explicitly make that point in the post.
Why not just apply the no-free-lunch theorem? It says the same thing. Also, why do we care about this? Empirically the no-free-lunch theorem doesn’t matter, and even if it did, I struggle to see how it has any safety implications—we’d just find that our ML model is completely unable to get any validation performance and so we wouldn’t deploy.
I agree that this is just the no free lunch theorem, but I generally prefer explaining things fully rather than just linking to something else so it’s easier to understand the text just by reading it.
The reason I care, though, is because the fact that the performance is coming from the implicit ML prior means that if that prior is malign then even in the i.i.d. case you can still get malign optimization.
I don’t really see this. My read of this post is that you introduced “verifiability”, argued that it has exactly the same properties as i.i.d. (since i.i.d. also gives you average-case guarantees), and then claimed it’s better specified than i.i.d. because… actually I’m not sure why, but possibly because we can never actually get i.i.d. in practice?
That’s mostly right, but the point is that the fact that you can’t get i.i.d. in practice matters because it means you can’t get good guarantees from it—whereas I think you can get good guarantees from verifiability.
If that’s right, then I disagree. The way in which we lose i.i.d. in practice is stuff like “the system could predict the pseudorandom number generator” or “the system could notice how much time has passed” (e.g. via RSA-2048). But verifiability has the same obstacles and more, e.g. you can’t verify your system if it can predict which outputs you will verify, you can’t verify your system if it varies its answer based on how much time has passed, you can’t verify your system if humans will give different answers depending on random background variables like how hungry they are, etc. So I don’t see why verifiability does any better.
I agree that you can’t verify your model’s answers if it can predict which outputs you will verify (though I don’t think getting true randomness will actually be that hard)—but the others are notably not problems for verifiability despite being problems for i.i.d.-ness. If the model gives answers where the tree of reasoning supporting those answers depends on how much time has passed or how hungry the human was, then the idea is to reject those answers. The point is to produce a mechanism that allows you to verify justifications for correct answers to the questions that you care about.
I agree that this is just the no free lunch theorem, but I generally prefer explaining things fully rather than just linking to something else so it’s easier to understand the text just by reading it.
Fwiw, it took me a few re-reads to realize you were just arguing for the no-free-lunch theorem—I initially thought you were arguing “since there is no ‘true’ distribution for a dataset, datasets can never be i.i.d., and so the theorems never apply in practice”.
but the others are notably not problems for verifiability despite being problems for i.i.d.-ness. If the model gives answers where the tree of reasoning supporting those answers depends on how much time has passed or how hungry the human was, then the idea is to reject those answers. The point is to produce a mechanism that allows you to verify justifications for correct answers to the questions that you care about.
I still don’t see the distinction. Let’s be particularly concrete.
Say I have D = timeseries data of new COVID cases per day for the last 14 days, and I want to predict D’ = timeseries data of new COVID cases per day for the next 14 days. Maybe our Z* ends up being “on day t there will be exp(0.1 * days since Feb 17) new cases”.
Our model for predicting on D’ is trained via supervised learning on human predictions on randomly sampled data points of D’. We then use it to predict on other randomly sampled data points of D’, thus using it in a nominally i.i.d. setting. Now, you might be worried that we learn a model that thinks something like “if RSA-2048 is not factored, then plug it into the formula from Z* and report that, otherwise say there will be a billion cases and once all the humans hide away in their houses I’ll take over the world”, which leverages one of the ways in which nominally i.i.d. data is not actually i.i.d.. How does verifiability help with this problem?
Perhaps you’d say, “with verifiability, the model would ‘show its work’, thus allowing the human to notice that the output depends on RSA-2048, and so we’d see that we have a bad model”. But this seems to rest on having some sort of interpretability mechanism—I feel like you’re not just saying “rather than i.i.d., we need perfect interpretability and that will give us better guarantees”, but I don’t know what you are saying.
Fwiw, it took me a few re-reads to realize you were just arguing for the no-free-lunch theorem—I initially thought you were arguing “since there is no ‘true’ distribution for a dataset, datasets can never be i.i.d., and so the theorems never apply in practice”.
Hmmm… I’ll try to edit the post to be more clear there.
How does verifiability help with this problem?
Because rather than just relying on doing ML in an i.i.d. setting giving us the guarantees that we want, we’re forcing the guarantees to hold by actually randomly checking the model’s predictions. From the perspective of a deceptive model, knowing that its predictions will just be trusted because the human thinks the data is i.i.d. gives it a lot more freedom than knowing that its predictions will actually be checked at random.
Perhaps you’d say, “with verifiability, the model would ‘show its work’, thus allowing the human to notice that the output depends on RSA-2048, and so we’d see that we have a bad model”. But this seems to rest on having some sort of interpretability mechanism
There’s no need to invoke interpretability here—we can train the model to give answers + justifications via any number of different mechanisms including amplification, debate, etc. The point is just to have some way to independently check the model’s answers to induce i.i.d.-like guarantees.
we’re forcing the guarantees to hold by actually randomly checking the model’s predictions.
How is this different from evaluating the model on a validation set?
I certainly agree that we shouldn’t just train a model and assume it is good; we should be checking its performance on a validation set. This is standard ML practice and is necessary for the i.i.d. guarantees to hold (otherwise you can’t guarantee that the model didn’t overfit to the training set).
Sure, but at the point where you’re randomly deciding whether to collect ground truth for a data point and check the model on it (that is, put it in the validation data) or collect new data using the model to make predictions, you have verifiability. Importantly, though, you can get verifiability without doing that—including if the data isn’t actually i.i.d. (e.g. there are detectable differences between training and deployment data) but you still decide randomly whether to check the model’s output against some ground truth. In either situation, though, part of the point that I’m making is that the safety benefits are coming from the verifiability part not the i.i.d. part—even in the simple example of i.i.d.-ness giving you validation data, what’s mattering is that the validation and deployment data are i.i.d. (because that’s what gives you verifiability), but not whether the training and validation/deployment data are i.i.d.
Importantly, though, you can get verifiability without doing that—including if the data isn’t actually i.i.d. (e.g. there are detectable differences between training and deployment data) but you still decide randomly whether to check the model’s output against some ground truth.
This is taking the deployment data, and splitting it up into validation vs. prediction sets that are i.i.d. (via random sampling), and then applying the i.i.d. theorem on results from the validation set to make guarantees on the prediction set. I agree the guarantees apply even if the training set is not from the same distribution, but the operation you’re doing is “make i.i.d. samples and apply the i.i.d. theorem”.
At this point we may just be debating semantics (though I do actually care about it in that I’m pretty opposed to new jargon when there’s perfectly good ML jargon to use instead).
Alright, I think we’re getting closer to being on the same page now. I think it’s interesting to think about what the relaxation from requiring training/validation/deployment to be i.i.d. to just requiring that validation/deployment be i.i.d. gets us, though. First, it’s an argument that we shouldn’t be that worried about whether the training data is i.i.d. relative to the validation/deployment data. Second, it opens the door to an even further relaxation, which is that you can do the validation while looking at the model’s output. That is, if the model can provide a justification which convinces you that your ground truth generator would have produced the same output, that’s just as good as actually checking against the ground truth. At that point, though, it really stops looking anything like the standard i.i.d. setup, which is why I’m hesitant to just call it “validation/deployment i.i.d.” or something.
I think it’s interesting to think about what the relaxation from requiring training/validation/deployment to be i.i.d. to just requiring that validation/deployment be i.i.d. gets us, though. First, it’s an argument that we shouldn’t be that worried about whether the training data is i.i.d. relative to the validation/deployment data.
Fair enough. In practice you still want training to also be from the same distribution because that’s what causes your validation performance to be high. (Or put differently, training/validation i.i.d. is about capabilities, and validation/deployment i.i.d. is about safety.)
That is, if the model can provide a justification which convinces you that your ground truth generator would have produced the same output, that’s just as good as actually checking against the ground truth.
This seems to rely on an assumption that “human is convinced of X” implies “X”? Which might be fine, but I’m surprised you want to rely on it.
I’m curious what an algorithm might be that leverages this relaxation.
Fair enough. In practice you still want training to also be from the same distribution because that’s what causes your validation performance to be high. (Or put differently, training/validation i.i.d. is about capabilities, and validation/deployment i.i.d. is about safety.)
Yep—agreed.
This seems to rely on an assumption that “human is convinced of X” implies “X”? Which might be fine, but I’m surprised you want to rely on it.
I’m curious what an algorithm might be that leverages this relaxation.
Well, I’m certainly concerned about relying on assumptions like that, but that doesn’t mean there aren’t ways to make it work. Approaches like debate and approval-based amplification already rely on very similar assumptions—for example, for debate to work it needs to be the case that H being convinced of X at the end of the debate implies X. Thus, one way to leverage this relaxation is just to port those approaches over to this setting. For example, you could train f(x|Z) via debate over what H(x|Z) would do if H could access the entirety of Z, then randomly do full debate rollouts during deployment. Like I mention in the post, this still just gives you average-case guarantees, not worst-case guarantees, though average-case guarantees are still pretty good and you can do a lot with them if you can actually get them.
First off, I want to note that it is important that datasets and data points do not come labeled with “true distributions” and you can’t rationalize one for them after the fact. But I don’t think that’s an important point in the case of i.i.d. data.
Why not just apply the no-free-lunch theorem? It says the same thing. Also, why do we care about this? Empirically the no-free-lunch theorem doesn’t matter, and even if it did, I struggle to see how it has any safety implications—we’d just find that our ML model is completely unable to get any validation performance and so we wouldn’t deploy.
Want to note I agree that you never actually get i.i.d. data and so you can’t fully solve inner alignment by making everything i.i.d. (also even if you did it wouldn’t necessarily solve inner alignment).
I don’t really see this. My read of this post is that you introduced “verifiability”, argued that it has exactly the same properties as i.i.d. (since i.i.d. also gives you average-case guarantees), and then claimed it’s better specified than i.i.d. because… actually I’m not sure why, but possibly because we can never actually get i.i.d. in practice?
If that’s right, then I disagree. The way in which we lose i.i.d. in practice is stuff like “the system could predict the pseudorandom number generator” or “the system could notice how much time has passed” (e.g. via RSA-2048). But verifiability has the same obstacles and more, e.g. you can’t verify your system if it can predict which outputs you will verify, you can’t verify your system if it varies its answer based on how much time has passed, you can’t verify your system if humans will give different answers depending on random background variables like how hungry they are, etc. So I don’t see why verifiability does any better.
I agree—I pretty explicitly make that point in the post.
I agree that this is just the no free lunch theorem, but I generally prefer explaining things fully rather than just linking to something else so it’s easier to understand the text just by reading it.
The reason I care, though, is because the fact that the performance is coming from the implicit ML prior means that if that prior is malign then even in the i.i.d. case you can still get malign optimization.
That’s mostly right, but the point is that the fact that you can’t get i.i.d. in practice matters because it means you can’t get good guarantees from it—whereas I think you can get good guarantees from verifiability.
I agree that you can’t verify your model’s answers if it can predict which outputs you will verify (though I don’t think getting true randomness will actually be that hard)—but the others are notably not problems for verifiability despite being problems for i.i.d.-ness. If the model gives answers where the tree of reasoning supporting those answers depends on how much time has passed or how hungry the human was, then the idea is to reject those answers. The point is to produce a mechanism that allows you to verify justifications for correct answers to the questions that you care about.
Fwiw, it took me a few re-reads to realize you were just arguing for the no-free-lunch theorem—I initially thought you were arguing “since there is no ‘true’ distribution for a dataset, datasets can never be i.i.d., and so the theorems never apply in practice”.
I still don’t see the distinction. Let’s be particularly concrete.
Say I have D = timeseries data of new COVID cases per day for the last 14 days, and I want to predict D’ = timeseries data of new COVID cases per day for the next 14 days. Maybe our Z* ends up being “on day t there will be exp(0.1 * days since Feb 17) new cases”.
Our model for predicting on D’ is trained via supervised learning on human predictions on randomly sampled data points of D’. We then use it to predict on other randomly sampled data points of D’, thus using it in a nominally i.i.d. setting. Now, you might be worried that we learn a model that thinks something like “if RSA-2048 is not factored, then plug it into the formula from Z* and report that, otherwise say there will be a billion cases and once all the humans hide away in their houses I’ll take over the world”, which leverages one of the ways in which nominally i.i.d. data is not actually i.i.d.. How does verifiability help with this problem?
Perhaps you’d say, “with verifiability, the model would ‘show its work’, thus allowing the human to notice that the output depends on RSA-2048, and so we’d see that we have a bad model”. But this seems to rest on having some sort of interpretability mechanism—I feel like you’re not just saying “rather than i.i.d., we need perfect interpretability and that will give us better guarantees”, but I don’t know what you are saying.
Hmmm… I’ll try to edit the post to be more clear there.
Because rather than just relying on doing ML in an i.i.d. setting giving us the guarantees that we want, we’re forcing the guarantees to hold by actually randomly checking the model’s predictions. From the perspective of a deceptive model, knowing that its predictions will just be trusted because the human thinks the data is i.i.d. gives it a lot more freedom than knowing that its predictions will actually be checked at random.
There’s no need to invoke interpretability here—we can train the model to give answers + justifications via any number of different mechanisms including amplification, debate, etc. The point is just to have some way to independently check the model’s answers to induce i.i.d.-like guarantees.
How is this different from evaluating the model on a validation set?
I certainly agree that we shouldn’t just train a model and assume it is good; we should be checking its performance on a validation set. This is standard ML practice and is necessary for the i.i.d. guarantees to hold (otherwise you can’t guarantee that the model didn’t overfit to the training set).
Sure, but at the point where you’re randomly deciding whether to collect ground truth for a data point and check the model on it (that is, put it in the validation data) or collect new data using the model to make predictions, you have verifiability. Importantly, though, you can get verifiability without doing that—including if the data isn’t actually i.i.d. (e.g. there are detectable differences between training and deployment data) but you still decide randomly whether to check the model’s output against some ground truth. In either situation, though, part of the point that I’m making is that the safety benefits are coming from the verifiability part not the i.i.d. part—even in the simple example of i.i.d.-ness giving you validation data, what’s mattering is that the validation and deployment data are i.i.d. (because that’s what gives you verifiability), but not whether the training and validation/deployment data are i.i.d.
This is taking the deployment data, and splitting it up into validation vs. prediction sets that are i.i.d. (via random sampling), and then applying the i.i.d. theorem on results from the validation set to make guarantees on the prediction set. I agree the guarantees apply even if the training set is not from the same distribution, but the operation you’re doing is “make i.i.d. samples and apply the i.i.d. theorem”.
At this point we may just be debating semantics (though I do actually care about it in that I’m pretty opposed to new jargon when there’s perfectly good ML jargon to use instead).
Alright, I think we’re getting closer to being on the same page now. I think it’s interesting to think about what the relaxation from requiring training/validation/deployment to be i.i.d. to just requiring that validation/deployment be i.i.d. gets us, though. First, it’s an argument that we shouldn’t be that worried about whether the training data is i.i.d. relative to the validation/deployment data. Second, it opens the door to an even further relaxation, which is that you can do the validation while looking at the model’s output. That is, if the model can provide a justification which convinces you that your ground truth generator would have produced the same output, that’s just as good as actually checking against the ground truth. At that point, though, it really stops looking anything like the standard i.i.d. setup, which is why I’m hesitant to just call it “validation/deployment i.i.d.” or something.
Fair enough. In practice you still want training to also be from the same distribution because that’s what causes your validation performance to be high. (Or put differently, training/validation i.i.d. is about capabilities, and validation/deployment i.i.d. is about safety.)
This seems to rely on an assumption that “human is convinced of X” implies “X”? Which might be fine, but I’m surprised you want to rely on it.
I’m curious what an algorithm might be that leverages this relaxation.
Yep—agreed.
Well, I’m certainly concerned about relying on assumptions like that, but that doesn’t mean there aren’t ways to make it work. Approaches like debate and approval-based amplification already rely on very similar assumptions—for example, for debate to work it needs to be the case that H being convinced of X at the end of the debate implies X. Thus, one way to leverage this relaxation is just to port those approaches over to this setting. For example, you could train f(x | Z) via debate over what H(x | Z) would do if H could access the entirety of Z, then randomly do full debate rollouts during deployment. Like I mention in the post, this still just gives you average-case guarantees, not worst-case guarantees, though average-case guarantees are still pretty good and you can do a lot with them if you can actually get them.