In this case, I can pay humans to make forecasts for many randomly chosen x* in D*, train a model f to predict those forecasts, and then use f to make forecasts about the rest of D*.
The generalization is now coming entirely from human beliefs, not from the structure of the neural net; we are only applying neural nets to iid samples from D*.
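A minimal sketch of that loop, under my own illustrative assumptions (the pool `D_star`, the `human_forecast` stand-in, and the particular regressor are all placeholders, not part of the proposal):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Stand-in for the large pool D*: one feature vector per question x*.
D_star = rng.normal(size=(10_000, 16))

def human_forecast(x):
    """Placeholder for a (costly) human forecast on a single x*; purely illustrative."""
    return float(np.tanh(x[:4].sum()) * 0.5 + 0.5)

# 1. Randomly (iid) choose a subset of D* to send to human forecasters.
n_labeled = 500
labeled_idx = rng.choice(len(D_star), size=n_labeled, replace=False)
X_train = D_star[labeled_idx]
y_train = np.array([human_forecast(x) for x in X_train])

# 2. Train a model f to predict the human forecasts.
f = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
f.fit(X_train, y_train)

# 3. Use f to forecast the rest of D*. Because the labeled subset was sampled
#    iid from D*, the training and deployment distributions match by construction.
rest_idx = np.setdiff1d(np.arange(len(D_star)), labeled_idx)
predicted_forecasts = f.predict(D_star[rest_idx])
```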
Perhaps a dumb question, but don’t we now have the same problem at one remove? The model for predicting what the human would predict would still come from a “strange” prior (based on the l2 norm, or whatever).
Does the strangeness just get washed out by the one layer of indirection? Would you ever want to do two (or more) steps, and train a model to predict what a human would predict a human would predict?
The difference is that you can draw as many samples as you want from D* and they are all iid. Neural nets are fine in that regime.
Ah, I see. It sounds like the key thing I was missing was that the strangeness of the prior only matters when you’re testing on a different distribution than you trained on. (And since you can randomly sample from x* when you solicit forecasts from humans, the train and test distributions can be considered the same.)
Is that actually true though? Why is that true? Say we are training the model on a dataset of N human answers, and then we are going to deploy it to answer 10N more questions, all from the same big pool of questions. The AI can’t tell whether it is in training or deployment, but it could decide to follow a policy of giving some sort of catastrophic answer with probability 1/10N, so that it will probably make it through training just fine and then still get to cause a catastrophe.
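To spell out the arithmetic in that scenario (my numbers, purely illustrative): a policy that defects independently with probability 1/(10N) on each question survives all N training questions with probability

$$\left(1 - \tfrac{1}{10N}\right)^{N} \approx e^{-1/10} \approx 0.90,$$

while its expected number of defections over the 10N deployment questions is $10N \cdot \tfrac{1}{10N} = 1$. So it very likely looks fine in training and is still expected to cause about one catastrophe in deployment.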
That’s right—you still only get a bound on average quality, and you need to do something to cope with failures so rare they never appear in training (here’s a post reviewing my best guesses).
But before, you weren’t even in the game: it wouldn’t matter how well adversarial training worked, because you didn’t even have the knowledge to tell whether a given behavior is good or bad. You weren’t even getting the right behavior on average.
(In the OP I think the claim “the generalization is now coming entirely from human beliefs” is fine; I meant generalization from one distribution to another. “Neural nets are fine” was sweeping these issues under the rug. Though note that in the real world the distribution will change between neural net training and deployment; that’s just exactly the normal robustness problem. The point of this post is just to get it down to only a robustness problem that you could solve with some kind of generalization of adversarial training, and the reason to set it up as in the OP was to make the issue more clear.)
I agree with Daniel. Certainly training on actual iid samples from the deployment distribution helps a lot—as it ensures that your limiting behavior is correct—but in the finite data regime you can still find a deceptive model that defects some percentage of the time.
This is a good question, and I don’t know the answer. My guess is that Paul would say that that is a potential problem, but different from the one being addressed in this post. Not sure though.
Yeah, that’s my view.
Thanks for confirming.
I’m confused about this point. My understanding is that, if we sample iid examples from some dataset and then naively train a neural network on them, in the limit we may still run into universal prior problems, even during training (e.g. a learned hypothesis whose execution during inference exploits some software vulnerability in the computer that runs the training process).
In this case humans are doing the job of transferring from D to D*, and the training algorithm just has to generalize from a representative sample of D* to the test set.
Thank you, this was helpful. I hadn’t understood what was meant by “the generalization is now coming entirely from human beliefs”, but now it seems clear. (And in retrospect obvious if I’d just read/thought more carefully.)
This is a good and valid question—I agree, it isn’t fair to say generalization comes entirely from human beliefs.
An illustrative example: suppose we’re talking about deep learning, so our predicting model is a neural network. We haven’t specified the architecture of the model yet. We choose two architectures, and train both of them on our subsampled human-labeled D* items. Almost surely, these two models won’t give exactly the same outputs on every input, even in expectation. So where did this variability come from? Some sort of inductive bias from the model architecture!
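A minimal sketch of that example, with made-up data standing in for the human-labeled subsample and two arbitrary architectures (none of the specifics below come from the comment above):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

# A pool D* and a small human-labeled iid subsample of it (labels are a stand-in).
D_star = rng.normal(size=(5_000, 16))
labeled_idx = rng.choice(len(D_star), size=300, replace=False)
X_train = D_star[labeled_idx]
y_train = np.tanh(X_train[:, :4].sum(axis=1))  # placeholder for human forecasts

# Two different architectures trained on exactly the same labeled data.
f_a = MLPRegressor(hidden_layer_sizes=(256,), activation="relu", max_iter=2000, random_state=0)
f_b = MLPRegressor(hidden_layer_sizes=(32, 32, 32), activation="tanh", max_iter=2000, random_state=0)
f_a.fit(X_train, y_train)
f_b.fit(X_train, y_train)

# On the unlabeled remainder of D*, the two models will not agree exactly:
# the residual disagreement is the architecture's inductive bias showing through.
rest = np.setdiff1d(np.arange(len(D_star)), labeled_idx)
disagreement = np.mean(np.abs(f_a.predict(D_star[rest]) - f_b.predict(D_star[rest])))
print(f"mean |f_a - f_b| on the rest of D*: {disagreement:.4f}")
```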