On applying generalization bounds to AI alignment. In January, Buck gave a talk for the Winter MLAB. He argued that we know how to train AIs which answer on-distribution questions at least as well as the labeller does. I was skeptical. IIRC, his argument had the following structure:
Premises:
1. We are labelling data according to some function f and training against a loss function L.
2. We train the network on datapoints (x, f(x)), with x sampled from the training distribution D_train.
3. Learning theory results give (f, L) generalization bounds for training on samples from D_train, i.e., bounds on the network's expected loss (under L, against f's labels) over D_train itself, not just on the sampled datapoints (for instance, a uniform-convergence bound of the kind sketched below).
Conclusions:
4. The network should match f's labels, on average, on fresh samples from D_train, i.e., held-out data from the same distribution it was trained on.
5. In particular, when f represents our judgments, the network should be able to correctly answer questions which we ourselves could correctly answer.
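To make (3) concrete: the kind of result I take the premise to be gesturing at is a textbook uniform-convergence bound (my gloss, not necessarily the exact statement Buck had in mind). For a finite hypothesis class H, a loss L bounded in [0, 1], and n i.i.d. samples x_1, ..., x_n from D_train, with probability at least 1 - δ:

```latex
\forall h \in H:\quad
\mathbb{E}_{x \sim D_{\text{train}}}\!\big[L(h(x), f(x))\big]
\;\le\;
\frac{1}{n}\sum_{i=1}^{n} L\big(h(x_i), f(x_i)\big)
\;+\;
\sqrt{\frac{\ln\lvert H\rvert + \ln(2/\delta)}{2n}}.
```

Even in this idealized form, the bound only controls average loss over the training distribution, and it says nothing about which function the network actually implements or how it behaves off-distribution.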
I think (5) is probably empirically true, but not for the reasons I understood Buck to give, and I don’t think I can have super high confidence in this empirical claim. Anyways, I’m mostly going to discuss premise (3).

It seems like (3) is false (and no one gave me a reference to this result), at least for one really simple task from The Pitfalls of Simplicity Bias in Neural Networks (Table 2, p. 8).
In this classification task, a 2-layer MLP with 100 hidden neurons was trained to convergence. It learned a linear boundary down the middle, classifying everything to the right as red, and to the left as blue. Then, it memorized all of the exceptions. Thus, its validation accuracy was only 95%, even though it was validated on more synthetic data from the same crisply specified historical training distribution.
However, conclusion (4) doesn’t hold in this situation. Conclusion (4) would have predicted that the network learns something like a piecewise linear classifier which achieves >99% accuracy on validation data from the same distribution.
The MLP was expressive enough to learn this, according to the authors, but it didn’t.
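To make the failure mode concrete, here is a minimal Python sketch (my reconstruction under simplifying assumptions, not the paper’s exact dataset or training code) of this kind of task: one coordinate’s sign predicts the label 95% of the time, while a slightly more complex “slab” feature on the other coordinate predicts it perfectly.

```python
# Hypothetical reconstruction of a "noisy linear feature + perfectly predictive
# slab feature" task, in the spirit of the paper's synthetic datasets.
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
n = 5000
y = rng.integers(0, 2, size=n)

# Coordinate 0: sign agrees with the label only 95% of the time (noisy linear feature).
sign = np.where(rng.random(n) < 0.95, 2 * y - 1, 1 - 2 * y)
x0 = sign * rng.uniform(0.05, 1.0, size=n)

# Coordinate 1: falls into one of six "slabs"; slab parity equals the label (perfect feature).
slab = 2 * rng.integers(0, 3, size=n) + y
x1 = (slab + rng.uniform(0.1, 0.9, size=n)) / 3.0 - 1.0  # rescale to roughly [-1, 1]

X = torch.tensor(np.stack([x0, x1], axis=1), dtype=torch.float32)
Y = torch.tensor(y, dtype=torch.long)

# 2-layer MLP with 100 hidden neurons, trained to convergence on the full dataset.
model = nn.Sequential(nn.Linear(2, 100), nn.ReLU(), nn.Linear(100, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(20_000):
    opt.zero_grad()
    loss_fn(model(X), Y).backward()
    opt.step()
```

The pattern reported in the paper is that a network like this latches onto the 95%-accurate linear feature and memorizes the exceptions, rather than using the slab feature that would classify held-out data from the same distribution at >99% accuracy.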
Now, it’s true that this kind of problem is rather easy to catch in practice—sampling more validation points quickly reveals the generalization failure. However, I still think we can’t establish high confidence (>99%) in (5: the AI has dependable question-answering abilities for questions similar enough to training). From the same paper,
Given that neural networks tend to heavily rely on spurious features [45, 52], state-of-the-art accuracies on large and diverse validation sets provide a false sense of security; even benign distributional changes to the data (e.g., domain shifts) during prediction time can drastically degrade or even nullify model performance.
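In the toy sketch above, the analogous “benign” shift would be resampling the noisy coordinate independently of the label while leaving the perfectly predictive slab feature untouched; a model that leans on the linear feature collapses to chance, while one that uses the slab feature wouldn’t notice the change at all. Continuing the hypothetical sketch:

```python
# Evaluate the trained model from the sketch above on a "benignly" shifted test set:
# the slab feature is unchanged (still perfectly predictive), but the noisy
# coordinate is now sampled independently of the label.
with torch.no_grad():
    y_test = rng.integers(0, 2, size=n)
    x0_shift = rng.uniform(-1.0, 1.0, size=n)  # no longer correlated with the label
    slab_test = 2 * rng.integers(0, 3, size=n) + y_test
    x1_test = (slab_test + rng.uniform(0.1, 0.9, size=n)) / 3.0 - 1.0
    X_test = torch.tensor(np.stack([x0_shift, x1_test], axis=1), dtype=torch.float32)
    acc = (model(X_test).argmax(dim=1).numpy() == y_test).mean()
    print(f"accuracy under the shift: {acc:.2%}")  # near 50% if the model relied on x0
```

If the model had learned the slab feature instead, its accuracy would stay near 100% under this shift, which is the sense in which the shift is “benign.”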
And I think many realistic use cases of AI Q&A can, in theory, involve at least “benign” distributional changes, where, to our eyes, there hasn’t been any detectable distributional shift—and where generalization still fails horribly. But now I’d anticipate Buck / Paul / co would have some other unknown counterarguments, and so I’ll just close off this “short”form for now.
This paper shows just how strong neural network simplicity biases are, and also gives some intuition for how the simplicity bias of neural networks is different from something like a circuit simplicity bias or Kolmogorov simplicity bias. E.g., neural networks don’t seem all that opposed to memorization. The paper shows examples of neural networks learning a simple linear feature which imperfectly classifies the data, then memorizing the remaining noise, despite there being a slightly more complex feature which perfectly classifies the training data (and I’ve checked, there’s no grokking phase transition, even after 2.5 million optimization steps with weight decay).
Quintin Pope also wrote: