My answer will address the special case where V is the true risk and U the empirical risk of an ANN for some offline learning task. The question is more general, but I think that this special case is important and instructional.
The questions you ask seem to be closely related to statistical learning theory. Statistical learning theory studies questions such as, (a) “how many samples do I need in my training set, to be sure that the model my algorithm learned generalizes well outside the training set?” and (b) “is the given training set sufficiently representative to warrant generalization?” In the case of AI alignment, we are worried that, a model of human values trained on a limited data set will not generalize well outside this data set. Indeed, many classical failure modes that have been speculated about can be regarded as, human values coincide with a different function inside the training set (e.g. with human approval, or with the electronic reward signal fed into the AI) but not on the entire space.
The classical solutions to questions (a) and (b) for offline learning is Vapnik-Chervonenkis theory and its various offshoots. Specifically, VC dimension determines the answer to question (a) and Rademacher complexity determines the answer to question (b). The problem is, VC theory is known to be inadequate for deep learning. Specifically, the VC dimension of a typical ANN architecture is too large to explain the empirical sample complexity. Moreover, risk minimization for ANNs is known to be NP-complete (even learning the intersection of two half-spaces is already NP-complete), but gradient descent obviously does a “good enough” job in some sense, despite that most known guarantees for gradient descent only apply to convex cost functions.
Explaining these paradoxes is an active field of research, and the problem is not solved. That said, there have been some results in recent years that IMO provide some very important insights. One type of result is, algorithms different from gradient descent that provably learn ANNs under some assumptions. Goel and Klivans 2017 show how to learn an ANN with two non-linear layers. They bypass the no-go results by assuming either learning a sufficiently smooth function, or, learning a binary classification but with a sufficient margin between the positive examples and the negative examples. Zhang et al 2017 learn ANNs with any number of layers, assuming a margin and l1 regularization. Another type of result is Allen-Zhu, Li and Liang 2018 (although it wasn’t peer reviewed yet, AFAIK, and I certainly haven’t verified all the proofs EDIT: it was published in NIPS 2019). They examine a rather realistic algorithm (ReLU response, stochastic gradient descent with something similar to drop-outs) but limited to three layers. Intriguingly, they prove that the algorithm successfully achieves improper learning of a different class of functions, which have the form of somewhat smaller ANNs with smooth response. The “smoothness” serves the same role as in the previous results, to avoid the NP-completeness results. The “improperness” solves the apparent paradox of high VC dimension: it is indeed infeasible to learn a function that looks like the actual neural network, instead the neural network is just a means to learn a somewhat simpler function.
I am especially excited about the improper learning result. Assuming this result can be generalized to any number of layers (and I don’t see any theoretical reason why it can’t), the next question is understanding the expressiveness of the class that is learned. That is, it is well known that ANNs can approximate any Boolean circuit, but the “smooth” ANNs that are actually learned are somewhat weaker. Understanding this class of functions better might provide deep insights regarding the power and limitation of deep learning.
I really like the connection between optimal learning and Goodhart failures, and I’d love to think about / discuss this more. I’ve mostly thought about it in the online case, since we can sample from human preferences iteratively, and build human-in-the-loop systems as I suggested here: https://arxiv.org/abs/1811.09246 “Oversight of Unsafe Systems via Dynamic Safety Envelopes”, which I think parallels, but is less developed than one part of Paul Christiano’s approach, but I see why that’s infeasible in many settings, which is a critical issue that the offline case addresses.
I also want to note that this addresses issues of extremal model insufficiency, and to an extent regressional Goodhart, but not regime change or causal Goodhart.
As an example of the former for human values, I’d suggest that “maximize food intake” is a critical goal in starving humans, but there is a point at which the goal becomes actively harmful, and if all you see are starving humans, you need a fairly complex model of human happiness to notice that. The same regime change applies to sex, and to most other specific desires.
As an example of the latter, causal Goodhart would be where an AI system optimizes for systems that are good at reporting successful space flights, rather than optimizing for actual success—any divergence leads to a system that will kill people and lie about it.
Hi David, if you want to discuss this more, I think we can do it in person? AFAIK you live in Israel? For example, you can come to my talk in the LessWrong meetup on July 2.
My answer will address the special case where V is the true risk and U the empirical risk of an ANN for some offline learning task. The question is more general, but I think that this special case is important and instructional.
The questions you ask seem to be closely related to statistical learning theory. Statistical learning theory studies questions such as, (a) “how many samples do I need in my training set, to be sure that the model my algorithm learned generalizes well outside the training set?” and (b) “is the given training set sufficiently representative to warrant generalization?” In the case of AI alignment, we are worried that, a model of human values trained on a limited data set will not generalize well outside this data set. Indeed, many classical failure modes that have been speculated about can be regarded as, human values coincide with a different function inside the training set (e.g. with human approval, or with the electronic reward signal fed into the AI) but not on the entire space.
The classical solutions to questions (a) and (b) for offline learning is Vapnik-Chervonenkis theory and its various offshoots. Specifically, VC dimension determines the answer to question (a) and Rademacher complexity determines the answer to question (b). The problem is, VC theory is known to be inadequate for deep learning. Specifically, the VC dimension of a typical ANN architecture is too large to explain the empirical sample complexity. Moreover, risk minimization for ANNs is known to be NP-complete (even learning the intersection of two half-spaces is already NP-complete), but gradient descent obviously does a “good enough” job in some sense, despite that most known guarantees for gradient descent only apply to convex cost functions.
Explaining these paradoxes is an active field of research, and the problem is not solved. That said, there have been some results in recent years that IMO provide some very important insights. One type of result is, algorithms different from gradient descent that provably learn ANNs under some assumptions. Goel and Klivans 2017 show how to learn an ANN with two non-linear layers. They bypass the no-go results by assuming either learning a sufficiently smooth function, or, learning a binary classification but with a sufficient margin between the positive examples and the negative examples. Zhang et al 2017 learn ANNs with any number of layers, assuming a margin and l1 regularization. Another type of result is Allen-Zhu, Li and Liang 2018 (
although it wasn’t peer reviewed yet, AFAIK, and I certainly haven’t verified all the proofsEDIT: it was published in NIPS 2019). They examine a rather realistic algorithm (ReLU response, stochastic gradient descent with something similar to drop-outs) but limited to three layers. Intriguingly, they prove that the algorithm successfully achieves improper learning of a different class of functions, which have the form of somewhat smaller ANNs with smooth response. The “smoothness” serves the same role as in the previous results, to avoid the NP-completeness results. The “improperness” solves the apparent paradox of high VC dimension: it is indeed infeasible to learn a function that looks like the actual neural network, instead the neural network is just a means to learn a somewhat simpler function.I am especially excited about the improper learning result. Assuming this result can be generalized to any number of layers (and I don’t see any theoretical reason why it can’t), the next question is understanding the expressiveness of the class that is learned. That is, it is well known that ANNs can approximate any Boolean circuit, but the “smooth” ANNs that are actually learned are somewhat weaker. Understanding this class of functions better might provide deep insights regarding the power and limitation of deep learning.
I really like the connection between optimal learning and Goodhart failures, and I’d love to think about / discuss this more. I’ve mostly thought about it in the online case, since we can sample from human preferences iteratively, and build human-in-the-loop systems as I suggested here: https://arxiv.org/abs/1811.09246 “Oversight of Unsafe Systems via Dynamic Safety Envelopes”, which I think parallels, but is less developed than one part of Paul Christiano’s approach, but I see why that’s infeasible in many settings, which is a critical issue that the offline case addresses.
I also want to note that this addresses issues of extremal model insufficiency, and to an extent regressional Goodhart, but not regime change or causal Goodhart.
As an example of the former for human values, I’d suggest that “maximize food intake” is a critical goal in starving humans, but there is a point at which the goal becomes actively harmful, and if all you see are starving humans, you need a fairly complex model of human happiness to notice that. The same regime change applies to sex, and to most other specific desires.
As an example of the latter, causal Goodhart would be where an AI system optimizes for systems that are good at reporting successful space flights, rather than optimizing for actual success—any divergence leads to a system that will kill people and lie about it.
Hi David, if you want to discuss this more, I think we can do it in person? AFAIK you live in Israel? For example, you can come to my talk in the LessWrong meetup on July 2.
Yes, and yes, I’m hoping to be there.