To help me check my understanding of what you’re saying, we train an AI on a bunch of videos/media about Alice’s life, in the hope that it learns an internal concept of “Alice’s values”. Then we use SL/RL to train the AI, e.g., give it a positive reward whenever it does something that the supervisor thinks benefits Alice’s values. The hope here is that the AI learns to optimize the world according to its internal concept of “Alice’s values” that it learned in the previous step. And we hope that its concept of “Alice’s values” includes the idea that Alice wants AIs, including any future AIs, to keep improving their understanding of Alice’s values and to serve those values, and that this solves alignment in the long run.
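(As a concrete stand-in for the setup described above, here is a minimal sketch, in PyTorch, of the two-stage pipeline: unsupervised pretraining on media, then fine-tuning against a supervisor's reward. Everything in it is a placeholder, i.e. random tensors in place of the videos, a toy model, and a made-up supervisor signal; it's only meant to pin down the shape of the training procedure.)

```python
# Minimal sketch of the two-stage setup: unsupervised pretraining, then
# reward-based fine-tuning. All data here is synthetic and the supervisor's
# judgment is a hypothetical placeholder.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyModel(nn.Module):
    """Toy stand-in for the AI: encodes an observation, predicts the next one,
    and scores a candidate action."""
    def __init__(self, dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 64), nn.ReLU())
        self.predict_next = nn.Linear(64, dim)   # used in stage 1
        self.action_score = nn.Linear(64, 1)     # used in stage 2

    def forward(self, x):
        h = self.encoder(x)
        return self.predict_next(h), self.action_score(h)

model = ToyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stage 1: unsupervised pretraining on "media about Alice's life"
# (random (observation, next observation) pairs standing in for video frames).
for _ in range(200):
    obs = torch.randn(16, 32)
    next_obs = obs + 0.1 * torch.randn(16, 32)
    pred, _ = model(obs)
    # Predictive loss; the hope is that useful internal concepts form here.
    loss = ((pred - next_obs) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: SL/RL-style fine-tuning. The supervisor's reward (+1 / -1 for
# "benefits Alice's values" or not) is a made-up stand-in.
for _ in range(200):
    obs = torch.randn(16, 32)
    _, score = model(obs)
    supervisor_reward = torch.sign(obs.sum(dim=1, keepdim=True))  # hypothetical
    loss = -(score * supervisor_reward).mean()  # push scores up on rewarded cases
    opt.zero_grad()
    loss.backward()
    opt.step()
```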
Assuming the above is basically correct, this (in part) depends on the AI learning a good enough understanding of “improving understanding of Alice’s values” in step 1. This in turn (assuming “improving understanding of Alice’s values” involves “using philosophical reasoning to solve various confusions related to understanding Alice’s values, including Alice’s own confusions”) depends on the AI being able to learn a correct, or at least good enough, concept of “philosophical reasoning” from unsupervised training. Correct?
If an AI can learn “philosophical reasoning” from unsupervised training, then GPT-N should be able to do philosophy (e.g., solve open philosophical problems), right?
There are a lot of moving pieces here, so the answer is long. Apologies in advance.
I basically agree with everything up until the parts on philosophy. The point of divergence is roughly here:
assuming “improving understanding of Alice’s values” involves “using philosophical reasoning to solve various confusions related to understanding Alice’s values, including Alice’s own confusions”
I do think that resolving certain confusions around values involves solving some philosophical problems. But just because the problems are philosophical does not mean that they need to be solved by philosophical reasoning.
The kinds of philosophical problems I have in mind are things like:
What is the type signature of human values?
What kind of data structure naturally represents human values?
How do human values interface with the rest of the world?
In other words, they’re exactly the sort of questions for which “utility function” and “Cartesian boundary” are answers, but probably not the right answers.
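To make “type signature” concrete: the standard “utility function” answer, written out as a type, looks something like the sketch below (Python type hints; `WorldState` is a hypothetical placeholder, not anything from an existing codebase). The point of the questions above is that human values may well not have this shape at all.

```python
# One candidate answer to the questions above, made explicit as a type.
# `WorldState` is a hypothetical placeholder; the open question is whether
# human values actually have this shape.
from typing import Callable

class WorldState:
    """Stand-in for a complete description of the world at one time."""

# The classic "utility function" answer: values are a map from world-states
# to real numbers, and the agent acts to maximize its expected value.
UtilityFunction = Callable[[WorldState], float]

# The "Cartesian boundary" answer is baked into this signature too: the
# function takes the whole external world as input, with the valuer sitting
# implicitly outside of it.
```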
How could an AI make progress on these sorts of questions, other than by philosophical reasoning?
Let’s switch gears for a moment and talk about some analogous problems:
What is the type signature of the concept of “tree”?
What kind of data structure naturally represents “tree”?
How do “trees” (as high-level abstract objects) interface with the rest of the world?
Though they’re not exactly the same questions, these are philosophical questions of a qualitatively similar sort to the questions about human values.
Empirically, AIs already do a remarkable job reasoning about trees, and finding answers to questions like those above, despite presumably not having much notion of “philosophical reasoning”. They learn some data structure for representing the concept of tree, and they learn how the high-level abstract “tree” objects interact with the rest of the (lower-level) world. And it seems like such AIs’ notion of “tree” tends to improve as we throw more data and compute at them, at least over the ranges explored to date.
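As a hedged illustration of what “learn some data structure for representing the concept of tree” cashes out to empirically: one common check is to fit a linear probe on a trained network’s internal activations and see whether “tree vs. not tree” is linearly decodable. The sketch below uses synthetic embeddings in place of real activations; with an actual model you would substitute its hidden-layer outputs on tree and non-tree images. A positive probe result only shows the concept is decodable, not that the network uses it the way we’d hope.

```python
# Sketch of a linear probe for a "tree" concept. The embeddings here are
# synthetic stand-ins for internal activations of a trained network.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical activations: 1000 examples, 128-dimensional, where the "tree"
# concept (if learned) shows up as a consistent direction in activation space.
n, d = 1000, 128
labels = rng.integers(0, 2, size=n)            # 1 = image contained a tree
concept_direction = rng.normal(size=d)
activations = rng.normal(size=(n, d)) + np.outer(labels, concept_direction)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# High accuracy suggests the representation linearly encodes "tree";
# chance-level accuracy suggests it does not (at this layer, with this probe).
print("probe accuracy:", probe.score(X_test, y_test))
```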
In other words: empirically, we seem to be able to solve philosophical problems to a surprising degree by throwing data and compute at neural networks. Well, at least “solve” in the sense that the neural networks themselves seem to acquire solutions to the problems… not that either the neural nets or the humans gain much understanding of such problems in general.
Going up a meta level: why would this be the case? Why would solutions to philosophical problems end up embedded in random learning algorithms, without either the algorithms or the humans having a general understanding of the problems?
Well, presumably neural nets end up with a notion of “tree” for much the same reason that humans end up with a notion of “tree”: it’s a useful concept. We don’t have a precise mathematical theory of when or why it’s useful (though hopefully I have some groundwork for that), but we can see instrumental convergence to a useful concept even without understanding why the concept is useful.
In short: solutions to certain philosophical problems are probably instrumentally convergent, so the solutions will probably pop up in a fairly broad range of systems despite neither the systems nor their designers understanding the philosophical problems.
Now, so far this has talked about why solutions to philosophical problems would pop up in an AI. But does that help the AI improve its own solutions? It depends on the setup, but at the very least it offers the AI a possible path to improving its solutions to such philosophical problems without going through philosophical reasoning.
Finally, I’ll note that if humans want to be able to recognize an AI’s solutions to philosophical problems, e.g. decode a model of human values from the weights of a neural net, then we’ll probably need to make some philosophical/mathematical progress ourselves in order to do that reliably. After all, we don’t even know the type signature of the thing we’re looking for or a data structure with which to represent it.
So similarly, a human could try to understand Alice’s values in two ways. The first, equivalent to what you describe here for AI, is to just apply whatever learning algorithm their brain uses when observing Alice, and form an intuitive notion of “Alice’s values”. And the second is to apply explicit philosophical reasoning to this problem. So sure, you can possibly go a long way towards understanding Alice’s values by just doing the former, but is that enough to avoid disaster? (See Two Neglected Problems in Human-AI Safety for the kind of disaster I have in mind here.)
(I keep bringing up metaphilosophy, but I’m pretty much resigned to living in a part of the multiverse where civilization will just throw the dice and bet on AI safety not depending on solving it. What hope is there for our civilization to do what I think is the prudent thing, when no professional philosophers, not even ones in EA who are concerned about AI safety, ever talk about it?)
I mostly agree with you here. I don’t think the chances of alignment by default are high. There are marginal gains to be had, but to get a high probability of alignment in the long term we will probably need actual understanding of the relevant philosophical problems.
My take is that corrigibility is sufficient to get you an AI that understands what it means to “keep improving their understanding of Alice’s values and to serve those values”. I don’t think the AI needs to play the “genius philosopher” role, just the “loyal and trustworthy servant” role. A superintelligent AI which plays that role should be able to facilitate a “long reflection” where flesh-and-blood humans solve philosophical problems.
(I also separately think unsupervised learning systems could in principle make philosophical breakthroughs. Maybe one already has.)