There are a lot of moving pieces here, so the answer is long. Apologies in advance.
I basically agree with everything up until the parts on philosophy. The point of divergence is roughly here:
assuming “improving understanding of Alice’s values” involves “using philosophical reasoning to solve various confusions related to understanding Alice’s values, including Alice’s own confusions”
I do think that resolving certain confusions around values involves solving some philosophical problems. But just because the problems are philosophical does not mean that they need to be solved by philosophical reasoning.
The kinds of philosophical problems I have in mind are things like:
What is the type signature of human values?
What kind of data structure naturally represents human values?
How do human values interface with the rest of the world?
In other words, they’re exactly the sort of questions for which “utility function” and “Cartesian boundary” are answers, but probably not the right answers.
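To make the "utility function plus Cartesian boundary" answer concrete, here is a minimal sketch of what that standard answer looks like as a type signature. The names (`WorldState`, `UtilityFunction`) are hypothetical placeholders, not taken from any existing framework; the point is only that this type is *an* answer to the three questions above, even if probably not the right one.

```python
from typing import Callable, TypeVar

# Hypothetical placeholder: a complete description of the world outside the agent,
# presupposing a clean Cartesian boundary between "agent" and "environment".
WorldState = TypeVar("WorldState")

# The classic answer: "human values" have the type of a utility function,
# i.e. a map from world-states to real numbers, to be maximized in expectation.
UtilityFunction = Callable[[WorldState], float]
```

The complaint in the surrounding discussion is precisely that the real data structure for human values is probably not this type.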
How could an AI make progress on these sorts of questions, other than by philosophical reasoning?
Let’s switch gears a moment and talk about some analogous problems:
What is the type signature of the concept of “tree”?
What kind of data structure naturally represents “tree”?
How do “trees” (as high-level abstract objects) interface with the rest of the world?
Though they’re not exactly the same questions, these are philosophical questions of a qualitatively similar sort to the questions about human values.
Empirically, AIs already do a remarkable job reasoning about trees, and finding answers to questions like those above, despite presumably not having much notion of “philosophical reasoning”. They learn some data structure for representing the concept of tree, and they learn how the high-level abstract “tree” objects interact with the rest of the (lower-level) world. And it seems like such AIs’ notion of “tree” tends to improve as we throw more data and compute at them, at least over the ranges explored to date.
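As an illustration of what "learn some data structure for representing the concept of tree" might cash out to operationally, here is a minimal sketch of a linear probe on hidden activations. The arrays below are random stand-ins, not activations from any particular model; with real activations and labels, a probe like this succeeding on held-out data would be (weak) evidence that the network has an internal feature tracking "tree".

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for real data: `acts` would be hidden-layer activations of a trained
# network on a batch of images, and `is_tree` would mark which images contain trees.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 512))
is_tree = rng.integers(0, 2, size=1000)

# Fit a linear probe on the first 800 examples, evaluate on the remaining 200.
probe = LogisticRegression(max_iter=1000)
probe.fit(acts[:800], is_tree[:800])
print("held-out probe accuracy:", probe.score(acts[800:], is_tree[800:]))
```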
In other words: empirically, we seem to be able to solve philosophical problems to a surprising degree by throwing data and compute at neural networks. Well, at least “solve” in the sense that the neural networks themselves seem to acquire solutions to the problems… not that either the neural nets or the humans gain much understanding of such problems in general.
Going up a meta level: why would this be the case? Why would solutions to philosophical problems end up embedded in random learning algorithms, without either the algorithms or the humans having a general understanding of the problems?
Well, presumably neural nets end up with a notion of “tree” for much the same reason that humans end up with a notion of “tree”: it’s a useful concept. We don’t have a precise mathematical theory of when or why it’s useful (though I hope I’ve laid some groundwork for one), but we can see instrumental convergence to a useful concept even without understanding why the concept is useful.
In short: solutions to certain philosophical problems are probably instrumentally convergent, so the solutions will probably pop up in a fairly broad range of systems despite neither the systems nor their designers understanding the philosophical problems.
Now, so far this has been about why solutions to philosophical problems would pop up in an AI in the first place. Does that help the AI improve those solutions? That depends on the setup, but at the very least it offers the AI a possible path to improving its solutions to such philosophical problems without going through philosophical reasoning.
Finally, I’ll note that if humans want to be able to recognize an AI’s solutions to philosophical problems, e.g. decode a model of human values from the weights of a neural net, then we’ll probably need to make some philosophical/mathematical progress ourselves in order to do that reliably. After all, we don’t even know the type signature of the thing we’re looking for or a data structure with which to represent it.
So similarly, a human could try to understand Alice’s values in two ways. The first, equivalent to what you describe here for AI, is to just apply whatever learning algorithm their brain uses when observing Alice, and form an intuitive notion of “Alice’s values”. And the second is to apply explicit philosophical reasoning to this problem. So sure, you can possibly go a long way towards understanding Alice’s values by just doing the former, but is that enough to avoid disaster? (See Two Neglected Problems in Human-AI Safety for the kind of disaster I have in mind here.)
(I keep bringing up metaphilosophy but I’m pretty much resigned to living in a part of the multiverse where civilization will just throw the dice and bet on AI safety not depending on solving it. What hope is there for our civilization to do what I think is the prudent thing, when no professional philosophers, even ones in EA who are concerned about AI safety, ever talk about it?)
I mostly agree with you here. I don’t think the chances of alignment by default are high. There are marginal gains to be had, but to get a high probability of alignment in the long term we will probably need actual understanding of the relevant philosophical problems.