I encountered serious AI safety researchers who were dismissive of the need to work on (4).
The argument against (4) is that the AI will be able to figure out our preferences since it is superintelligent, so all we need to do is ensure that it is incentivized to figure out and satisfy our preferences, and then it will do the rest. I wouldn’t dismiss work on (4), but it doesn’t seem like the highest priority given this argument.
One potential counterargument is that the AI must look like an expected utility maximizer due to coherence arguments, and so we need to figure out the utility function, but I don’t buy this argument.
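For context, the coherence arguments alluded to here are usually formalized as representation theorems; a standard statement of the general shape (a sketch, not necessarily the exact argument intended by the commenter) is the von Neumann–Morgenstern theorem: if a preference relation $\succeq$ over lotteries satisfies completeness, transitivity, continuity, and independence, then there exists a utility function $u$ such that

$$A \succeq B \iff \mathbb{E}_{A}[u] \ge \mathbb{E}_{B}[u].$$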
The argument against (4) is that the AI will be able to figure out our preferences since it is superintelligent, so all we need to do is ensure that it is incentivized to figure out and satisfy our preferences, and then it will do the rest. I wouldn’t dismiss work on (4), but it doesn’t seem like the highest priority given this argument.
I mostly view this as a kind of passing-the-buck argument in a way that is potentially dangerous, because it assumes we can keep a superintelligent AI aligned long enough that it can figure out (4) in a way we would be happy with. We maybe don't have to solve literally all of (4) and everything connected to it, but it would seem we have to solve enough of it that we can reasonably trust that a system we build will stay aligned with us as it works out the rest of the details for us.
I think of this as something like: we need an adequate theory of values, preferences, metapreferences, volition, or whatever you want to call it, such that we can evaluate whether some AI would, if we asked it to, work out (4) in a way that satisfies both that theory and us. This reduces back to having a theory of values that satisfies us well enough that it captures everything we care about in principle (even if we can't, say, work out exactly what it implies about specific behaviors in all scenarios), in a way we would endorse after operating on more data.
To make an analogy, I think we need enough values “math” to be sure we get something that hangs together as well as basing math on category theory rather than naive set theory does, and that doesn’t produce unexpected gaps, incompletenesses, or inconsistencies which only become apparent much later, after the theory has been worked out further. Basically, I’m saying that where we stand on understanding values today is about where we stood 120 years ago in mathematical foundations: we have something that kind of works, but we can tell it will probably blow up in our faces if we push it to its limits. I suspect we need something that won’t do that if we are to adequately assess whether we can safely hand off the remainder of the task to an AI, because without it there are likely to be leaps of faith in our proofs that doing so would be safe, the same way people in the past had to put up signposts saying “just don’t go over here” in order to get on with a theory of mathematical foundations.
The thing you’re describing is a theory of human preferences, not (4), which is an actual grounded definition of human preferences (implying that, in addition to the theory, we need to run some computation that produces some representation of human preferences). I was mostly arguing against requiring an actual grounded definition of human preferences.
I am unsure whether it is necessary to have a theory of human preferences or values. I agree that such a theory would help us evaluate whether a particular AI agent is going to be aligned. But how much does it help? I can certainly see other paths that don’t require it. For example, if we had a theory of optimization and agents, and a method of “pointing” optimization power at humans so that the AI is “trying to help the human”, I could imagine feeling confident enough to turn on that AI system. (It obviously depends on the details.)
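To make “pointing optimization power at humans” slightly more concrete, here is a minimal toy sketch in the spirit of assistance-game / reward-uncertainty setups (an illustrative construction with hypothetical names, not anyone’s actual proposal): the robot keeps a posterior over candidate human reward functions, updates it from an observed human choice, and then optimizes expected human reward under that posterior.

```python
import math

# Toy sketch: the robot is uncertain which reward function the human has,
# updates a posterior from observed human behavior, and then optimizes
# expected human reward under that posterior.

ACTIONS = ["make_coffee", "make_tea", "do_nothing"]

# Candidate human reward functions the robot considers possible.
CANDIDATE_REWARDS = {
    "likes_coffee": {"make_coffee": 1.0, "make_tea": 0.2, "do_nothing": 0.0},
    "likes_tea":    {"make_coffee": 0.2, "make_tea": 1.0, "do_nothing": 0.0},
}

def human_choice_prob(reward, action, beta=5.0):
    """Boltzmann-rational model of how the human picks their own actions."""
    weights = {a: math.exp(beta * reward[a]) for a in ACTIONS}
    return weights[action] / sum(weights.values())

def update_posterior(prior, observed_human_action):
    """Bayes update over which candidate reward function the human has."""
    unnormalized = {
        name: p * human_choice_prob(CANDIDATE_REWARDS[name], observed_human_action)
        for name, p in prior.items()
    }
    total = sum(unnormalized.values())
    return {name: p / total for name, p in unnormalized.items()}

def best_robot_action(posterior):
    """The robot's optimization is 'pointed at' the human: it maximizes
    expected human reward under its posterior, not a fixed objective."""
    expected = {
        a: sum(p * CANDIDATE_REWARDS[name][a] for name, p in posterior.items())
        for a in ACTIONS
    }
    return max(expected, key=expected.get)

prior = {"likes_coffee": 0.5, "likes_tea": 0.5}
posterior = update_posterior(prior, observed_human_action="make_tea")
print(posterior)                     # mass shifts toward "likes_tea"
print(best_robot_action(posterior))  # -> "make_tea"
```

The contested terms in the rest of the thread (“optimization”, “try”, “help”, “want”) are exactly what this toy hard-codes: the candidate reward set and the Boltzmann model of the human are the extra assumptions the discussion below is about.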
Ah, sounds like we were working with different definitions of “definition”.
For example, if we had a theory of optimization and agents, and a method of “pointing” optimization power at humans so that the AI is “trying to help the human”, I could imagine feeling confident enough to turn on that AI system. (It obviously depends on the details.)
That still seems dangerous to me, since I see no reason to believe it wouldn’t end up optimizing for something we didn’t want. I guess you would have a theory of optimization and agents so good you could know that it wouldn’t optimize in ways you didn’t want it to, but I think this also begs the question by hiding details in “want” that would ultimately require a sufficient theory of human preferences.
As I often say, the reason I think we need to prioritize a theory of human preferences is not that I have a slam-dunk proof that we need it, but that I believe we will fail to adequately mitigate the known risks of superintelligent AI if we don’t work on it, since we don’t, on the other side, have a slam-dunk argument for why we wouldn’t end up needing it, and I’d rather live in a world where we worked it out and didn’t need it than one where we didn’t work it out and do need it.
That still seems dangerous to me, since I see no reason to believe it wouldn’t end up optimizing for something we didn’t want. I guess you would have a theory of optimization and agents so good you could know that it wouldn’t optimize in ways you didn’t want it to
In my head, the theory + implementation ensures that all of the optimization is pointed toward the goal “try to help the human”. If you could then legitimately say “it could still end up optimizing for something else”, then we don’t have the right theory + implementation as I’m imagining it.
but I think this also begs the question by hiding details in “want” that would ultimately require a sufficient theory of human preferences.
I think it’s hiding details in “optimization”, “try” and “help” (and to a lesser extent, “human”). I don’t think it’s hiding details in “want”. You could maybe argue that any operationalization of “help” would necessarily have “want” as a prerequisite, but this doesn’t seem obvious to me.
You could also argue that any beneficial future requires us to figure out our preferences, but that wouldn’t explain why it had to happen before building superintelligent AI.
As I often say, the reason I think we need to prioritize a theory of human preferences is not that I have a slam-dunk proof that we need it, but that I believe we will fail to adequately mitigate the known risks of superintelligent AI if we don’t work on it, since we don’t, on the other side, have a slam-dunk argument for why we wouldn’t end up needing it, and I’d rather live in a world where we worked it out and didn’t need it than one where we didn’t work it out and do need it.
I agree with this, but it’s not an argument on the margin. There are many aspects of AI safety I could work on. Why a theory of human preferences in particular, as opposed to e.g. detecting optimization?
I agree with this, but it’s not an argument on the margin. There are many aspects of AI safety I could work on. Why a theory of human preferences in particular, as opposed to e.g. detecting optimization?
How we decide whether to work on one thing versus another seems to be a matter of both how important the project is overall and how likely it is that a given individual can do something about it, relative to their ability to do something about something else. That is, I don’t think of human researchers as a commodity, so much of the answer to this question is about what any one of us can do, not just what we could do if we were all equally capable of working on anything.
I think of this as a function of impact, neglectedness, tractability, skill, and interest. The first three hold constant across all people, but the last two vary. So when I choose what to work on, given that I have decided to work on AI safety, I decide also based on what I am most skilled at doing relative to others (my comparative advantage) and what I am most interested in or excited about working on (to what extent I am excited to work on something for its own sake, irrespective of how much it advances AI safety).
Weighing the first three together suggests whether it is worth anyone working on a problem, and including the last two points to whether it’s worth you working on it. So when you ask what “I could work on”, all of those factors play a role.
But if we go back to just the question of why I think a theory of human preferences is impactful, neglected, and tractable, we can take these in turn.
First, I think it’s pretty clear that developing a theory of human preferences precise enough to be useful for AI alignment is neglected. Stuart aside, everyone else I can think of who is thinking about this is still in the proto-formal stage, trying to figure out what the formalisms would even look like before making the precise statements that could form the theory. Yes, we have some theories about human preferences already, but they seem inadequate for AI safety purposes.
Second, I think we have reason to believe it’s tractable, in the sense that no one has previously needed a theory of human preferences accurate enough for the demands of AI alignment, so we haven’t spent much time really trying to solve the problem of “a precise theory of human values that doesn’t fall apart under heavy optimization”. The closest we have is the theory of preferences from behavioral economics, which I view Stuart’s work as an evolution of, and that theory was worked out over a few decades once markets became a powerful enough optimization force that it was valuable to have, so that we could get more of what we wanted from markets and other modern transactional means of interaction. Yes, this time is different, since we are working ahead of the optimization force being present in our lives, and we are doing that for reasons relevant to the last point, impact.
On the question of impact, I think Stuart addressed this question well several months ago. I also have an intuition that thinking we can get away without an adequate theory of human preferences is the same mistake we made 20 years ago, when folks argued that we didn’t have to worry about safety at all because a sufficiently intelligent AI would be smart enough to be nice to us, i.e. that sufficient intelligence would produce sufficient ethics. Of course, we now know via the orthogonality thesis that the two are not correlated, and I think of trying to build powerful optimizers without an adequate understanding of what we want them to optimize for (in this case, human preferences) as making the same mistake we made early on: thinking we will easily solve one very different problem by solving another.
All of this suggests to me that we should be making space for some people to work on a theory of human preferences, just as we make space for some people to work on agent foundations and some to work on alignment in the context of current ML systems. There’s no one overseeing AI safety as a Manhattan-style project, so the natural way to organize ourselves is not around driving toward a single objective but toward a goal we might approach by many means. Lacking clear consensus, it is worth pursuing these many means as a hedge against the likelihood that we are wrong; and we are not so resourced to the frontier that we must trade off different directions against each other, so much as we can simply expand the frontier by bringing more people into AI safety work.
I agree with basically all of this; maybe I’m more pessimistic about tractability, but not enough to matter for any actual decision.
It sounds to me like, given these beliefs, the thing you would want to advocate is “let those who want to figure out a theory of human preferences do so and don’t shun them from AI safety”. Perhaps also “let’s have some introductory articles for such a theory so that new entrants to the field know that it is a problem that could use more work and can make an informed decision about what to work on”. Both of these I would certainly agree with.
In your original comment it sounded to me like you were advocating something stronger: that a theory of human preferences was necessary for AI safety, and (by implication) at least some of us who don’t work on it should switch to working on it. In addition, we should differentially encourage newer entrants to the field to work on a theory of human preferences, rather than some other problem of AI safety, so as to build a community around (4). I would disagree with these stronger claims.
Do you perhaps only endorse the first paragraph and not the second?
I endorse what you propose in the first paragraph. I do think a theory of human preferences is necessary and that at least someone should work on it (if I didn’t think this, I probably wouldn’t be doing it myself), though not necessarily that anyone should switch to it, all else equal. Nor would I say we should encourage folks to work on it more than other problems as a general policy: there’s a lot to be done, and I remain uncertain enough about prioritization that I can’t make a strong recommendation beyond “let’s make sure we don’t fail to work on as much of what seems relevant as possible”.
So it sounds like we only disagree on the necessity aspect, and that seems to be the result of an inferential gap I’m not sure how to bridge yet; that is, why I believe it to be necessary hinges in part on deeper beliefs we may not share and that I haven’t figured out how to make explicit. That’s good to know, because it points toward something worth thinking about and addressing, so that existing and new entrants to AI safety work may be more likely to accept it as important and useful work.
so all we need to do is ensure that it is incentivized to figure out and satisfy our preferences, and then it will do the rest.
That’s actually what I’m aiming at with the research agenda, but the Occam’s razor argument shows that this itself is highly non-trivial, and we need some strong grounding of the definition of preference.
There’s a difference between “creating an explicit preference learning system” and “having a generally capable system learn preferences”. I think the former is difficult (because of the Occam’s razor argument) but the latter is not.
Suppose I told you that we built a superintelligent AI system without thinking at all about grounded human preferences. Do you think that AI system doesn’t “know” what humans would want it to do, even if it doesn’t optimize for it? (See also this failed utopia story.)
Do you think that AI system doesn’t “know” what humans would want, even if it doesn’t optimize for it?
I think the AI would not know that, because “what humans would want” is not defined. “What humans say they want”, “what, upon reflection, humans would agree they want”, etc., can be defined, but “what humans want” is not a defined thing about the world or about humans, absent extra assumptions (which cannot be deduced from observation).
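As I understand the Occam’s razor argument being referenced, the formal version of this point is that observed behavior underdetermines preferences: any observed policy $\pi$ admits many planner/reward decompositions

$$\pi = p(R) = p'(R') \quad \text{with } (p, R) \neq (p', R'),$$

including degenerate ones (e.g. a “fully rational” planner paired with a contrived reward, or a trivial reward with all the structure pushed into the planner), and simplicity priors alone do not favor the intended decomposition, so extra normative assumptions are needed.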