One approach would be to take current-me, put current-me through a variety of virtual environments with fake memories that start from the current time (without removing my real memories), and use whatever is inferred from that as my utility function. (Basically, treat all experiences and memories up to the current time as “part of me”, and treat that as the initial state from which you are trying to determine a utility function.)
But more generally, if you think that a different set of life experiences means that you are a different person with different values, then that’s a really good reason to assume that the whole framework of getting the true human utility function is doomed. Not just ambitious value learning, _any_ framework that involves an AI optimizing some expected utility would not work.
> But more generally, if you think that a different set of life experiences means that you are a different person with different values, then that’s a really good reason to assume that the whole framework of getting the true human utility function is doomed.
Maybe it’s not that bad? For example, I can imagine learning the human utility function in two stages. The first stage uses the current human to learn a partial utility function (or some other kind of data structure) about how they want their life to go prior to figuring out their full utility function. E.g., perhaps they want a safe and supportive environment to think, talk to other humans, and solve various philosophical problems related to figuring out one’s utility function, with various kinds of assistance, safeguards, etc. from the AI (but otherwise no strong optimizing forces acting upon them). In the second stage, the AI uses that information to compute a distribution over “preferred” future lives and then learns the full utility function only from those lives.
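As a toy illustration of this two-stage structure, here is a minimal runnable sketch. Everything in it (the candidate lives, the elicited meta-preferences, the trivial averaging at the end) is invented purely for illustration; the point is only the data flow: stage one elicits how the human wants the reflection process to go, and stage two learns the full utility function only from lives that satisfy those meta-preferences.

```python
from dataclasses import dataclass

@dataclass
class Life:
    safe: bool               # no strong outside optimizing forces
    can_consult_others: bool
    outcome_scores: dict     # how strongly this life realizes each candidate value

# Stage 1: partial "meta" preferences elicited from the current human about how
# they want the value-clarification process itself to go.
meta_preferences = {"safe": True, "can_consult_others": True}

candidate_lives = [
    Life(safe=True,  can_consult_others=True,  outcome_scores={"autonomy": 0.9, "comfort": 0.4}),
    Life(safe=True,  can_consult_others=False, outcome_scores={"autonomy": 0.7, "comfort": 0.8}),
    Life(safe=False, can_consult_others=True,  outcome_scores={"autonomy": 0.1, "comfort": 0.9}),
]

# Stage 2: restrict attention to the "preferred" lives (those matching the
# stage-1 meta-preferences), then fit the full utility function from those alone.
preferred = [
    life for life in candidate_lives
    if life.safe == meta_preferences["safe"]
    and life.can_consult_others == meta_preferences["can_consult_others"]
]

def fitted_utility(value: str) -> float:
    # Stand-in for whatever value-learning algorithm would actually be used:
    # here, just the average realization of `value` across the preferred lives.
    return sum(life.outcome_scores[value] for life in preferred) / len(preferred)

print(fitted_utility("autonomy"))  # learned only from lives the human endorsed
```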
Another possibility: if we could design an Oracle AI that is really good at answering philosophical questions (including understanding what our confused questions mean), we could just ask it “What is my utility function?”
So I would argue that your proposal is one example of how you could learn a utility function from humans, assuming you know the full human policy; you are proposing that we pay attention to a very small part of the human policy (the part that specifies our answers to the question “how do we want our life to go” at the current time, and then the part that specifies our behavior in the “preferred” future lives).
You can think of this as ambitious value learning with a hardcoded structure by which the AI is supposed to infer the utility function from behavior. (A mediocre analogy: AlphaGo Zero learns to play Go with a hardcoded structure of MCTS.) As a result, you would still need to grapple with the arguments against ambitious value learning brought up in subsequent posts—primarily, that you need to have a good model of the mistakes that humans make in order to do better than humans would themselves. In your proposal, I think the mistake model is “everything that humans do could be mistaken, but when they talk about how they want their life to go, they are not mistaken about that”. This seems like a better mistake model than most, and it could work—but we are hardcoding in an assumption about humans here that could be misspecified. (E.g., humans say they want autonomy and freedom from manipulation, but actually they would have been better off if they had let the AI make arguments to them about what they care about.)
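To make “mistake model” a bit more concrete, here is a toy, runnable sketch; the candidate utility functions, the observed choices, and the Boltzmann noise model are all invented for illustration. Ordinary behavior is treated as noisily rational, so it can contain mistakes, while the stated “how I want my life to go” answer is treated as exactly right; that exactness is the hardcoded assumption doing the work.

```python
import math

# Three hypotheses about the human's true utility function over toy outcomes.
HYPOTHESES = {
    "values_reflection": {"reflection": 1.0, "comfort": 0.3, "status": 0.1},
    "values_comfort":    {"reflection": 0.2, "comfort": 1.0, "status": 0.4},
    "values_status":     {"reflection": 0.6, "comfort": 0.5, "status": 1.0},
}

# Ordinary observed behavior: (chosen, rejected) pairs, possibly mistaken.
ordinary_choices = [("comfort", "reflection"), ("reflection", "status"), ("comfort", "status")]

# Stated life-direction preference: treated as noise-free under this mistake model.
stated_choices = [("reflection", "comfort")]

def boltzmann_likelihood(utility, chosen, rejected, beta=2.0):
    """P(human picks `chosen` over `rejected`) under noisy (Boltzmann) rationality."""
    a = math.exp(beta * utility[chosen])
    b = math.exp(beta * utility[rejected])
    return a / (a + b)

posterior = {}
for name, utility in HYPOTHESES.items():
    likelihood = 1.0
    for chosen, rejected in ordinary_choices:    # mistakes allowed here
        likelihood *= boltzmann_likelihood(utility, chosen, rejected)
    for chosen, rejected in stated_choices:      # mistakes *not* allowed here
        if utility[chosen] <= utility[rejected]:
            likelihood = 0.0
    posterior[name] = likelihood

total = sum(posterior.values())
posterior = {name: p / total for name, p in posterior.items()}
print(posterior)  # the stated preference rules out "values_comfort" entirely
```

If that hardcoded exactness is misspecified (as in the parenthetical above), the inference confidently zeroes out hypotheses it should not.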
> In your proposal, I think the mistake model is “everything that humans do could be mistaken, but when they talk about how they want their life to go, they are not mistaken about that”.
Ok, this is helpful for making a connection between my way of thinking and the “mistake model” way, but it seems a bit of a stretch, since I almost certainly am mistaken (or suboptimal) about how I want my life to go. I only want autonomy and freedom from manipulation because I don’t know how to let an AI manipulate me (i.e., make arguments to me about my values) in a way that would be safe and lead to good results. If I did, I might well let the AI do that and save myself the trouble and risk of trying to figure out my values on my own.
Yeah, I agree that the mistake model implied by your proposal isn’t correct, and as a result you would not infer the true utility function. Of course, you might still infer one that is sufficiently close that we get a great future.
To be clear, I do think there are lots of other useful ways of thinking about the problem that are not captured by the “mistake model” framing. I use the “mistake model” way of thinking because it often shows a different perspective on a proposal, and helps pinpoint what you’re relying on in your alignment proposal.
Of course this is all assuming that there does exist a true utility function, but I think we can replace “true utility function” with “utility function that encodes the optimal actions to take for the best possible universe” and everything still follows through. But of course, not hitting this target just means that we don’t do the perfectly optimal thing—it’s totally possible that we end up doing something that is only very slightly suboptimal.
> Of course this is all assuming that there does exist a true utility function, but I think we can replace “true utility function” with “utility function that encodes the optimal actions to take for the best possible universe” and everything still follows through.
The replacement feels just as obscure to me as the original.
What do you mean by “obscure”?

People often argue “there is no true utility function for humans” because we often do contradictory things, which implies that we violate the VNM axioms. However, in theory you could look at all action sequences, rank them, take the best one, and find a utility function for which that action sequence is optimal, and you could call that the utility function that you want. That utility function exists as long as you agree that an ordering over action sequences exists, which seems very reasonable.

TL;DR: the point of that reframing is to overcome objections that the “true utility function” doesn’t exist.
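Spelled out, one simple construction (assuming only a finite or countable set of action sequences and a top-ranked element $a^*$) is

$$U(a) = \begin{cases} 1 & \text{if } a = a^* \\ 0 & \text{otherwise,} \end{cases}$$

or $U(a) = -\operatorname{rank}(a)$ if you want the whole ordering reflected; either way, the top-ranked action sequence is exactly the one that maximizes $U$.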
Thanks! I think I understand the intent of the rephrasing now.
What I meant by “obscure” is that both “true utility function” and “utility function that encodes the optimal actions to take for the best possible universe” have normative terminology in them that I don’t know how to reduce or operationalize.
For instance, imagine I am looking at action sequences and ranking them. Presumably large portions of that process would feel like difficult judgment calls where I’d feel nervous about still making some kind of mistake. Both your phrasings (to my ears) carry the connotation that there is a “best” mistake model, one which is in a relevant sense independent from our own judgment, where we can learn things that will make us more and more confident that now we’re probably not making mistakes anymore because of progress in finding the correct way of thinking about our values. That’s the part that feels obscure to me because I think we’ll always be in this unsatisfying epistemic situation where we’re nervous about making some kind of mistake by the light of a standard that we cannot properly describe.
I do get the intuition for thinking in these terms, though. It feels conceivable that some future discovery could improve our thinking in the way the discovery of cognitive biases did, and I definitely agree that we want a concept for staying open to this possibility. I’m just pointing out that non-operationalized normative concepts seem obscure. (Though maybe that’s fine if we’re treating them in the same way Yudkowsky treats “magic reality fluid” – as a placeholder for whatever comes once we’re less confused about “measure”.)
> What I meant by “obscure” is that both “true utility function” and “utility function that encodes the optimal actions to take for the best possible universe” have normative terminology in them that I don’t know how to reduce or operationalize.
Oh yeah, I was definitely speaking normatively there.
> For instance, imagine I am looking at action sequences and ranking them. Presumably large portions of that process would feel like difficult judgment calls where I’d feel nervous about still making some kind of mistake.
Agreed, I’m just saying that in principle there exists some “best” way of making those calls.
> Both your phrasings (to my ears) carry the connotation that there is a “best” mistake model, one which is in a relevant sense independent from our own judgment
Agreed that I’m assuming there is a “best” mistake model, but I wouldn’t say that it has to be independent from our own judgment.
> But more generally, if you think that a different set of life experiences means that you are a different person with different values, then that’s a really good reason to assume that the whole framework of getting the true human utility function is doomed. Not just ambitious value learning, _any_ framework that involves an AI optimizing some expected utility would not work.
This statement feels pretty strong, especially given that I find it trivially true that I’d be a different person under many plausible alternative histories. This makes me think I’m probably misinterpreting something. :)
At first I read your paragraph as the strong claim that if it’s true that individual human values are underdetermined at birth, then ambitious value learning looks doomed. And I’d take it as proof for “individual human values are underdetermined at birth” if, replaying history, I’d now have different values (or a different probability distribution over values) if I had encountered Yudkowsky’s writings before Singer’s, rather than vice-versa. Or if I would be less single-minded about altruism had I encountered EA a couple of years later in life, after already taking on another self-identity.
But these points (especially the second example) seem so trivially true that I’m probably talking about a different thing. In addition, they’re addressed by the solution you propose in your first paragraph, namely taking current-you as the starting point.
Another concern could be that “there is almost never a stable core of an individual human’s values”, i.e., that “even going forward from today, the values of Lukas or Rohin or Wei are going to be heavily underdetermined”. Is that the concern? This seems like it could be possible for most people, but definitely not for all people. And underdetermined values are not necessarily that bad (though I find it mildly disconcerting, personally). [Edit: Wei’s comment and your reply to it sound like this might indeed be the concern. :) Good discussion there!]
The fact that I have a hard time understanding the framework behind your statement is probably because I’m thinking in terms of a different part of my brain when I talk about “my values”. I identify very much with my reflective life goals to a point that seems unusual. I don’t identify much with “What Lukas’s behavior, if you were to put him in different environments and then watch, would indirectly consistently tell you about the things he appears to want – e.g., ‘values’ like being held in high esteem by others, having a comfortable life, romance, having either some kind of overarching purpose or enough distractions to not feel bothered by the lack of purpose, etc.”. There is definitely a sense in which the code that runs me cares about all these implicit goals. But that’s not how I most want to see it. I also know that in all the environments that offer the option to self-modify into a more efficient pursuer of explicitly held personal ideals, I would make substantial use of that option. And that seems relevant for the same reason that we wouldn’t want to count cognitive biases as people’s values.
(I should probably continue reading the sequence and then come back to this later if I still feel unclear about it.)
> Another concern could be that “there is almost never a stable core of an individual human’s values”, i.e., that “even going forward from today, the values of Lukas or Rohin or Wei are going to be heavily underdetermined”. Is that the concern?
Yeah. Also I suspect some people are worried about taking current-you as a starting point—that seems somewhat arbitrary. But if you’re fine with that, then the major concern is that values are still underdetermined going forward.
> The fact that I have a hard time understanding the framework behind your statement is probably because I’m thinking in terms of a different part of my brain when I talk about “my values”. I identify very much with my reflective life goals to a point that seems unusual.
I interpreted Wei’s comment as saying that even your reflective life goals would be underdetermined—presumably even now if you hear convincing moral argument A but not B, then you’d have different reflective life goals than if you hear B but not A. This seems broadly correct to me.
> I interpreted Wei’s comment as saying that even your reflective life goals would be underdetermined—presumably even now if you hear convincing moral argument A but not B, then you’d have different reflective life goals than if you hear B but not A.
Okay yeah, that also seems broadly correct to me.
I am hoping, though, that as long as I’m not subjected to outside optimization pressures that weren’t crafted to be helpful, it will be very rare that whether something I currently consider very important stays important or becomes completely unimportant depends merely on the order in which I encounter new arguments. And similarly, I’m hoping that my value endpoints would still cluster decisively around the things I currently consider most important, though that’s where it becomes tricky to trade off goal preservation against openness to philosophical progress.
No, I’m not planning to tackle this issue.