It’s important to note that human preferences may be messy, but the mechanism by which we obtain them probably isn’t. I think the question really isn’t “What do I want (and how can I make an AI understand that)?” but rather “How do I end up wanting things (and how can I make an AI accurately predict how that process will unfold)?”
I don’t disagree with the first sentence (well, it depends on where you draw the line for “messy”).
I do mostly disagree with the second sentence.
I’m optimistic that we will eventually have a complete answer to your second question. But once we do have that answer, I think we’ll still have a hard time figuring out a specification for what we want the AI to actually do, for the reasons in my comment—in short, if we take a prospective approach (my preferences now determine what to do), then it’s hard because my preferences are self-inconsistent, invalid-out-of-distribution, they might involve ghosts, etc.; or if we take a retrospective approach (the AGI should ensure that the human is happy with how things turned out in hindsight), then we get brainwashing and so on.
I don’t really see how this is a problem. The AI should do something that is no worse, from me-now’s perspective, than whatever I myself would have done. If I have inconsistent preferences, I’m probably already struggling to figure out how to balance them myself, so it doesn’t seem reasonable to me to expect an AI to do better.
I also think a mixture of prospective and retrospective makes the most sense; every choice you make is a trade between your present and future selves, after all. So whatever the AI does should be something that both you-now and you-afterward would accept as legitimate.
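To put that slightly more formally (this is just me trying to pin down the above; the “utility functions” here are hypothetical stand-ins for my actual preferences, not something I think anyone can write down): if $a_0$ is whatever I would have done on my own and $a$ is what the AI does or recommends, then I want something like

$$U_{\text{now}}(a) \ge U_{\text{now}}(a_0) \quad \text{and} \quad U_{\text{after}}(a) \ge U_{\text{after}}(a_0),$$

i.e. the result has to be at least as good as my default from both my present perspective and my future perspective, rather than maximizing either one on its own.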
Also, my inconsistent preferences probably all agree that becoming more consistent would be desirable, though they would disagree about how to do it; so the AI would probably try to help me achieve internal consistency in a way that both me-before (all subagents) and me-after agree on, through some kind of internal arbitration (helping me figure out what I want), and then act on that.
And if my preferences involve things that don’t exist or that I don’t understand correctly, the AI may be able to extrapolate the closest real thing to my confused goal (improve the welfare of ghosts → take the utility functions of currently dead people more into account when making decisions), check whether both me-now and me-after would agree that this substitution is reasonable, and if so, do that.
Again, we’re assuming for the sake of argument that there’s an AI which completely understands an adult human’s current preferences (which are somewhat inconsistent etc.), and how those preferences would change under different circumstances. We need a specification for what this AI should do right now.
If you’re arguing that there is such a specification which is not messy, can you write down exactly what that specification is? If you already said it, I missed it. Can you put it in italics or something? :)
(Your comment said that the AI “should” or “would” do this or that a bunch of times, but I’m not sure if you’re listing various different consequences of a single simple specification that you have in mind, or if you’re listing different desiderata that must be met by a yet-to-be-determined specification.)
(Again, in my book, Stuart Armstrong research agenda v0.9 counts as rather messy.)
I think out loud a lot. Assume nearly everything I say in conversations like this is desiderata I’m listing off the top of my head with no prior planning. I’m really not good at the kind of rigorous think-before-you-speak that is normative on LessWrong.
A really bad starting point for a specification, which almost certainly has tons of holes in it: have the AI predict what I would do over a given length of time if it did not exist, and from there make small modifications to construct a variety of alternative timelines of similar things I might have done instead.
For each such timeline, it predicts how much me-now and me-after would approve of that sequence of actions, and it picks the timeline that maximizes the minimum of those two. It stops after a certain number of timelines have been considered and tells me the results. It then updates its predictions of me-now based on how I respond, and if I ask it to, it runs the process again with this new data and a new set of randomly deviating future timelines.
This would produce a relatively myopic (doesn’t look too far into the future) and satisficing (doesn’t consider too many options) advice-giving AI, one with no agency of its own, which only helps me find courses of action that I like better than whatever I would have done without its advice.
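If it helps, here’s a rough Python sketch of that loop (again, thinking out loud: `predict_my_baseline`, `perturb`, `approval_now`, and `approval_after` are hypothetical oracles we’re assuming the AI has, per the earlier assumption that it fully understands my preferences and how they’d change):

```python
# Illustrative sketch only. The four oracle functions passed in are
# hypothetical: they stand in for the assumed ability to predict my
# behavior and my approval, which nobody knows how to actually build.

def advise(predict_my_baseline, perturb, approval_now, approval_after,
           horizon, n_timelines):
    """Return the candidate timeline with the best worst-case approval."""
    # What I would have done over the horizon if the AI did not exist.
    baseline = predict_my_baseline(horizon)

    # Small perturbations of the baseline: similar things I might have
    # done instead (satisficing: only n_timelines of them, not a search
    # over every possible future).
    candidates = [baseline] + [perturb(baseline) for _ in range(n_timelines)]

    # Score each candidate by the *minimum* of me-now's and me-after's
    # predicted approval, then pick the maximin candidate.
    def worst_case_approval(timeline):
        return min(approval_now(timeline), approval_after(timeline))

    best = max(candidates, key=worst_case_approval)
    return best, worst_case_approval(best)

# Outer loop (described in prose above): show me the winning timeline,
# update the model of me-now from my reaction, and rerun with a fresh
# set of random perturbations only if I ask it to.
```

The maximin over me-now and me-after is doing the “legitimate to both selves” work from earlier, while the fixed horizon and the fixed number of candidates are what keep it myopic and satisficing rather than an open-ended optimizer.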
There are almost certainly tons of failure modes here, such as a timeline where my actions seem reasonable at first but turn me into a different person, one who also thinks the actions were reasonable but who otherwise wildly differs from me in a way that is invisible to me-now receiving the advice. But it’s a zeroth draft anyway.
(That whole thing there was another example of me thinking out loud in response to what you said, rather than anything preconceived. It’s very hard for me to do otherwise. I just get writer’s block and anxiety if I try to.)
Gotcha, thanks :) [ETA—this was in response to just the first paragraph]
I also edit my previous comments a lot after I realize there was more I ought to have said. Very bad habit. Please look back at the comment you just replied to; I edited it before realizing you’d already read it! I really need to stop doing that...
Oh, it’s fine; plenty of people edit their comments after posting, including me. I should be mindful of that by not replying immediately :-P As for the rest of your comment:
I think your comment bears a slight resemblance to Vanessa Kosoy’s “Hippocratic Timeline-Driven Learning” (Section 4.1 here), in case you haven’t already heard of that.
My suspicion is that, if one were to sort out all the details, including things like the AI-human communication protocol, such that it really works and is powerful and has no failure modes, one would wind up with something that’s at least “rather messy” (where, again, “rather messy” means “in the same messiness ballpark as Stuart Armstrong research agenda v0.9”, and “powerful” rules out literal Hippocratic Timeline-Driven Learning, IMO).