I think you’re saying: if a thing is messy, at least there can be a non-messy procedure / algorithm that converges to (a.k.a. points to) the thing. I think I’m with Charlie in feeling skeptical about this in regards to value learning, because I think value learning is significantly a normative question. Let me elaborate:
My genes plus 1.2e9 seconds of experience have built a fundamentally messy set of preferences, which are in some cases self-inconsistent, easily-manipulated, invalid-out-of-distribution, etc. It’s easy enough to point to the set of preferences as a whole—you just say “Steve’s preferences right now”.
In fact, one might eventually (I expect) be able to write down the learning algorithm, reward function, etc., that led to those preferences (but we won’t be able to write down the many petabytes of messy training data), and we’ll be able to talk about what the preferences look like in the brain. But still, you shouldn’t and can’t directly optimize according to those preferences, because they’re self-inconsistent, invalid-out-of-distribution, they might involve ghosts, etc.
So then we have a normative question: if “fulfill Steve’s preferences” isn’t a straightforward thing, then what exactly should the AGI do? Maybe we should ask Steve what value learning ought to look like? But maybe I say “I don’t know”, or maybe I give an answer that I wouldn’t endorse upon reflection, or in hindsight. So maybe we should have the AGI do whatever Steve will endorse in hindsight? No, that leads to brainwashing.
Anyway, it’s possible that we’ll come up with an operationalization of value learning that really nails down what we think the AGI ought to do. (Let’s say, for example, something like CEV but more specific.) If we do, to what extent should we expect this operationalization to be simple and elegant, versus messy? (For example, in my book, Stuart Armstrong research agenda v0.9 counts as rather messy.) I think an answer on the messier side is quite plausible. Remember, (1) this is a normative question, and (2) that means that the foundation on which it’s built is human preferences (about what value learning ought to look like), and (3) as above, human preferences are fundamentally messy because they involve a lifetime of learning from data. This is especially true if we don’t want to trample over individual / cultural differences of opinion about (for example) the boundary between advice (good) vs manipulation (bad).
(Low confidence on all this.)
It’s important to note that human preferences may be messy, but the mechanism by which we obtain them probably isn’t. I think the question really isn’t “What do I want (and how can I make an AI understand that)?” but rather “How do I end up wanting things (and how can I make an AI accurately predict how that process will unfold)?”
I don’t disagree with the first sentence (well, it depends on where you draw the line for “messy”).
I do mostly disagree with the second sentence.
I’m optimistic that we will eventually have a complete answer to your second question. But once we do have that answer, I think we’ll still have a hard time figuring out a specification for what we want the AI to actually do, for the reasons in my comment—in short, if we take a prospective approach (my preferences now determine what to do), then it’s hard because my preferences are self-inconsistent, invalid-out-of-distribution, they might involve ghosts, etc.; or if we take a retrospective approach (the AGI should ensure that the human is happy with how things turned out in hindsight), then we get brainwashing and so on.
I don’t really see how this is a problem. The AI should do something that is no worse from me-now’s perspective than whatever I myself would have done. Given that, if my preferences are inconsistent, I am probably struggling to figure out how to balance them myself, it doesn’t seem reasonable to expect an AI to do better.
I think also a mixture of prospective and retrospective makes most sense; every choice you make is a trade between your present and future selves, after all. So whatever the AI does should be something that both you-now and you-afterward would accept as legitimate.
Also, my inconsistent preferences probably all agree that becoming more consistent would be desirable, though they would disagree about how to do this; so the AI would probably try to help me achieve internal consistency in a way both me-before (all subagents) and me-after agree upon, through some kind of internal arbitration (helping me figure out what I want) and then act upon that.
And if my preferences involve things that don’t exist or that I don’t understand correctly, the AI may be able to extrapolate the closest real thing to my confused goal (improve the welfare of ghosts → take the utility functions of currently dead people more into account in making decisions), check whether both me-now and me-after would agree that this is reasonable, and then, if so, do that.
Again, we’re assuming for the sake of argument that there’s an AI which completely understands an adult human’s current preferences (which are somewhat inconsistent etc.), and how those preferences would change under different circumstances. We need a specification for what this AI should do right now.
If you’re arguing that there is such a specification which is not messy, can you write down exactly what that specification is? If you already said it, I missed it. Can you put it in italics or something? :)
(Your comment said that the AI “should” or “would” do this or that a bunch of times, but I’m not sure if you’re listing various different consequences of a single simple specification that you have in mind, or if you’re listing different desiderata that must be met by a yet-to-be-determined specification.)
(Again, in my book, Stuart Armstrong research agenda v0.9 counts as rather messy.)
I think out loud a lot. Assume nearly everything I say in conversations like this is desiderata I’m listing off the top of my head with no prior planning. I’m really not good at the kind of rigorous think-before-you-speak that is normative on LessWrong.
A really bad starting point for a specification which almost certainly has tons of holes in it: have the AI predict what I would do up to a given length of time in the future if it did not exist, and from there make small modifications to construct a variety of different timelines for similar things I might instead have done.
In each such timeline predict how much I-now and I-after would approve of that sequence of actions, and maximize the minimum of those two. Stop after a certain number of timelines have been considered and tell me the results. Update its predictions of me-now based on how I respond, and if I ask it to, run the simulation again with this new data and a new set of randomly deviating future timelines.
This would produce a relatively myopic (doesn’t look too far into the future) and satisficing (doesn’t consider too many options) advice-giving AI, which would have no agency of its own, but would only help me find courses of action which I like better than whatever I would have done without its advice.
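To make the shape of that loop concrete, here is a minimal Python sketch of the maximin-over-timelines idea described above. Everything here is hypothetical: the prediction models (`predict_baseline`, `perturb`, `approval_now`, `approval_after`) are assumed black boxes standing in for the AI’s human-modeling capabilities, and the numbers are arbitrary. It only illustrates the satisficing maximin structure, not a workable alignment scheme.

```python
def advise(predict_baseline, perturb, approval_now, approval_after,
           horizon, n_timelines=100):
    """Satisficing advice loop (all four function arguments are
    hypothetical black-box predictors):

    predict_baseline(horizon) -> the action sequence I'd take with no AI
    perturb(timeline)         -> a small random variation of a timeline
    approval_now(timeline)    -> predicted approval of me-now
    approval_after(timeline)  -> predicted approval of me-after
    """
    baseline = predict_baseline(horizon)
    # Start from the no-AI baseline, so advice is never worse (by the
    # model's lights) than what I would have done anyway.
    best = baseline
    best_score = min(approval_now(baseline), approval_after(baseline))
    # Satisficing: a fixed budget of candidate timelines, not a full search.
    for _ in range(n_timelines):
        candidate = perturb(baseline)
        # Maximin: a timeline only wins if BOTH selves approve of it more.
        score = min(approval_now(candidate), approval_after(candidate))
        if score > best_score:
            best, best_score = candidate, score
    # The result is reported to the human as advice, not executed.
    return best, best_score
```

The `min` is what encodes “both you-now and you-afterward would accept it as legitimate,” and the fixed `n_timelines` budget is what keeps it satisficing rather than open-endedly optimizing. The update-on-my-response step would then refit `approval_now` and rerun the loop.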
There are almost certainly tons of failure modes here, such as a timeline where my actions seem reasonable at first, but turn me into a different person who also thinks the actions were reasonable, yet who otherwise wildly differs from me in a way that is invisible to the me-now receiving the advice. But it’s a zeroth draft anyway.
(That whole thing there was another example of me thinking out loud in response to what you said, rather than anything preconceived. It’s very hard for me to do otherwise. I just get writer’s block and anxiety if I try to.)
Gotcha, thanks :) [ETA—this was in response to just the first paragraph]
I also edit my previous comments a lot after I realize there was more I ought to have said. Very bad habit—look back at the comment you just replied to please, I edited it before realizing you’d already read it! I really need to stop doing that...
Oh it’s fine, plenty of people edit their comments after posting including me, I should be mindful of that by not replying immediately :-P As for the rest of your comment:
I think your comment has a slight resemblance to Vanessa Kosoy’s “Hippocratic Timeline-Driven Learning” (Section 4.1 here), if you haven’t already heard of that.
My suspicion is that, if one were to sort out all the details, including things like the AI-human communication protocol, such that it really works and is powerful and has no failure modes, you would wind up with something that’s at least “rather messy” (again, “rather messy” means “in the same messiness ballpark as Stuart Armstrong research agenda v0.9”) (and “powerful” rules out literal Hippocratic Timeline-Driven Learning, IMO).