A fictional dialogue to illustrate:
A: Humans aren’t agents, humans don’t want things. It would be bad to make an AI that assumes these things.
B: What do you mean by “bad”?
A: Well, there are multiple metaethical theories, but for this conversation, let’s say “bad” means “not leading to what the agents in this context collectively want”.
B: Aha, but what does “want” mean?
A: …
[EDIT: what I am suggesting is something like “find your wants in your metaphysical orientation, not your ontology, although perhaps use your ontology for more information about your wants”.]
[EDIT2: Also, your metaphysical orientation might be confused, in which case the solution is to resolve that confusion, producing a new metaphysical orientation, plausibly one that doesn’t have “wanting” and for which there is therefore no proper “AI alignment” problem, although it might still have AI-related philosophical problems]
Person A isn’t getting it quite right :P Humans want things, in the usual sense that “humans want things” indicates a useful class of models I use to predict humans. But they don’t Really Want things, the sort of essential Wanting that requires a unique, privileged function from a physical state of the human to the things Wanted.
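(To make that distinction concrete, here’s a rough type-level sketch in Python. Everything in it is my own made-up illustration, not a claim about how any of this should be formalized: “Really Wanting” would be one privileged map from physical state to wants, while the “useful class of models” view only gives a context-indexed family with no canonical member.)

```python
from typing import Callable, Dict, List

# Purely illustrative types; the names here are my own invention.
PhysicalState = Dict[str, float]   # stand-in for a full physical description of a person
Wants = List[str]                  # stand-in for "the things Wanted"

# "Really Wanting" would require one privileged function of this type to exist:
#     really_want: Callable[[PhysicalState], Wants]
#
# The "useful class of models" view only gives a context-indexed family,
# with no canonical way to single out one member as *the* wants:
want_models: Dict[str, Callable[[PhysicalState], Wants]] = {
    "coarse folk-psychology model": lambda state: ["comfort", "social standing"],
    "fine-grained in-the-moment model": lambda state: ["finish this sentence"],
}
```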
So here’s the dialogue again, with A’s views being more of an insert of my own:
A: Humans aren’t agents, by which I mean that humans don’t Really Want things. It would be bad to make an AI that assumes they do.
B: What do you mean by “bad”?
A: I mean that there wouldn’t be such a privileged Want for the AI to find in humans—humans want things, but can be modeled as wanting different things depending on the environment and level of detail of the model.
B: No, I mean how could you cash out “bad” if not in terms of what you Really Want?
A: Just in terms of what I regularly, contingently want—how I’m modeling myself right now.
B: But isn’t that a privileged model that the AI could figure out and then use to locate your wants? And since these wants are so naturally privileged, wouldn’t that make them what you Really Want?
A: The AI could do something like that, but I don’t like to think of that as finding out what I Really Want. The result isn’t going to be truly unique because I use multiple models of myself, and they’re all vague and fallible. And maybe more importantly, programming an AI to understand me “on my own terms” faces a lot of difficult challenges that don’t make sense if you think the goal is just to translate what I Really Want into the AI’s internal ontology.
B: Like what?
A: You remember the Bay Area train analogy at the end of The Tails Coming Apart as Metaphor for Life? When the train lines diverge, thinking of the problem as “figure out what train we Really Wanted” doesn’t help, and might divert people from the possible solutions, which are going to be contingent and sometimes messy.
B: But eventually you actually do follow one of the train lines, or program it into the AI, which uniquely specifies that as what you Really Want! Problem solved.
A: “Whatever I do is what I wanted to do” doesn’t help you make choices, though.
Thanks for explaining, your position makes more sense now. I think I agree with your overall point that there isn’t a “platonic Want” that can be directly inferred from physical state, at least without substantial additional psychology/philosophy investigation (which could, among other things, define bargaining solutions among the different wants).
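(As a throwaway illustration of the “bargaining solutions” aside: here’s the textbook Nash bargaining rule applied to a toy case, with two made-up “wanting” models scoring three made-up options. It’s a sketch of one standard aggregation rule, not a proposal for how value learning should actually combine wants.)

```python
import numpy as np

# Hypothetical example: two of my "wanting" models score the same options
# differently; a Nash-style bargaining solution is one of many possible ways
# to aggregate them into a single choice. All names and numbers are made up.
options = ["career", "leisure", "compromise"]
u_model_a = np.array([1.0, 0.1, 0.7])   # wants as modeled in one context
u_model_b = np.array([0.2, 1.0, 0.8])   # wants as modeled in another context
disagreement = np.array([0.0, 0.0])     # payoff to each model if no agreement

# Nash bargaining over the discrete option set: maximize the product of gains
# over the disagreement point, restricted to options that improve on it.
gains_a = u_model_a - disagreement[0]
gains_b = u_model_b - disagreement[1]
feasible = (gains_a > 0) & (gains_b > 0)
nash_product = np.where(feasible, gains_a * gains_b, -np.inf)
print(options[int(np.argmax(nash_product))])  # -> "compromise"
```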
So, there are at least a few different issues here for contingent wants:
Wants vary over time.
OK, so add a time parameter, and do what I want right now.
People could potentially use different “wanting” models for themselves.
Yes, but some models are better than others. (There’s a discussion of the arbitrariness of models here that seems relevant.)
In practice the brain is going to use some weighting procedure between them. If this procedure isn’t doing necessary messy work (it’s really not clear if it is), then it can be replaced with an algorithm. If it is, then perhaps the top priority for value learning is “figure out what this thingy is doing and form moral opinions about it”.
“Wanting” models are fallible.
Not necessarily a problem (but see the next point); the main thing with AI alignment is to do much better than the “default” policy of having aligned humans continue to take actions, using whatever brains they have, without AGI assistance. If people manage despite having fallible “wanting” models, then perhaps the machinery they use to manage this can be understood?
“Wanting” models have limited domains of applicability.
This seems like Wei’s partial utility function problem and is related to the ontology identification problem. It’s pretty serious and is also a problem independently of value learning. Solving it would require either directly solving the philosophical problem, or doing psychology to figure out what machinery does ontology updates (and forming moral opinions about that).
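To gesture at what the limited-domain problem looks like concretely, here’s a toy sketch (my own construction, not Wei’s formalism or a proposed fix): a utility function defined only over an old ontology can evaluate new-ontology states only through a bridge map, and nothing makes that map unique or total.

```python
from typing import Dict, Optional

# Toy illustration only; all states and numbers are made up.
# The "utility function" is partial: it only covers the old ontology.
old_utility: Dict[str, float] = {"train_stays_on_shared_track": 1.0}

def extended_utility(new_state: str, bridge: Dict[str, str]) -> Optional[float]:
    """Evaluate a new-ontology state by translating it back, if possible."""
    old_state = bridge.get(new_state)
    return old_utility.get(old_state) if old_state is not None else None

# Two equally defensible bridge maps disagree after the "train lines" diverge,
# and neither covers every new state -- the limited-domain problem.
bridge_1 = {"follow_line_A": "train_stays_on_shared_track"}
bridge_2 = {"follow_line_B": "train_stays_on_shared_track"}
print(extended_utility("follow_line_A", bridge_1))  # 1.0
print(extended_utility("follow_line_A", bridge_2))  # None (undefined under this map)
```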