I think the obvious extreme is a detailed microscopic model that reproduces human behavior without using the intentional stance—is this a model that doesn’t generate itself, or is this a model that assigns agency to some humans?
It would generate itself given enough compute, but you can’t, as a human, use physics to predict that humans will invent physics, without using some agency concept. Anyway, there are decision-theoretic issues with modeling yourself as a pure mechanism; to make decisions, you think of yourself as controlling what this mechanism does. (This is getting somewhat speculative; I guess my main point here is that you, in practice, have to use the intentional stance to actually predict human behavior as complex as making models of humans, which doesn’t mean an AI would.)
Does it seem clear to you that if you model a human as a somewhat complicated thermostat (perhaps making decisions according to some kind of flowchart) then you aren’t going to predict that a human would write a post about humans being somewhat complicated thermostats?
There’s wanting, and then there’s Wanting.
When I say “suppose you want something” I mean “actual wanting” with respect to the purposes of this conversation, which might map to your Wanting. It’s hard to specify exactly. The thing I’m saying here is that a notion of what “wanting” means is implicit in many discourses, including discourse on what AI we should build (notice the word “should” in that sentence).
Relevant: this discussion of proofs of the existence of God makes the similar point that perhaps proofs of God are about revealing a notion of God already implicit in the society’s discourse. I’m proposing a similar thing about “wanting”.
(note: this comment and my previous one should both be read as speculative research idea generation, not solidified confident opinions)
A fictional dialogue to illustrate:
A: Humans aren’t agents, humans don’t want things. It would be bad to make an AI that assumes these things.
B: What do you mean by “bad”?
A: Well, there are multiple metaethical theories, but for this conversation, let’s say “bad” means “not leading to what the agents in this context collectively want”.
B: Aha, but what does “want” mean?
A: …
[EDIT: what I am suggesting is something like “find your wants in your metaphysical orientation, not your ontology, although perhaps use your ontology for more information about your wants”.]
[EDIT2: Also, your metaphysical orientation might be confused, in which case the solution is to resolve that confusion, producing a new metaphysical orientation, plausibly one that doesn’t have “wanting” and for which there is therefore no proper “AI alignment” problem, although it might still have AI-related philosophical problems]
Person A isn’t getting it quite right :P Humans want things, in the usual sense that “humans want things” indicates a useful class of models I use to predict humans. But they don’t Really Want things, the sort of essential Wanting that requires a unique, privileged function from a physical state of the human to the things Wanted.
So here’s the dialogue with A’s views made more of a stand-in for my own:
A: Humans aren’t agents, by which I mean that humans don’t Really Want things. It would be bad to make an AI that assumes they do.
B: What do you mean by “bad”?
A: I mean that there wouldn’t be such a privileged Want for the AI to find in humans—humans want things, but can be modeled as wanting different things depending on the environment and level of detail of the model.
B: No, I mean how could you cash out “bad” if not in terms of what you Really Want?
A: Just in terms of what I, in the regular sense, contingently want: how I’m modeling myself right now.
B: But isn’t that a privileged model that the AI could figure out and then use to locate your wants? And since these wants are so naturally privileged, wouldn’t that make them what you Really Want?
A: The AI could do something like that, but I don’t like to think of that as finding out what I Really Want. The result isn’t going to be truly unique because I use multiple models of myself, and they’re all vague and fallible. And maybe more importantly, programming an AI to understand me “on my own terms” faces a lot of difficult challenges that don’t make sense if you think the goal is just to translate what I Really Want into the AI’s internal ontology.
B: Like what?
A: You remember the Bay Area train analogy at the end of The Tails Coming Apart as Metaphor for Life? When the train lines diverge, thinking of the problem as “figure out what train we Really Wanted” doesn’t help, and might divert people from the possible solutions, which are going to be contingent and sometimes messy.
B: But eventually you actually do follow one of the train lines, or program it into the AI, which uniquely specifies that as what you Really Want! Problem solved.
A: “Whatever I do is what I wanted to do” doesn’t help you make choices, though.
Thanks for explaining; your position makes more sense now. I think I agree with your overall point that there isn’t a “platonic Want” that can be directly inferred from physical state, at least without substantial additional psychology/philosophy investigation (which could, among other things, define bargaining solutions among the different wants).
So, there are at least a few different issues here for contingent wants:
Wants vary over time.
OK, so add a time parameter, and do what I want right now.
People could potentially use different “wanting” models for themselves.
Yes, but some models are better than others. (There’s a discussion of arbitrariness of models here which seems relevant)
In practice the brain is going to use some weighting procedure between them. If this procedure isn’t doing necessary messy work (it’s really not clear if it is), then it can be replaced with an algorithm. If it is, then perhaps the top priority for value learning is “figure out what this thingy is doing and form moral opinions about it”.
“Wanting” models are fallible.
Not necessarily a problem (but see next point); the main thing with AI alignment is to do much better than the “default” policy of having aligned humans continue to take actions, using whatever brain they have, without using AGI assistance. If people manage with having fallible “wanting” models, then perhaps the machinery people use to manage this can be understood?
“Wanting” models have limited domains of applicability.
This seems like Wei’s partial utility function problem and is related to the ontology identification problem. It’s pretty serious and is also a problem independently of value learning. Solving this problem would require either directly solving the philosophical problem, or doing psychology to figure out what machinery does ontology updates (and form moral opinions about that).
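To make the “time parameter” and “weighting procedure” ideas above a bit more concrete, here is a minimal Python sketch. It assumes (and this is a big assumption) that the weighting between “wanting” models really can be replaced by a simple algorithm, here a fixed weighted average. The names `craving`, `health`, `combined_want`, and `best_option`, along with the weights and numbers, are all hypothetical and invented for illustration; nothing here is a claim about what the brain actually does.

```python
from typing import Callable, List, Tuple

# A "wanting" model maps (option, hour of day) to a strength of wanting.
WantModel = Callable[[str, float], float]


def craving(option: str, hour: float) -> float:
    # Hypothetical model: wants dessert more as the day goes on.
    return hour / 24.0 if option == "dessert" else 0.0


def health(option: str, hour: float) -> float:
    # Hypothetical model: mildly wants salad at any time of day.
    return 0.5 if option == "salad" else 0.0


def combined_want(models: List[Tuple[float, WantModel]], option: str, hour: float) -> float:
    # Weighted average over the models; the fixed weights stand in for
    # the (unknown) weighting procedure discussed in the text.
    total_weight = sum(weight for weight, _ in models)
    return sum(weight * model(option, hour) for weight, model in models) / total_weight


def best_option(models: List[Tuple[float, WantModel]], options: List[str], hour: float) -> str:
    # "Do what I want right now": pick the option with the highest combined want.
    return max(options, key=lambda option: combined_want(models, option, hour))


models = [(1.0, craving), (1.0, health)]
print(best_option(models, ["dessert", "salad"], hour=6.0))
print(best_option(models, ["dessert", "salad"], hour=22.0))
```

With equal weights this picks salad at hour 6 and dessert at hour 22, so the “want” depends on the time parameter. It also depends on the weights, which is exactly the part that might be doing necessary messy work and might not reduce to a fixed formula like this.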
Does it seem clear to you that if you model a human as a somewhat complicated thermostat (perhaps making decisions according to some kind of flowchart) then you aren’t going to predict that a human would write a post about humans being somewhat complicated thermostats?
Is my flowchart model complicated enough to emulate an RNN? Then I’m not sure.
Or one might imagine a model that has psychological parts, but distributes the function fulfilled by “wants” in an agent model among several different pieces, which might conflict or reinforce each other depending on context. This model could reproduce human verbal behavior about “wanting” with no actual component in the model that formalizes wanting.
If this kind of model works well, it’s a counterexample (less compute-intensive than a microphysical model) to the idea I think you’re gesturing towards, which is that the data really privileges models in which there’s an agent-like formalization of wanting.
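As a toy illustration of the kind of model described here, the following Python sketch has several psychological “pieces” that each score actions from context, sometimes reinforcing and sometimes conflicting with one another. No single component represents a want, yet the model still emits “I want ...” talk. The pieces (`hunger`, `fatigue`, `habit`), the context keys, and the scoring constants are all hypothetical, chosen only to make the structure visible; this is not meant as a serious psychological model.

```python
from typing import Callable, Dict, List

Context = Dict[str, float]
Scores = Dict[str, float]


def hunger(ctx: Context) -> Scores:
    # Pushes toward eating as time since the last meal grows.
    return {"eat": ctx["hours_since_meal"] / 4.0}


def fatigue(ctx: Context) -> Scores:
    # Pushes toward sleep and against work as hours awake increase.
    return {"sleep": ctx["hours_awake"] / 16.0, "work": -ctx["hours_awake"] / 32.0}


def habit(ctx: Context) -> Scores:
    # Pushes toward work on workdays, regardless of anything else.
    return {"work": 1.0 if ctx["is_workday"] else 0.0}


PIECES: List[Callable[[Context], Scores]] = [hunger, fatigue, habit]


def act(ctx: Context) -> str:
    # Sum the pushes from every piece and take the strongest action.
    totals: Scores = {}
    for piece in PIECES:
        for action, score in piece(ctx).items():
            totals[action] = totals.get(action, 0.0) + score
    return max(totals, key=totals.get)


def verbal_report(ctx: Context) -> str:
    # Verbal behavior about "wanting", generated from the summed pushes
    # rather than from any stored representation of a want.
    return f"I want to {act(ctx)}"


rested = {"hours_since_meal": 1.0, "hours_awake": 2.0, "is_workday": 1.0}
tired = {"hours_since_meal": 1.0, "hours_awake": 20.0, "is_workday": 1.0}
print(verbal_report(rested))
print(verbal_report(tired))
```

The reported “want” flips from work to sleep between the two contexts, even though nothing in the model is labeled as a want. Whether a model like this would keep working on behavior as complex as building models of humans is the open question.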
Or one might imagine a model that has psychological parts, but distributes the function fulfilled by “wants” in an agent model among several different pieces, which might conflict or reinforce each other depending on context.
Hmm, so with enough compute (like, using parts of your brain to model the different psychological parts), perhaps you could do something like this for yourself. But you couldn’t predict the results of the behavior of people smarter than you. For example, you would have a hard time predicting that Kasparov would win a chess game against a random chess player, without being as good at chess as Kasparov yourself, though even with the intentional stance you can’t predict his actions. (You could obviously predict this using statistics, but that wouldn’t be based on just the mechanical model itself)
That is, it seems like the intentional stance often involves using much less compute than the person being modeled in order to predict that things will go in the direction of the person’s wants (limited by the person’s capabilities), without predicting each of the person’s actions.
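A toy sketch of the compute asymmetry, in Python: the “mechanical” model runs the agent’s policy step by step, while the “intentional” shortcut only uses the agent’s goal and a bound on its capabilities. The one-dimensional world, the function names, and the `max_steps` capability bound are all assumptions made up for this example. In this toy case the shortcut happens to be exact; in realistic cases it would only say roughly where things end up.

```python
def simulate_agent(start: int, goal: int, max_steps: int) -> int:
    # Mechanical model: execute every individual action the agent takes.
    position = start
    for _ in range(max_steps):
        if position == goal:
            break
        position += 1 if goal > position else -1
    return position


def intentional_prediction(start: int, goal: int, max_steps: int) -> int:
    # Intentional-stance shortcut: the agent ends up as close to what it
    # wants as its capabilities allow. No individual action is predicted.
    distance = goal - start
    reach = max(-max_steps, min(max_steps, distance))
    return start + reach


print(simulate_agent(0, 5, 10), intentional_prediction(0, 5, 10))
print(simulate_agent(0, 50, 10), intentional_prediction(0, 50, 10))
```

The shortcut takes constant time however long the agent acts, whereas the simulation costs as much as the agent’s own behavior. It also never says which step the agent takes at any given moment, which matches the Kasparov case: you predict the win, not the moves.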