By “short” I mean short in sense (1) and (2). “Short” doesn’t imply anything about senses (3), (4), (5), or (6) (and “short” and “long” don’t seem like good words to describe those axes, though I’ll keep using them in this comment for consistency).
By “preferences-on-reflection” I mean long in sense (3) and neither in sense (6). There is a hypothesis that “humans with AI help” is a reasonable way to capture preferences-on-reflection, but they aren’t defined to be the same. I don’t use understandable and evaluable in this way.
I think (4) and (5) are independent axes. (4) just sounds like “is your AI good at optimizing,” not a statement about what it’s optimizing. In the discussion with Eliezer I’m arguing against it being linked to any of these other axes. (5) is a distinction between two senses in which an AI can be “optimizing my short-term preferences-on-reflection.”
When discussing perfect estimations of preferences-on-reflection, I don’t think the short vs. long distinction is that important. “Short” is mostly important when talking about ways in which an AI can fall short of perfectly estimating preferences-on-reflection.
Assuming my interpretation is correct, my confusion is that you say we shouldn’t expect a situation where “the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy” (I take you to be talking about sense (3) from above). It seems like the user-on-reflection and the current user would disagree about many things (that is the whole point of reflection), so if the AI acts in accordance with the intentions of the user-on-reflection, the current user is likely to end up unhappy.
I introduced the term “preferences-on-reflection” in the previous comment to make a particular distinction. It’s probably better to say something like “actual preferences” (though this is also likely to be misinterpreted). The important property is that I’d rather have an AI that satisfies my actual preferences than any other kind of AI. We could also say “better by my lights” or something else.
There’s a hypothesis that “what I’d say after some particular idealized process of reflection” is a reasonable way to capture “actual preferences,” but I think that’s up for debate. For example, it could fail if me-on-reflection is selfish and has values opposed to current-me. It could certainly fail for any particular process of reflection, so it might just happen to be the case that no process of reflection satisfies it.
The claim I usually make is that “what I’d say after some particular idealized process of reflection” describes the best mechanism we can hope to find for capturing “actual preferences,” because whatever else we might do to capture “actual preferences” can just be absorbed into that process of reflection.
“Actual preferences” is a pretty important concept here, and I don’t think we could get around the need for it. I’m not sure whether there is disagreement about this concept or just about the term being used for it.
I’m really confused why “short” would include sense (1) rather than only sense (2). If “corrigibility is about short-term preferences on reflection” then this seems to be a claim that corrigible AI should understand us as preferring to eat candy and junk food, because on reflection we do like how it tastes, and we just choose not to eat it because of longer-term concerns. So a corrigible system ignores the longer-term concerns and interprets us as wanting candy and junk food.
Perhaps you intend sense (1) where “short” means ~100 years, rather than ~10 minutes, so that the system doesn’t interpret us as wanting candy and junk food. But this similarly creates problems when we think longer than 100 years; the system wouldn’t take those thoughts seriously.
It seems much more sensible to me for “short” in the context of this discussion to mean (2) only. But perhaps I misunderstood something.
One of us just misunderstood (1); I don’t think there is any difference.
I mean preferences about what happens over the near future, but the way I rank “what happens in the near future” will likely be based on its consequences (further in the future, in other possible worlds, etc.). So I took (1) to be basically equivalent to (2).
“Terminal preferences over the near future” is not a thing I often think about and I didn’t realize it was a candidate interpretation (normally when I write about short-term preferences I’m writing about things like control, knowledge, and resource acquisition).
The reason I brought up this distinction was that in Ambitious vs. narrow value learning you wrote:
It does not make any effort to pursue wildly different short-term goals than I would in order to better realize my long-term values, though it may help me correct some errors that I would be able to recognize as such.
which made me think that when you say “short-term” or “narrow” (I’m assuming you use these interchangeably?) values you are talking about an AI that doesn’t do anything the end user can’t understand the rationale of. But then I read Concrete approval-directed agents where you wrote:
Efficacy: By getting help from additional approval-directed agents, the human operator can evaluate proposals as if she were as smart as those agents. In particular, the human can evaluate the given rationale for a proposed action and determine whether the action really does what the human wants.
and this made me think that you’re also including AIs that do things whose rationale the user can merely evaluate (i.e., without being able to have an internal understanding of it, even hypothetically). Since this “evaluable” interpretation also seems more compatible with strategy-stealing (because an AI that only performs actions that a human can understand can’t “steal” a superhuman strategy), I’m currently guessing this is what you actually have in mind, at least when you’re thinking about how to make a corrigible AI competitive.
Like I mentioned above, I mostly think of narrow value learning as a substitute for imitation learning or approval-direction, realistically to be used as a distillation step rather than as your whole AI. In particular, an agent trained with narrow value learning is probably not aligned+competitive in a way that might allow you to apply this kind of strategy-stealing argument.
In concrete approval-directed agents I’m talking about a different design; it’s not related to narrow value learning.
I don’t use narrow and short-term interchangeably. I’ve only ever used “narrow” in the context of value learning, in order to make this particular distinction between two different goals you might have when doing value learning.
Ah, that clears up a lot of things for me. (I saw your earlier comment but was quite confused by it due to not realizing your narrow / short-term distinction.) One reason I thought you used “short-term” and “narrow” interchangeably is due to Act-based agents where you seemed to be doing that:
These proposals all focus on the short-term instrumental preferences of their users. [...]
What is “narrow” anyway?
There is clearly a difference between act-based agents and traditional rational agents. But it’s not entirely clear what the key difference is.
And in that post it also seemed like “narrow value learners” were meant to be the whole AI since it talked a lot about “users” of such AI.
(In that post I did use narrow in the way we are currently using short-term, contrary to my claim in the grandparent. Sorry for the confusion this caused.)