Like Wei Dai, I am also finding this discussion pretty confusing. To summarize my state of confusion, I came up with the following list of ways in which preferences can be short or long:
1. time horizon and time discounting: how far in the future is the preference about? More generally, how much weight do we place on the present vs the future?
2. act-based (“short”) vs goal-based (“long”): using the human’s (or more generally, the human-plus-AI-assistants’; see (6) below) estimate of the value of the next action (act-based) or doing more open-ended optimization of the future based on some goal, e.g. using a utility function (goal-based)
3. amount of reflection the human has undergone: “short” would be the current human (I think this is what you call “preferences-as-elicited”), and this would get “longer” as we give the human more time to think, with something like CEV/Long Reflection/Great Deliberation being the “longest” in this sense (I think this is what you call “preference-on-idealized-reflection”). This sense further breaks down into whether the human itself is actually doing the reflection, or if the AI is instead predicting what the human would think after reflection.
4. how far the search happens: “short” would be a limited search (that lacks insight/doesn’t see interesting consequences) and “long” would be a search that has insight/sees interesting consequences. This is a distinction you made in a discussion with Eliezer a while back. This distinction also isn’t strictly about preferences, but rather about how one would achieve those preferences.
5. de dicto (“short”) vs de re (“long”): This is a distinction you made in this post. I think this is the same distinction as (2) or (3), but I’m not sure which. (But if my interpretation of you below is correct, I guess this must be the same as (2) or else a completely different distinction.)
6. understandable (“short”) vs evaluable (“long”): A course of action is understandable if the human (without any AI assistants) can understand the rationale behind it; a course of action is evaluable if there is some procedure the human can implement to evaluate the rationale using AI assistants. I guess there is also a “not even evaluable” option here that is even “longer”. (Thanks to Wei Dai for bringing up this distinction, although I may have misunderstood the actual distinction.)
My interpretation is that when you say “short-term preferences-on-reflection”, you mean short in sense (1), except when the AI needs to gather resources, in which case either the human or the AI will need to do more long-term planning; short in sense (2); long in sense (3), with the AI predicting what the human would think after reflection; long in sense (4); short in sense (5); long in sense (6). Does this sound right to you? If not, I think it would help me a lot if you could “fill in the list” with which of short or long you choose for each point.
Assuming my interpretation is correct, my confusion is that you say we shouldn’t expect a situation where “the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy” (I take you to be talking about sense (3) from above). It seems like the user-on-reflection and the current user would disagree about many things (that is the whole point of reflection), so if the AI acts in accordance with the intentions of the user-on-reflection, the current user is likely to end up unhappy.
By “short” I mean short in sense (1) and (2). “Short” doesn’t imply anything about senses (3), (4), (5), or (6) (and “short” and “long” don’t seem like good words to describe those axes, though I’ll keep using them in this comment for consistency).
By “preferences-on-reflection” I mean long in sense (3) and neither in sense (6). There is a hypothesis that “humans with AI help” is a reasonable way to capture preferences-on-reflection, but they aren’t defined to be the same. I don’t use understandable and evaluable in this way.
I think (4) and (5) are independent axes. (4) just sounds like “is your AI good at optimizing,” not a statement about what it’s optimizing. In the discussion with Eliezer I’m arguing against it being linked to any of these other axes. (5) is a distinction between two senses in which an AI can be “optimizing my short-term preferences-on-reflection.”
When discussing perfect estimations of preferences-on-reflection, I don’t think the short vs. long distinction is that important. “Short” is mostly important when talking about ways in which an AI can fall short of perfectly estimating preferences-on-reflection.
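As a bookkeeping aid, the interpretation proposed above and the reply to it can be laid side by side. The sketch below is only a paraphrase of the two comments, not anything either author wrote: the names `SENSES`, `ISSA_GUESS`, `PAUL_ANSWER`, and `disagreements` are made up for illustration, `None` marks an axis that “short” is said not to constrain, and the caveat about resource gathering in sense (1) is left out.

```python
# Illustrative bookkeeping only; labels paraphrase the list of six senses above.
SENSES = {
    1: "time horizon / time discounting",
    2: "act-based vs goal-based",
    3: "amount of reflection",
    4: "how far the search happens",
    5: "de dicto vs de re",
    6: "understandable vs evaluable",
}

# The proposed interpretation of "short-term preferences-on-reflection".
ISSA_GUESS = {1: "short", 2: "short", 3: "long", 4: "long", 5: "short", 6: "long"}

# The reply: "short" covers (1) and (2); "on reflection" is long in (3) and
# "neither" in (6); (4) and (5) are treated as independent axes (None here).
PAUL_ANSWER = {1: "short", 2: "short", 3: "long", 4: None, 5: None, 6: "neither"}


def disagreements(guess, answer):
    """Return the senses on which the two assignments differ, in order."""
    return [k for k in sorted(guess) if guess[k] != answer[k]]


for k in disagreements(ISSA_GUESS, PAUL_ANSWER):
    print(f"({k}) {SENSES[k]}: guessed {ISSA_GUESS[k]!r}, answered {PAUL_ANSWER[k]!r}")
```

On this reading the two assignments agree on senses (1) through (3) and differ only on (4), (5), and (6).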
Assuming my interpretation is correct, my confusion is that you say we shouldn’t expect a situation where “the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy” (I take you to be talking about sense (3) from above). It seems like the user-on-reflection and the current user would disagree about many things (that is the whole point of reflection), so if the AI acts in accordance with the intentions of the user-on-reflection, the current user is likely to end up unhappy.
I introduced the term “preferences-on-reflection” in the previous comment to make a particular distinction. It’s probably better to say something like “actual preferences” (though this is also likely to be misinterpreted). The important property is that I’d prefer to have an AI that satisfies my actual preferences than to have any other kind of AI. We could also say “better by my lights” or something else.
There’s a hypothesis that “what I’d say after some particular idealized process of reflection” is a reasonable way to capture “actual preferences,” but I think that’s up for debate—e.g. it could fail if me-on-reflection is selfish and has values opposed to current-me, and certainly it could fail for any particular process of reflection and so it might just happen to be the case that there is no process of reflection that satisfies it.
The claim I usually make is that “what I’d say after some particular idealized process of reflection” describes the best mechanism we can hope to find for capturing “actual preferences,” because whatever else we might do to capture “actual preferences” can just be absorbed into that process of reflection.
“Actual preferences” is a pretty important concept here, and I don’t think we could get around the need for it. I’m not sure if there is disagreement about this concept or just about the term being used for it.
I’m really confused why “short” would include sense (1) rather than only sense (2). If “corrigibility is about short-term preferences on reflection” then this seems to be a claim that corrigible AI should understand us as preferring to eat candy and junk food, because on reflection we do like how it tastes; we just choose not to eat it because of longer-term concerns. On that reading, a corrigible system ignores the longer-term concerns and interprets us as wanting candy and junk food.
Perhaps you intend sense (1) where “short” means ~100 years, rather than ~10 minutes, so that the system doesn’t interpret us as wanting candy and junk food. But this similarly creates problems when we think longer than 100 years; the system wouldn’t take those thoughts seriously.
It seems much more sensible to me for “short” in the context of this discussion to mean (2) only. But perhaps I misunderstood something.
One of us just misunderstood (1); I don’t think there is any difference.
I mean preferences about what happens over the near future, but the way I rank “what happens in the near future” will likely be based on its consequences (further in the future, in other possible worlds, etc.). So I took (1) to be basically equivalent to (2).
“Terminal preferences over the near future” is not a thing I often think about and I didn’t realize it was a candidate interpretation (normally when I write about short-term preferences I’m writing about things like control, knowledge, and resource acquisition).
The reason I brought up this distinction was that in Ambitious vs. narrow value learning you wrote:
It does not make any effort to pursue wildly different short-term goals than I would in order to better realize my long-term values, though it may help me correct some errors that I would be able to recognize as such.
which made me think that when you say “short-term” or “narrow” (I’m assuming you use these interchangeably?) values you are talking about an AI that doesn’t do anything the end user can’t understand the rationale of. But then I read Concrete approval-directed agents where you wrote:
Efficacy: By getting help from additional approval-directed agents, the human operator can evaluate proposals as if she were as smart as those agents. In particular, the human can evaluate the given rationale for a proposed action and determine whether the action really does what the human wants.
and this made me think that you’re also including AIs that do things that the user can merely evaluate the rationale of (i.e., not be able to have an internal understanding of, even hypothetically). Since this “evaluable” interpretation also seems more compatible with strategy-stealing (because an AI that only performs actions that a human can understand can’t “steal” a superhuman strategy), I’m currently guessing this is what you actually have in mind, at least when you’re thinking about how to make a corrigible AI competitive.
Like I mentioned above, I mostly think of narrow value learning as a substitute for imitation learning or approval-direction, realistically to be used as a distillation step rather than as your whole AI. In particular, an agent trained with narrow value learning is probably not aligned+competitive in a way that might allow you to apply this kind of strategy-stealing argument.
In Concrete approval-directed agents I’m talking about a different design; it’s not related to narrow value learning.
I don’t use narrow and short-term interchangeably. I’ve only ever used “narrow” in the context of value learning, in order to make this particular distinction between two different goals you might have when doing value learning.
Ah, that clears up a lot of things for me. (I saw your earlier comment but was quite confused by it due to not realizing your narrow / short-term distinction.) One reason I thought you used “short-term” and “narrow” interchangeably is due to Act-based agents where you seemed to be doing that:
These proposals all focus on the short-term instrumental preferences of their users. [...]
What is “narrow” anyway?
There is clearly a difference between act-based agents and traditional rational agents. But it’s not entirely clear what the key difference is.
And in that post it also seemed like “narrow value learners” were meant to be the whole AI since it talked a lot about “users” of such AI.
(In that post I did use narrow in the way we are currently using short-term, contrary to my claim in the grandparent. Sorry for the confusion this caused.)
(BTW Paul, if you’re reading this, Issa and I and a few others have been chatting about this on MIRIxDiscord. I’m sure you’re more than welcome to join if you’re interested, but I figured you probably don’t have time for it. PM me if you do want an invite.)
Issa, I think my current understanding of what Paul means is roughly the same as yours, and I also share your confusion about “the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy”.
To summarize my own understanding (quoting myself from the Discord), what Paul means by “satisfying short-term preferences-on-reflection” seems to cash out as “do the action for which the AI can produce an explanation such that a hypothetical human would evaluate it as good (possibly using other AI assistants), with the evaluation procedure itself being the result of a hypothetical deliberation which is controlled by the preferences-for-deliberation that the AI learned/inferred from a real human.”
(I still have other confusions around this. For example, is the “hypothetical human” here (the human being predicted in Issa’s (3)) a hypothetical end user evaluating the action based on what they themselves want, or is it a hypothetical overseer evaluating the action based on what the overseer thinks the end user wants? Or is the “hypothetical human” just a metaphor for some abstract, distributed, or not recognizably-human deliberative/evaluative process at this point?)
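To make the shape of that summary concrete, here is a toy sketch of “do the action whose explanation a hypothetical evaluator scores best, where the evaluator is itself derived from learned preferences-for-deliberation.” Everything in it is an assumption made for illustration: `derive_evaluator`, `choose_action`, the keyword-weight scoring, and the example actions are invented here and are not a description of any actual proposal or system. In particular, the real “hypothetical deliberation” is nothing like a keyword lookup; the point is only the two-level structure.

```python
from typing import Callable, Dict, Iterable

# An evaluator maps (action, explanation) to a score.
Evaluator = Callable[[str, str], float]


def derive_evaluator(deliberation_prefs: Dict[str, float]) -> Evaluator:
    """Toy stand-in for the hypothetical deliberation that yields an evaluation procedure.

    Assumption: the learned preferences-for-deliberation are just keyword weights.
    """
    def evaluate(action: str, explanation: str) -> float:
        words = explanation.lower().split()
        return sum(weight for key, weight in deliberation_prefs.items() if key in words)
    return evaluate


def choose_action(candidates: Iterable[str],
                  explain: Callable[[str], str],
                  evaluate: Evaluator) -> str:
    """Pick the candidate whose explanation the evaluator scores highest."""
    return max(candidates, key=lambda action: evaluate(action, explain(action)))


# Hypothetical inputs, invented for the example.
prefs = {"reversible": 1.0, "deceptive": -5.0}
explanations = {
    "ask_first": "reversible and keeps the user informed",
    "act_silently": "fast but deceptive",
}

chosen = choose_action(explanations, explanations.get, derive_evaluator(prefs))
print(chosen)
```

The open question in the parenthetical above corresponds to who or what `evaluate` is standing in for: an end user, an overseer, or some more abstract process.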
Thanks to Wei Dai for bringing up this distinction, although I may have misunderstood the actual distinction.
I think maybe it would make sense to further break (6) down into 2 sub-dimensions: (6a) understandable vs evaluable and (6b) how much AI assistance. “Understandable” means the human achieves an understanding of the (outer/main) AI’s rationale for action within their own brain, with or without (other) AI assistance (which can for example answer questions for the human or give video lectures, etc.). And “evaluable” means the human runs or participates in a procedure that returns a score for how good the action is, but doesn’t necessarily achieve a holistic understanding of the rationale in their own brain. (If the external procedure involves other real or hypothetical humans, then it gets fuzzy but basically I want to rule out Chinese Room scenarios as “understandable”.) Based on https://ai-alignment.com/concrete-approval-directed-agents-89e247df7f1b I’m guessing Paul has “evaluable” and “with AI assistance” in mind here. (In other words I agree with what you mean by “long in sense (6)”.)