To sum up, I think there’s a fundamental tension between corrigibility (in the sense of respecting the human user’s short-term preferences) and long-term success/competitiveness, which underlies many of the specific failure scenarios described in the OP, and worse, makes it unclear how “strategy-stealing” can work at all.
By short-term preference I don’t mean “Start a car company, I hear those are profitable,” I mean more like “Make me money, and then make sure that I remain in control of that company and its profits,” or even better “acquire flexible influence that I can use to get what I want.”
(This is probably not the response you were looking for. I’m still mostly intending to give up on communication here over the short term, because it seems too hard. If you are confused by particular things I’ve said feel free to quote them so that I can either clarify, register a disagreement, or write them off as sloppy or mistaken comments.)
By short-term preference I don’t mean “Start a car company, I hear those are profitable,”
But in your earlier writings it sure seems that’s the kind of thing that you meant, or even narrower preferences than this. Your corrigibility post said:
An act-based agent considers our short-term preferences, including (amongst others) our preference for the agent to be corrigible.
All three of these corrigible AIs deal with much narrower preferences than “acquire flexible influence that I can use to get what I want”. The narrow value learner post for example says:
The AI learns the narrower subgoals and instrumental values I am pursuing. It learns that I am trying to schedule an appointment for Tuesday and that I want to avoid inconveniencing anyone, or that I am trying to fix a particular bug without introducing new problems, etc. It does not make any effort to pursue wildly different short-term goals than I would in order to better realize my long-term values, though it may help me correct some errors that I would be able to recognize as such.
I may be misunderstanding something, and you’re probably not doing this intentionally, but it looks a lot like having a vague notion of “short-term preferences” is allowing you to equivocate between really narrow preferences when you’re trying to argue for safety, and much broader preferences when you’re trying to argue for competitiveness. Wouldn’t it be a good idea (as I’ve repeatedly suggested) to make a priority of nailing down the concept of “short-term preferences”, given how central it is to your approach?
All three of these corrigible AIs deal with much narrower preferences than “acquire flexible influence that I can use to get what I want”. The narrow value learner post for example says:
Imitation learning, approval-direction, and narrow value learning are not intended to exceed the overseer’s capabilities. These are three candidates for the distillation step in iterated distillation and amplification.
The AI we actually deploy, which I’m discussing in the OP, is produced by imitating (or learning the values of, or maximizing the approval of) an even smarter AI—whose valuations of resources reflect everything that unaligned AIs know about which resources will be helpful.
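For readers less familiar with the iterated distillation and amplification framing, here is a minimal toy sketch of where those three candidates slot in. All of the function names and the “training” stand-ins are illustrative assumptions of mine, not anything from Paul’s posts; the point is only that the three candidates differ in the training signal used for distillation, and none of them is meant to exceed the overseer.

```python
from typing import Callable

Agent = Callable[[str], str]

def human(question: str) -> str:
    # Stand-in for the unaided human overseer's judgment.
    return f"human answer to: {question}"

def amplify(agent: Agent) -> Agent:
    # Amplification: the human answers with help from copies of the current agent.
    def overseer(question: str) -> str:
        assistant_view = agent(f"subquestion relevant to: {question}")
        return f"human judgment on {question!r} given {assistant_view!r}"
    return overseer

def distill(overseer: Agent, objective: str) -> Agent:
    # Distillation: train a fast agent to approximate the (slow) overseer.
    # The three candidates quoted above differ only in the training signal.
    if objective == "imitation":
        return lambda q: overseer(q)  # imitate the overseer's outputs directly
    if objective == "approval":
        # pick whichever candidate answer the overseer rates more highly (toy scoring)
        return lambda q: max(("option A", "option B"),
                             key=lambda a: len(overseer(f"rate {a} as an answer to {q}")))
    if objective == "narrow_value_learning":
        # pursue the narrow subgoals/values the overseer is pursuing for this task
        return lambda q: overseer(f"what narrow subgoal applies to {q}?")
    raise ValueError(f"unknown objective: {objective}")

def iterate_ida(rounds: int, objective: str) -> Agent:
    agent: Agent = human
    for _ in range(rounds):
        agent = distill(amplify(agent), objective)
    return agent

print(iterate_ida(rounds=2, objective="imitation")("schedule an appointment for Tuesday"))
```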
Corrigibility is about short-term preferences-on-reflection. I see how this is confusing. Note that the article doesn’t make sense at all when interpreted in the other way. For example, the user can’t even tell whether they are in control of the situation, so what does it mean to talk about their preference to be in control of the situation if these aren’t supposed to be preferences-on-reflection? (Similarly for “preference to be well-informed” and so on.) The desiderata discussed in the original corrigibility post seem basically the same as the user not being able to tell what resources will help them achieve their long-term goals, but still wanting the AI to accumulate those resources.
I also think the act-based agents post is correct if “preferences” means preferences-on-reflection. It’s just that the three approaches listed at the top are limited to the capabilities of the overseer. I think that distinguishing between preferences-as-elicited and preferences-on-reflection is the most important thing to disambiguate here. I usually use “preference” to mean preference-on-idealized-reflection (or whatever “actual preference” should mean, acknowledging that we don’t have a real ground truth definition), which I think is the more typical usage. I’d be fine with suggestions for disambiguation.
If there’s somewhere else I’ve equivocated in the way you suggest, then I’m happy to correct it. It seems like a thing I might have done in a way that introduces an error. I’d be surprised if it hides an important problem (I think the big problems in my proposal are lurking other places, not here), and in the corrigibility post I think I have these concepts straight.
One thing you might have in mind is the following kind of comment:
If on average we are unhappy with the level of corrigibility of a benign act-based agent, then by construction it is mistaken about our short-term preferences.
That is, you might be concerned: “the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy.” I’m saying that you shouldn’t expect this to happen, if the AI is well-calibrated and has enough of an understanding of humans to understand e.g. this discussion we are currently having—if it decides not to be corrigible, we should expect it to be right on average.
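As a toy numerical illustration of the “right on average” claim (my own construction; the 0.9 threshold and the uniform distribution are arbitrary assumptions): a well-calibrated agent that only overrides the elicited preference when it is very confident will, on average, be vindicated by the user-on-reflection. Miscalibration is what would break this.

```python
import random

random.seed(0)
trials = 100_000
outcomes = []

for _ in range(trials):
    # True chance that the user-on-reflection endorses overriding the elicited preference.
    p_endorse = random.random()
    # Idealization of calibration: the agent's estimate equals the true chance.
    estimate = p_endorse
    if estimate > 0.9:  # the agent departs from corrigible behavior only when very confident
        outcomes.append(random.random() < p_endorse)

print(f"overrides attempted: {len(outcomes)}")
print(f"endorsed on reflection: {sum(outcomes) / len(outcomes):.3f}")  # ~0.95 on average
```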
Like Wei Dai, I am also finding this discussion pretty confusing. To summarize my state of confusion, I came up with the following list of ways in which preferences can be short or long:
(1) time horizon and time discounting: how far in the future is the preference about? More generally, how much weight do we place on the present vs the future?
(2) act-based (“short”) vs goal-based (“long”): using the human’s (or more generally, the human-plus-AI-assistants’; see (6) below) estimate of the value of the next action (act-based) or doing more open-ended optimization of the future based on some goal, e.g. using a utility function (goal-based)
(3) amount of reflection the human has undergone: “short” would be the current human (I think this is what you call “preferences-as-elicited”), and this would get “longer” as we give the human more time to think, with something like CEV/Long Reflection/Great Deliberation being the “longest” in this sense (I think this is what you call “preference-on-idealized-reflection”). This sense further breaks down into whether the human itself is actually doing the reflection, or if the AI is instead predicting what the human would think after reflection.
(4) how far the search happens: “short” would be a limited search (that lacks insight/doesn’t see interesting consequences) and “long” would be a search that has insight/sees interesting consequences. This is a distinction you made in a discussion with Eliezer a while back. This distinction also isn’t strictly about preferences, but rather about how one would achieve those preferences.
(5) de dicto (“short”) vs de re (“long”): This is a distinction you made in this post. I think this is the same distinction as (2) or (3), but I’m not sure which. (But if my interpretation of you below is correct, I guess this must be the same as (2) or else a completely different distinction.)
(6) understandable (“short”) vs evaluable (“long”): A course of action is understandable if the human (without any AI assistants) can understand the rationale behind it; a course of action is evaluable if there is some procedure the human can implement to evaluate the rationale using AI assistants. I guess there is also a “not even evaluable” option here that is even “longer”. (Thanks to Wei Dai for bringing up this distinction, although I may have misunderstood the actual distinction.)
My interpretation is that when you say “short-term preferences-on-reflection”, you mean short in sense (1), except when the AI needs to gather resources, in which case either the human or the AI will need to do more long-term planning; short in sense (2); long in sense (3), with the AI predicting what the human would think after reflection; long in sense (4); short in sense (5); long in sense (6). Does this sound right to you? If not, I think it would help me a lot if you could “fill in the list” with which of short or long you choose for each point.
Assuming my interpretation is correct, my confusion is that you say we shouldn’t expect a situation where “the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy” (I take you to be talking about sense (3) from above). It seems like the user-on-reflection and the current user would disagree about many things (that is the whole point of reflection), so if the AI acts in accordance with the intentions of the user-on-reflection, the current user is likely to end up unhappy.
By “short” I mean short in sense (1) and (2). “Short” doesn’t imply anything about senses (3), (4), (5), or (6) (and “short” and “long” don’t seem like good words to describe those axes, though I’ll keep using them in this comment for consistency).
By “preferences-on-reflection” I mean long in sense (3) and neither in sense (6). There is a hypothesis that “humans with AI help” is a reasonable way to capture preferences-on-reflection, but they aren’t defined to be the same. I don’t use understandable and evaluable in this way.
I think (4) and (5) are independent axes. (4) just sounds like “is your AI good at optimizing,” not a statement about what it’s optimizing. In the discussion with Eliezer I’m arguing against it being linked to any of these other axes. (5) is a distinction about two senses in which an AI can be “optimizing my short-term preferences-on-reflection”.
When discussing perfect estimations of preferences-on-reflection, I don’t think the short vs. long distinction is that important. “Short” is mostly important when talking about ways in which an AI can fall short of perfectly estimating preferences-on-reflection.
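To keep the six axes straight, here is my reading of the answer above collected in one place; this is just a restatement of the exchange, not an authoritative table, and the short labels are mine.

```python
# My hedged summary of how the six axes get resolved for
# "short-term preferences-on-reflection" in this exchange.
axes = {
    1: ("time horizon / discounting",  "short"),
    2: ("act-based vs goal-based",     "short (read as equivalent to 1 here)"),
    3: ("amount of reflection",        "long: this is what 'on-reflection' adds"),
    4: ("depth/insight of search",     "independent axis: about optimization power, not preferences"),
    5: ("de dicto vs de re",           "independent axis: two senses of optimizing the same preferences"),
    6: ("understandable vs evaluable", "neither: not how these terms are used here"),
}

for index, (axis, reading) in axes.items():
    print(f"({index}) {axis}: {reading}")
```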
Assuming my interpretation is correct, my confusion is that you say we shouldn’t expect a situation where “the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy” (I take you to be talking about sense (3) from above). It seems like the user-on-reflection and the current user would disagree about many things (that is the whole point of reflection), so if the AI acts in accordance with the intentions of the user-on-reflection, the current user is likely to end up unhappy.
I introduced the term “preferences-on-reflection” in the previous comment to make a particular distinction. It’s probably better to say something like “actual preferences” (though this is also likely to be misinterpreted). The important property is that I’d prefer to have an AI that satisfies my actual preferences than to have any other kind of AI. We could also say “better by my lights” or something else.
There’s a hypothesis that “what I’d say after some particular idealized process of reflection” is a reasonable way to capture “actual preferences,” but I think that’s up for debate—e.g. it could fail if me-on-reflection is selfish and has values opposed to current-me, and certainly it could fail for any particular process of reflection and so it might just happen to be the case that there is no process of reflection that satisfies it.
The claim I usually make is that “what I’d say after some particular idealized process of reflection” describes the best mechanism we can hope to find for capturing “actual preferences,” because whatever else we might do to capture “actual preferences” can just be absorbed into that process of reflection.
“Actual preferences” is a pretty important concept here; I don’t think we could get around the need for it. I’m not sure whether there is disagreement about the concept itself or just about the term being used for it.
I’m really confused why “short” would include sense (1) rather than only sense (2). If “corrigibility is about short-term preferences-on-reflection” then this seems to be a claim that a corrigible AI should understand us as preferring to eat candy and junk food, because on reflection we do like how it tastes; we just choose not to eat it because of longer-term concerns. So a corrigible system ignores the longer-term concerns and interprets us as wanting candy and junk food.
Perhaps you intend sense (1) where “short” means ~100 years, rather than ~10 minutes, so that the system doesn’t interpret us as wanting candy and junk food. But this similarly creates problems when we think longer than 100 years; the system wouldn’t take those thoughts seriously.
It seems much more sensible to me for “short” in the context of this discussion to mean (2) only. But perhaps I misunderstood something.
One of us just misunderstood (1), I don’t think there is any difference.
I mean preferences about what happens over the near future, but the way I rank “what happens in the near future” will likely be based on its consequences (further in the future, in other possible worlds, etc.). So I took (1) to be basically equivalent to (2).
“Terminal preferences over the near future” is not a thing I often think about and I didn’t realize it was a candidate interpretation (normally when I write about short-term preferences I’m writing about things like control, knowledge, and resource acquisition).
The reason I brought up this distinction was that in Ambitious vs. narrow value learning you wrote:
It does not make any effort to pursue wildly different short-term goals than I would in order to better realize my long-term values, though it may help me correct some errors that I would be able to recognize as such.
which made me think that when you say “short-term” or “narrow” (I’m assuming you use these interchangeably?) values you are talking about an AI that doesn’t do anything the end user can’t understand the rationale of. But then I read Concrete approval-directed agents where you wrote:
Efficacy: By getting help from additional approval-directed agents, the human operator can evaluate proposals as if she were as smart as those agents. In particular, the human can evaluate the given rationale for a proposed action and determine whether the action really does what the human wants.
and this made me think that you’re also including AIs that do things that the user can merely evaluate the rationale of (i.e., not be able to have an internal understanding of, even hypothetically). Since this “evaluable” interpretation also seems more compatible with strategy-stealing (because an AI that only performs actions that a human can understand can’t “steal” a superhuman strategy), I’m currently guessing this is what you actually have in mind, at least when you’re thinking about how to make a corrigible AI competitive.
Like I mentioned above, I mostly think of narrow value learning as a substitute for imitation learning or approval-direction, realistically to be used as a distillation step rather than as your whole AI. In particular, an agent trained with narrow value learning is probably not aligned+competitive in a way that would allow you to apply this kind of strategy-stealing argument.
In concrete approval-directed agents I’m talking about a different design, it’s not related to narrow value learning.
I don’t use narrow and short-term interchangeably. I’ve only ever used it in the context of value learning, in order to make this particular distinction between two different goals you might have when doing value learning.
Ah, that clears up a lot of things for me. (I saw your earlier comment but was quite confused by it due to not realizing your narrow / short-term distinction.) One reason I thought you used “short-term” and “narrow” interchangeably is due to Act-based agents where you seemed to be doing that:
These proposals all focus on the short-term instrumental preferences of their users. [...]
What is “narrow” anyway?
There is clearly a difference between act-based agents and traditional rational agents. But it’s not entirely clear what the key difference is.
And in that post it also seemed like “narrow value learners” were meant to be the whole AI since it talked a lot about “users” of such AI.
(In that post I did use narrow in the way we are currently using short-term, contrary to my claim in the grandparent. Sorry for the confusion this caused.)
(BTW Paul, if you’re reading this, Issa and I and a few others have been chatting about this on MIRIxDiscord. I’m sure you’re more than welcome to join if you’re interested, but I figured you probably don’t have time for it. PM me if you do want an invite.)
Issa, I think my current understanding of what Paul means is roughly the same as yours, and I also share your confusion about “the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy”.
To summarize my own understanding (quoting myself from the Discord), what Paul means by “satisfying short-term preferences-on-reflection” seems to cash out as “do the action for which the AI can produce an explanation such that a hypothetical human would evaluate it as good (possibly using other AI assistants), with the evaluation procedure itself being the result of a hypothetical deliberation which is controlled by the preferences-for-deliberation that the AI learned/inferred from a real human.”
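A rough procedural rendering of that cash-out, as a sketch: this is my paraphrase, every name is hypothetical, and the “hypothetical deliberation” is compressed into a scoring function rather than modeled in any detail.

```python
from typing import Callable, List

def choose_action(
    candidate_actions: List[str],
    explain: Callable[[str], str],                 # the AI's explanation for an action
    infer_deliberation_prefs: Callable[[], str],   # preferences-for-deliberation learned from the real human
    run_hypothetical_deliberation: Callable[[str], Callable[[str], float]],
) -> str:
    # 1. Learn from the real human how deliberation/evaluation should be conducted.
    deliberation_prefs = infer_deliberation_prefs()
    # 2. The hypothetical deliberation (possibly involving AI assistants) yields an
    #    evaluator, represented here as a function scoring explanations.
    evaluate = run_hypothetical_deliberation(deliberation_prefs)
    # 3. Take the action whose explanation that evaluator rates highest.
    return max(candidate_actions, key=lambda action: evaluate(explain(action)))

# Toy usage with stand-ins:
best = choose_action(
    ["acquire flexible influence", "spend all resources now"],
    explain=lambda a: f"rationale for {a}",
    infer_deliberation_prefs=lambda: "reflect carefully, consult assistants",
    run_hypothetical_deliberation=lambda prefs: (lambda explanation: float(len(explanation))),
)
print(best)
```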
(I still have other confusions around this. For example is the “hypothetical human” here (the human being predicted in Issa’s 3) a hypothetical end user evaluating the action based on what they themselves want, or is it a hypothetical overseer evaluating the action based on what the overseer thinks the end user wants? Or is the “hypothetical human” just a metaphor for some abstract, distributed, or not recognizably-human deliberative/evaluative process at this point?)
Thanks to Wei Dai for bringing up this distinction, although I may have misunderstood the actual distinction.
I think maybe it would make sense to further break (6) down into 2 sub-dimensions: (6a) understandable vs evaluable and (6b) how much AI assistance. “Understandable” means the human achieves an understanding of the (outer/main) AI’s rationale for action within their own brain, with or without (other) AI assistance (which can for example answer questions for the human or give video lectures, etc.). And “evaluable” means the human runs or participates in a procedure that returns a score for how good the action is, but doesn’t necessarily achieve a holistic understanding of the rationale in their own brain. (If the external procedure involves other real or hypothetical humans, then it gets fuzzy but basically I want to rule out Chinese Room scenarios as “understandable”.) Based on https://ai-alignment.com/concrete-approval-directed-agents-89e247df7f1b I’m guessing Paul has “evaluable” and “with AI assistance” in mind here. (In other words I agree with what you mean by “long in sense (6)”.)
Corrigibility is about short-term preferences-on-reflection.
Now that I (hopefully) better understand what you mean by “short-term preferences-on-reflection” my next big confusion (that hopefully can be cleared up relatively easily) is that this version of “corrigibility” seems very different from the original MIRI/Armstrong “corrigibility”. (You cited that paper as a narrower version of your corrigibility in your Corrigibility post, but it actually seems completely different to me at this point.) Here’s the MIRI definition (from the abstract):
We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences.
As I understand it, the original motivation for corrigibility_MIRI was to make sure that someone can always physically press the shutdown button, and the AI would shut off. But if a corrigible_Paul AI thinks (correctly or incorrectly) that my preferences-on-reflection (or “true” preferences) are to let the AI keep running, it will act against my (actual physical) attempts to shut down the AI, and therefore it’s not corrigible_MIRI.
Do you agree with this, and if so can you explain whether your concept of corrigibility evolved over time (e.g., are there older posts where “corrigibility” referred to a concept closer to corrigibility_MIRI), or was it always about “short-term preferences-on-reflection”?
Here’s a longer definition of “corrigible” from the body of MIRI’s paper (which also seems to support my point):
We say that an agent is “corrigible” if it tolerates or assists many forms of outside correction, including at least the following: (1) A corrigible reasoner must at least tolerate and preferably assist the programmers in their attempts to alter or turn off the system. (2) It must not attempt to manipulate or deceive its programmers, despite the fact that most possible choices of utility functions would give it incentives to do so. (3) It should have a tendency to repair safety measures (such as shutdown buttons) if they break, or at least to notify programmers that this breakage has occurred. (4) It must preserve the programmers’ ability to correct or shut down the system (even as the system creates new subsystems or self-modifies).
As I understand it, the original motivation for corrigibility_MIRI was to make sure that someone can always physically press the shutdown button, and the AI would shut off. But if a corrigible_Paul AI thinks (correctly or incorrectly) that my preferences-on-reflection (or “true” preferences) are to let the AI keep running, it will act against my (actual physical) attempts to shut down the AI, and therefore it’s not corrigible_MIRI.
Note that “corrigible” is not synonymous with “satisfying my short-term preferences-on-reflection” (that’s why I said: “our short-term preferences, including (amongst others) our preference for the agent to be corrigible.”)
I’m just saying that when we talk about concepts like “remain in control” or “become better informed” or “shut down,” those all need to be taken as concepts-on-reflection. We’re not satisfying current-Paul’s judgment of “did I remain in control?”; we’re satisfying the on-reflection notion of “did I remain in control?”
Whether an act-based agent is corrigible depends on our preferences-on-reflection (this is why the corrigibility post says that act-based agents “can be corrigible”). It may be that our preferences-on-reflection are for an agent to not be corrigible. That said, it seems to me that for robustness reasons we may want to enforce corrigibility in all cases, even if it’s not what we’d prefer-on-reflection.
That said, even without any special measures, saying “corrigibility is relatively easy to learn” is still an important argument about the behavior of our agents, since it hopefully means that either (i) our agents will behave corrigibly, (ii) our agents will do something better than behaving corrigibly, according to our preferences-on-reflection, or (iii) our agents are making a predictable mistake in optimizing our preferences-on-reflection (which might be ruled out by them simply being smart enough and understanding the kinds of argument we are currently making).
By “corrigible” I think we mean “corrigible by X” with the X implicit. It could be “corrigible by some particular physical human.”
Note that “corrigible” is not synonymous with “satisfying my short-term preferences-on-reflection” (that’s why I said: “our short-term preferences, including (amongst others) our preference for the agent to be corrigible.”)
Ah, ok. I think in this case my confusion was caused by not having a short term for “satisfying X’s short-term preferences-on-reflection” so I started thinking that “corrigible” meant this. (Unless there is a term for this that I missed? Is “act-based” synonymous with this? I guess not, because “act-based” seems broader and isn’t necessarily about “preferences-on-reflection”?)
That said, even without any special measures, saying “corrigibility is relatively easy to learn” is still an important argument about the behavior of our agents, since it hopefully means that either [...]
Now that I understand “corrigible” isn’t synonymous with “satisfying my short-term preferences-on-reflection”, “corrigibility is relatively easy to learn” doesn’t seem enough to imply these things, because we also need “reflection or preferences-for-reflection are relatively easy to learn” (otherwise the AI might correctly learn that the user currently wants corrigibility, but learn the wrong way to do reflection and incorrectly conclude that the user-on-reflection doesn’t want corrigibility) and also “it’s relatively easy to point the AI to the intended person whose reflection it should infer/extrapolate” (e.g., it’s not pointing to a user who exists in some alien simulation, or modeling the user’s mind-state incorrectly and thereby beginning the reflection process from a wrong starting point). These other things don’t seem obviously true, and I’m not sure if they’ve been defended/justified or even explicitly stated.
I think this might be another reason for my confusion, because if “corrigible” was synonymous with “satisfying my short-term preferences-on-reflection” then “corrigibility is relatively easy to learn” would seem to imply these things.
Now that I understand “corrigible” isn’t synonymous with “satisfying my short-term preferences-on-reflection”, “corrigibility is relatively easy to learn” doesn’t seem enough to imply these things
I agree that you still need the AI to be trying to do the right thing (even though we don’t e.g. have any clear definition of “the right thing”), and that seems like the main way that you are going to fail.