Overall, I’m hoping that we can solve “human safety problems” by training the humans supervising the AI to not have those problems, because it sure does make the technical problem of aligning AI seem a lot easier.
Note that humans play two distinct roles in IDA, and I think it’s important to separate them:
1. They are used to train corrigible reasoning, because we don’t have a sufficiently good explicit understanding of corrigible reasoning. This role could be removed if e.g. MIRI’s work on agent foundations were sufficiently successful.
2. The AI that we’ve trained is then tasked with the job of helping the user get what they “really” want, which is indirectly encoded in the user.
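As a rough, purely illustrative sketch of where those two roles enter (the names `Model`, `amplify`, `distill`, `human_overseer`, and the training loop below are hypothetical stand-ins, not anyone's actual implementation): role #1 supplies the training signal during amplification and distillation, while role #2 is the user the trained system is then asked to help.

```python
from typing import Callable, List, Tuple


class Model:
    """Stand-in for the learned question-answering system."""

    def __init__(self) -> None:
        self.examples: List[Tuple[str, str]] = []

    def answer(self, question: str) -> str:
        # Placeholder: a real model would generalize from its training examples.
        return f"<model answer to: {question}>"

    def fit(self, examples: List[Tuple[str, str]]) -> None:
        self.examples.extend(examples)


def human_overseer(question: str, model: Model) -> str:
    """Role #1: a human whose (hopefully corrigible) reasoning we imitate.
    In reality a person answers, consulting the model on subquestions; faked here."""
    _ = model.answer("a relevant subquestion for: " + question)
    return f"<overseer's answer to: {question}>"


def amplify(overseer: Callable[[str, Model], str], model: Model) -> Callable[[str], str]:
    """The overseer assisted by the current model."""
    return lambda question: overseer(question, model)


def distill(amplified: Callable[[str], str], questions: List[str]) -> Model:
    """Train a fresh model to imitate the amplified overseer."""
    new_model = Model()
    new_model.fit([(q, amplified(q)) for q in questions])
    return new_model


# Role #1: the overseer is the source of training signal for corrigible reasoning.
model = Model()
for _ in range(3):
    model = distill(amplify(human_overseer, model), ["an example training question"])

# Role #2: the trained AI is then tasked with helping the end user.
print(model.answer("What does the user 'really' want here, and how can I help?"))
```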
Solving safety problems for humans in step #1 is necessary to solve intent alignment. This likely involves both training (whether to reduce failure probabilities or to reach appropriate universality thresholds), and using humans in a way that is robust to their remaining safety problems (since it seems clear that most of them cannot be removed).
Solving safety problems for humans in step #2 is something else altogether. At this point you have a bunch of humans in the world who want AIs that are going to help them get what they want, and I don’t think it makes that much sense to talk about replacing those humans with highly-trained supervisors—the supervisors might play a role in step #1 as a way of getting an AI that is trying to help the user get what they want, but can’t replace the user themselves in step #2. I think relevant measures at this point are things like:
- Learn more about how to deliberate “correctly,” or about what kinds of pressures corrupt human values, or about how to avoid such corruption, etc. If more such understanding is available, then both AIs and humans can use it to avoid corruption. In the long run AI systems will do much more work on this problem than we will, but a lot of damage could be done between now and the time when AI systems are powerful enough to obsolete all of the thinking that we do today on this topic.
- Figure out how to build AIs that are better at tasks like “help humans clarify what they really want.” Differential progress in this area could be a huge win. (Again, in the long run all of the AI-design work will itself be done by AI systems, but lots of damage could be dealt in the interim as we deploy human-designed AIs that are particularly good at manipulation relative to helping humans clarify and realize their “real” values.)
- Change institutions/policy/environment to reduce the risk of value corruption, especially for users that don’t have strong short-term preferences about how their short-term preferences change, or who don’t have a clear picture of how their current choices will affect that. For example, the designers of potentially-manipulative technologies may be able to set defaults that make a huge difference in how humanity’s values evolve.
- You could also try to give highly wise people more influence over what actually happens, whether by acquiring resources, earning others’ trust, or whatever.
Learning from idealized humans might address this to some extent, but in many circumstances I think I would trust the real humans who are actually in those circumstances more than the idealized humans who must reason about those circumstances from afar (in their safe, familiar environment).
This objection may work for some forms of idealization, but I don’t think it holds up in general. If you think that experiencing X makes your views better, then your idealization can opt to experience X. The whole point of the idealization is that the idealized humans get to have the set of experiences that they believe are best for arriving at correct views, rather than a set of experiences that are constrained by technological feasibility / competitiveness constraints / etc.
(I agree that there can be some senses in which the idealization itself unavoidably “breaks” the idealized human—e.g. Vladimir Slepnev points out that an idealized human might conclude that they are most likely in a simulation, which may change their behavior; Wei Dai points out that they may behave selfishly towards the idealized human rather than towards the unidealized human, if selfishness is part of the values we’d converge to—but I don’t think this is one of them.)
Note that humans play two distinct roles in IDA, and I think it’s important to separate them
Yeah, I was talking entirely about the first role, thanks for the clarification.
This objection may work for some forms of idealization, but I don’t think it holds up in general. If you think that experiencing X makes your views better, then your idealization can opt to experience X.
I agree now, I misunderstood what the point of the idealization in the original point was. (I thought it was to avoid having experiences that could cause value corruption, whereas it was actually about having experiences only when ready for them.)
Wei Dai points out that they may behave selfishly towards the idealized human rather than towards the unidealized human
I think this was tangentially related to my objection: for example, that an idealized human would choose not to be waterboarded, even though that experience is important for deciding what to do for the unidealized human. Though the particular objection I wrote was based on a misunderstanding.
Note that humans play two distinct roles in IDA, and I think it’s important to separate them
This seems like a really important clarification, but in your article on corrigibility, you only ever talk about one human, the overseer, and the whole argument about “basin of attraction” seems to rely on having one human be at once the trainer for corrigibility, the target of corrigibility, and the source of preferences:
But a corrigible agent prefers to build other agents that share the overseer’s preferences — even if the agent doesn’t yet share the overseer’s preferences perfectly. After all, even if you only approximately know the overseer’s preferences, you know that the overseer would prefer the approximation get better rather than worse.
I think in that post, the overseer is training the AI to specifically be corrigible to herself, which makes the AI aligned to herself. I’m not sure what is happening in the new scheme with two humans. Is the overseer now still training the AI to be corrigible to herself, which produces an AI that’s aligned to the overseer which then helps out the user because the overseer has a preference to help out the user? Or is the overseer training the AI to be corrigible to a generic user and then “plugging in” a real user into the system at a later time? If the latter, have you checked that the “basin of attraction” argument still applies? If it does, maybe that post needs to be rewritten to make that clearer?
This seems like a really important clarification, but in your article on corrigibility, you only ever talk about one human, the overseer, and the whole argument about “basin of attraction” seems to rely on having one human be at once the trainer for corrigibility, the target of corrigibility, and the source of preferences:
Corrigibility plays a role both within amplification and in the final agent.
The post is mostly talking about the final agent without talking about IDA specifically.
The section titled Amplification is about the internal dynamics, where behavior is corrigible by the question-asker. It doesn’t seem important to me that these be the same. Corrigibility to the overseer only leads to corrigibility to the end user if the overseer is appropriately motivated. I usually imagine the overseer as something like a Google engineer and the end user as something like a visitor to google.com today. The resulting agent will likely be imperfectly corrigible because of the imperfect motives of Google engineers (this is pretty similar to human relationships around other technologies).
I’m no longer as convinced that corrigibility is the right abstraction for reasoning about internal behavior within amplification (but am still pretty convinced that it’s a good way to reason about the external behavior, and I do think “corrigible” is closer to what we want than “benign” was). I’ve been thinking about these issues recently and it will be touched on in an upcoming post.
Is the overseer now still training the AI to be corrigible to herself, which produces an AI that’s aligned to the overseer which then helps out the user because the overseer has a preference to help out the user?
This is basically right. I’m usually imagining the overseer training a general question-answering system, with the AI trained to be corrigible to the question-asker. We then use that question-answering system to implement a corrigible agent, by using it to answer questions like “What should the agent do next?” (with an appropriate specification of ‘should’), which is where external corrigibility comes in.
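A minimal sketch of that indirection, under assumed names (`qa_system` and `Agent` are hypothetical): the agent has no policy of its own, and every action is obtained by asking the trained question-answering system what the agent should do next.

```python
from typing import Callable, List


def qa_system(question: str) -> str:
    """Stand-in for the question-answering system trained by the overseer,
    corrigible to whoever is asking it questions."""
    return f"<answer to: {question}>"


class Agent:
    """An agent implemented entirely by indirection through the QA system."""

    def __init__(self, ask: Callable[[str], str]) -> None:
        self.ask = ask
        self.history: List[str] = []

    def act(self, observation: str) -> str:
        self.history.append(observation)
        # External corrigibility enters via the intended meaning of "should" here.
        return self.ask(
            f"What should the agent do next, given the history {self.history}?"
        )


agent = Agent(qa_system)
print(agent.act("the user asked the agent to pause and explain its last action"))
```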
This is basically right. I’m usually imagining the overseer training a general question-answering system, with the AI trained to be corrigible to the question-asker.
This confuses me because you’re saying “basically right” to something but then you say something that seems very different, and which actually seems closer to the other option I was suggesting. Isn’t it very different for the overseer to train the AI to be corrigible to herself as a specific individual, versus training the AI to be corrigible to whoever is asking the current question? Since the AI can’t know who is asking the current question (which seems necessary to be corrigible to them?) without that being passed in as additional information, this seems closer to ‘overseer training the AI to be corrigible to a generic user and then “plugging in” a real user into the system at a later time’.
I also have a bunch of other confusions, but it’s probably easier to talk about them after resolving this one.
(Also, just in case, is there a difference between “corrigible to” and “corrigible by”?)
this seems closer to ‘overseer training the AI to be corrigible to a generic user and then “plugging in” a real user into the system at a later time’.
The overseer asks the question “what should the agent do [to be corrigible to the Google customer Alice it is currently working for]?”, and indeed even at training time the overseer is training the system to answer this question. There is no swapping out at test time. (The distributions at train and test time are identical, and I normally talk about the version where you keep training online.)
When the user asks the agent a question, it is answered by indirection, by using the question-answering system to answer “what should the agent do [in the situation where it has been asked question Q by the user]?”
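A small sketch of this setup, again with hypothetical names (`wrap`, `model_answer`, `overseer_label`): the deployed agent answers exactly the same wrapped questions that the overseer labels during online training, so nothing is swapped out at test time.

```python
import random
from typing import List, Tuple


def wrap(user_question: str) -> str:
    """Embed the user's question in the question the system is actually trained on."""
    return ("What should the agent do in the situation where it has been "
            f"asked {user_question!r} by the user?")


def model_answer(wrapped_question: str) -> str:
    # Stand-in for the learned question-answering system.
    return f"<model answer to: {wrapped_question}>"


def overseer_label(wrapped_question: str) -> str:
    # Stand-in for the (possibly amplified) overseer answering the same question.
    return f"<overseer answer to: {wrapped_question}>"


training_data: List[Tuple[str, str]] = []
for user_question in ["How do I reset my password?", "Summarize this article."]:
    q = wrap(user_question)           # the same distribution at train and test time
    reply_to_user = model_answer(q)   # deployment: the agent answers by indirection
    if random.random() < 0.25:        # online training: the overseer labels a sample
        training_data.append((q, overseer_label(q)))
```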
The overseer asks the question “what should the agent do [to be corrigible to the Google customer Alice it is currently working for]?”
Ok, I’ve been trying to figure out what would make the most sense and came to the same conclusion. I would also note that this “corrigible” is substantially different from the “corrigible” in “the AI is corrigible to the question asker” because it has to be an explicit form of corrigibility that is limited by things like corporate policy. For example, if Alice asks “What are your design specs and source code?” or “How do I hack into this bank?” then the AI wouldn’t answer, even though it’s supposed to be “corrigible” to the user, right? Maybe we need modifiers to indicate which corrigibility we’re talking about, like “full corrigibility” vs “limited corrigibility”?
ETA: Actually, does it even make sense to use the word “corrigible” in “to be corrigible to the Google customer Alice it is currently working for”? Originally “corrigible” meant:
A corrigible agent experiences no preference or instrumental pressure to interfere with attempts by the programmers or operators to modify the agent, impede its operation, or halt its execution.
But obviously Google’s AI is not going to allow a user to “modify the agent, impede its operation, or halt its execution”. Why use “corrigible” here instead of different language altogether, like “helpful to the extent allowed by Google policies”?
(Also, just in case, is there a difference between “corrigible to” and “corrigible by”?)
No. I was just saying “corrigible by” originally because that seems more grammatical, and sometimes saying “corrigible to” because it seems more natural. Probably “to” is better.
My interpretation of what you’re saying here is that the overseer in step #1 can do a lot to shape how the AI interprets “help the user get what they really want,” so that the AI tries to eliminate human safety problems for the step #2 user (possibly entirely), but problems might still occur in the short term before the AI is able to think/act to remove those safety problems.
It seems to me that this implies that IDA essentially solves the AI alignment portion of points 1 and 2 in the original post (modulo things happening before the AI is in control).