Attempted rewrite of this post in my language:
We might worry that if any particular human got a lot of power, or was able to think a lot faster, then there’s a decent chance that they do something that we would consider bad. Perhaps power corrupts them, or perhaps they get so excited about the potential technologies they can develop that they do so without thinking seriously about the consequences. We now have the opportunity to design AI systems that operate more cautiously, that aren’t prone to the same biases of reasoning and heuristics that we are, such that the future actually goes better than it would if we magically made humans more intelligent.
If it’s too hard to make AI systems in this way and we need to have them learn goals from humans, we could at least have them learn from idealized humans rather than real ones. Human values don’t extrapolate well—just look at the myriad answers that people give to various hypotheticals like the trolley problem. So, it’s better to learn from humans that are kept in a safe, familiar environment with all their basic needs taken care of. These are our idealized humans—and in practice the AI system would learn preferences from real humans, since that should be a very good indicator of the preferences of idealized humans. But if real humans end up in situations they’ve never encountered before, and start contradicting each other a lot, then the AI system can ignore these “corrupted” values and try to infer what the idealized humans would say about the situation.
More generally, it seems important for our AI systems to help us figure out what we care about before we make drastic and irreversible changes to our environment, especially changes that prevent us from figuring out what we care about. For example, if we create a hedonic paradise where everyone is on side-effect-free recreational drugs all the time, it seems unlikely that we would check whether this is actually what we wanted. This suggests that we need to work on AI systems that differentially advance our philosophical capabilities relative to other capabilities, such as technological ones.
And my actual comment on the post:
It seems quite difficult to me to build AI systems that are safe, without having them rely on humans in some way to determine what is and is not good behavior. As a result, I’m pessimistic about our chances at creating AI systems that can address “human safety problems”. Learning from idealized humans might address this to some extent, but in many circumstances I think I would trust the real humans who are actually in those circumstances more than the idealized humans who must reason about those circumstances from afar (in their safe, familiar environment).
I do think we want to have a general approach where we try to figure out how AIs and humans should reason, such that the resulting system behaves well. On the human side, this might mean that the human needs to be more cautious over longer timescales, or to have more epistemic and moral humility. Idealized humans can be thought of as an instance of this approach where rather than change the policy of humans in reality, we change their policy in a hypothetical.
Overall, I’m hoping that we can solve “human safety problems” by training the humans supervising the AI to not have those problems, because it sure does make the technical problem of aligning AI seem a lot easier.
Overall, I’m hoping that we can solve “human safety problems” by training the humans supervising the AI to not have those problems, because it sure does make the technical problem of aligning AI seem a lot easier.
Note that humans play two distinct roles in IDA, and I think it’s important to separate them:
1. They are used to train corrigible reasoning, because we don’t have a sufficiently good explicit understanding of corrigible reasoning. This role could be removed if e.g. MIRI’s work on agent foundations were sufficiently successful.
2. The AI that we’ve trained is then tasked with the job of helping the user get what they “really” want, which is indirectly encoded in the user.
Solving safety problems for humans in step #1 is necessary to solve intent alignment. This likely involves both training (whether to reduce failure probabilities or to reach appropriate universality thresholds), and using humans in a way that is robust to their remaining safety problems (since it seems clear that most of them cannot be removed).
Solving safety problems for humans in step #2 is something else altogether. At this point you have a bunch of humans in the world who want AIs that are going to help them get what they want, and I don’t think it makes that much sense to talk about replacing those humans with highly-trained supervisors—the supervisors might play a role in step #1 as a way of getting an AI that is trying to help the user get what they want, but can’t replace the user themselves in step #2. I think relevant measures at this point are things like:
Learn more about how to deliberate “correctly,” or about what kinds of pressures corrupt human values, or about how to avoid such corruption, or etc. If more such understanding is available, then both AIs and humans can use it to avoid corruption. In the long run AI systems will do much more work on this problem than we will, but a lot of damage could be done between now and the time when AI systems are powerful enough to obsolete all of the thinking that we do today on this topic.
Figure out how to build AIs that are better at tasks like “help humans clarify what they really want.” Differential progress in this area could be a huge win. (Again, in the long run all of the AI-design work will itself be done by AI systems, but lots of damage could be dealt in the interim as we deploy human-designed AIs that are particularly good at manipulation relative to helping humans clarify and realize their “real” values.)
Change institutions/policy/environment to reduce the risk of value corruption, especially for users that don’t have strong short-term preferences about how their short-term preferences change, or who don’t have a clear picture of how their current choices will affect that. For example, the designers of potentially-manipulative technologies may be able to set defaults that make a huge difference in how humanity’s values evolve.
You could also try to give highly wise people more influence over what actually happens, whether by acquiring resources, earning others’ trust, or whatever.
Learning from idealized humans might address this to some extent, but in many circumstances I think I would trust the real humans who are actually in those circumstances more than the idealized humans who must reason about those circumstances from afar (in their safe, familiar environment).
This objection may work for some forms of idealization, but I don’t think it holds up in general. If you think that experiencing X makes your views better, then your idealization can opt to experience X. The whole point of the idealization is that the idealized humans get to have the set of experiences that they believe are best for arriving at correct views, rather than a set of experiences that are constrained by technological feasibility / competitiveness constraints / etc.
(I agree that there can be some senses in which the idealization itself unavoidably “breaks” the idealized human—e.g. Vladimir Slepnev points out that an idealized human might conclude that they are most likely in a simulation, which may change their behavior; Wei Dai points out that they may behave selfishly towards the idealized human rather than towards the unidealized human, if selfishness is part of the values we’d converge to—but I don’t think this is one of them.)
Note that humans play two distinct roles in IDA, and I think it’s important to separate them
Yeah, I was talking entirely about the first role, thanks for the clarification.
This objection may work for some forms of idealization, but I don’t think it holds up in general. If you think that experiencing X makes your views better, then your idealization can opt to experience X.
I agree now, I misunderstood what the point of the idealization in the original point was. (I thought it was to avoid having experiences that could cause value corruption, whereas it was actually about having experiences only when ready for them.)
Wei Dai points out that they may behave selfishly towards the idealized human rather than towards the unidealized human
I think this was tangentially related to my objection, for example that an idealized human would choose eg. not to be waterboarded even though that experience is important for deciding what to do for the unidealized human. Though the particular objection I wrote was based on a misunderstanding.
Note that humans play two distinct roles in IDA, and I think it’s important to separate them
This seems like a really important clarification, but in your article on corrigibility, you only ever talk about one human, the overseer, and the whole argument about “basin of attraction” seems to rely on having one human be both the trainer for corrigibility, target of corrigibility, and source of preferences:
But a corrigible agent prefers to build other agents that share the overseer’s preferences — even if the agent doesn’t yet share the overseer’s preferences perfectly. After all, even if you only approximately know the overseer’s preferences, you know that the overseer would prefer the approximation get better rather than worse.
I think in that post, the overseer is training the AI to specifically be corrigible to herself, which makes the AI aligned to herself. I’m not sure what is happening in the new scheme with two humans. Is the overseer now still training the AI to be corrigible to herself, which produces an AI that’s aligned to the overseer which then helps out the user because the overseer has a preference to help out the user? Or is the overseer training the AI to be corrigible to a generic user and then “plugging in” a real user into the system at a later time? If the latter, have you checked that the “basin of attraction” argument still applies? If it does, maybe that post needs to be rewritten to make that clearer?
This seems like a really important clarification, but in your article on corrigibility, you only ever talk about one human, the overseer, and the whole argument about “basin of attraction” seems to rely on having one human be both the trainer for corrigibility, target of corrigibility, and source of preferences:
Corrigibility plays a role both within amplification and in the final agent.
The post is mostly talking about the final agent without talking about IDA specifically.
The section titled Amplification is about the internal dynamics, where behavior is corrigible by the question-asker. It doesn’t seem important to me that these be the same. Corrigibility to the overseer only leads to corrigibility to the end user if the overseer is appropriately motivated. I usually imagine the overseer as something like a Google engineer and the end user as something like a visitor to google.com today. The resulting agent will likely be imperfectly corrigible because of the imperfect motives of Google engineers (this is pretty similar to human relationships around other technologies).
I’m no longer as convinced that corrigibility is the right abstraction for reasoning about internal behavior within amplification (but am still pretty convinced that it’s a good way to reason about the external behavior, and I do think “corrigible” is closer to what we want than “benign” was). I’ve been thinking about these issues recently and it will be touched on in an upcoming post.
Is the overseer now still training the AI to be corrigible to herself, which produces an AI that’s aligned to the overseer which then helps out the user because the overseer has a preference to help out the user?
This is basically right. I’m usually imagining the overseer training a general question-answering system, with the AI trained to be corrigible to the question-asker. We then use that question-answering system to implement a corrigible agent, by using it to answer questions like “What should the agent do next?” (with an appropriate specification of ‘should’), which is where external corrigibility comes in.
This is basically right. I’m usually imagining the overseer training a general question-answering system, with the AI trained to be corrigible to the question-asker.
This confuses me because you’re saying “basically right” to something but then you say something that seems very different, and which actually seems closer to the other option I was suggesting. Isn’t it very different for the overseer to train the AI to be corrigible to herself as a specific individual, versus training the AI to be corrigible to whoever is asking the current question? Since the AI can’t know who is asking the current question (which seems necessary to be corrigible to them?) without that being passed in as additional information, this seems closer to ‘overseer training the AI to be corrigible to a generic user and then “plugging in” a real user into the system at a later time’.
I also have a bunch of other confusions, but it’s probably easier to talk about them after resolving this one.
(Also, just in case, is there a difference between “corrigible to” and “corrigible by”?)
this seems closer to ‘overseer training the AI to be corrigible to a generic user and then “plugging in” a real user into the system at a later time’.
The overseer asks the question “what should the agent do [to be corrigible to the Google customer Alice it is currently working for]?”, and indeed even at training time the overseer is training the system to answer this question. There is no swapping out at test time. (The distributions at train and test time are identical, and I normally talk about the version where you keep training online.)
When the user asks a question to the agent it is being answered by indirection, by using the question-answering system to answer “what should the agent do [in the situation when it has been asked question Q by the user]?”
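To make that indirection concrete, here is a minimal sketch of the structure being described, assuming a hypothetical trained question-answering callable `qa` (this is an illustration of the indirection, not code from the scheme itself):

```python
# Sketch only: `qa` stands in for the trained question-answering system, and the
# wrapped meta-question mirrors the indirection described above. The agent never
# answers the user directly; it asks what it *should* do given the user's question,
# where "should" is whatever notion the overseer trained in.
def agent_act(qa, user_question: str) -> str:
    meta_question = (
        "What should the agent do in the situation where it has been asked "
        f"the question: {user_question!r}?"
    )
    return qa(meta_question)
```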
The overseer asks the question “what should the agent do [to be corrigible to the Google customer Alice it is currently working for]?“
Ok, I’ve been trying to figure out what would make the most sense and came to the same conclusion. I would also note that this “corrigible” is substantially different from the “corrigible” in “the AI is corrigible to the question asker” because it has to be an explicit form of corrigibility that is limited by things like corporate policy. For example, if Alice asks “What are your design specs and source code?” or “How do I hack into this bank?” then the AI wouldn’t answer even though it’s supposed to be “corrigible” to the user, right? Maybe we need modifiers to indicate which corrigibility we’re talking about, like “full corrigibility” vs “limited corrigibility”?
ETA: Actually, does it even make sense to use the word “corrigible” in “to be corrigible to the Google customer Alice it is currently working for”? Originally “corrigible” meant:
A corrigible agent experiences no preference or instrumental pressure to interfere with attempts by the programmers or operators to modify the agent, impede its operation, or halt its execution.
But obviously Google’s AI is not going to allow a user to “modify the agent, impede its operation, or halt its execution”. Why use “corrigible” here instead of different language altogether, like “helpful to the extent allowed by Google policies”?
(Also, just in case, is there a difference between “corrigible to” and “corrigible by”?)
No. I was just saying “corrigible by” originally because that seems more grammatical, and sometimes saying “corrigible to” because it seems more natural. Probably “to” is better.
My interpretation of what you’re saying here is that the overseer in step #1 can do a lot to shape how the AI interprets “help the user get what they really want”, in ways that get the AI to try to eliminate human safety problems for the step #2 user (possibly entirely), but problems might still occur in the short term before the AI is able to think/act to remove those safety problems.
It seems to me that this implies that IDA essentially solves the AI alignment portion of points 1 and 2 in the original post (modulo things happening before the AI is in control).
It seems quite difficult to me to build AI systems that are safe, without having them rely on humans in some way to determine what is and is not good behavior. As a result, I’m pessimistic about our chances at creating AI systems that can address “human safety problems”.
The second sentence doesn’t seem to follow logically from the first. There are many ways that an AI can rely on humans to determine what is and is not good behavior, some of which put more stress on human safety than others. At the most ideal (least reliant on humans being safe) end of the spectrum, we could solve metaphilosophy in a white-box way, and have the AI extract true/normative values from humans using its own philosophical understanding of what that means.
Aside from the “as a result” part, I’m also pessimistic but I don’t see a reason to be so pessimistic that the current level of neglect could be justified.
in many circumstances I think I would trust the real humans who are actually in those circumstances more than the idealized humans who must reason about those circumstances from afar (in their safe, familiar environment).
This is not obvious to me. Can you elaborate or give an example?
Idealized humans can be thought of an instance of this approach where rather than change the policy of humans in reality, we change their policy in a hypothetical.
Idealized humans can be safer not just because they have better policies, but also because they have safer environments.
Overall, I’m hoping that we can solve “human safety problems” by training the humans supervising the AI to not have those problems
Can you explain how one might train a human not to have the second problem in this post?
ETA: Also, it seems like you’re saying that a small group of highly trained supervisors will act as gatekeepers to all AIs, and use their judgement to, for example, deny requests for technologies that they don’t think humans are ready for yet. Is that right?
The second sentence doesn’t seem to follow logically from the first.
True. I don’t know exactly how to convey this intuition. I think the part that seems very hard to do is to get the AI system to outperform humans at figuring out what is good to do. But this is not exactly true—certainly at chess an AI system can outperform us at choosing the best action. So maybe we could say that it’s very hard to get the AI system to outperform humans at figuring out the best criteria for figuring out what is good to do. But I don’t really believe that either—I expect that AI systems will be able to do philosophy better than we will at some point.
Coming at it from another angle, I think I would say that thousands of years of human intellectual effort has been put into trying to figure out what is good to do, without much consensus. It seems very hard to build an AI system that is going to outperform all of that intellectual effort and be completely correct. (This very directly applies to the most ideal scenario that you laid out.)
This is not obvious to me. Can you elaborate or give an example?
A particular example from the Skin in the Game (SSC) post: “A really good example of this is Hitchens’ waterboarding – he said he thought it was an acceptable interrogation technique, some people offered to waterboard him to see if he changed his mind, and he quickly did. I’m fascinated by this incident because it’s hard for me to understand, in a Mary’s Room style sense, what he learned from the experience. He must have already known it was very unpleasant – otherwise why would you even support it as useful for interrogation? But somehow there’s a difference between having someone explain to you that waterboarding is horrible, and undergoing it yourself.”
Also all of the worries about paternalism in the charity sector.
Also there’s the simple distributional shift argument: the circumstances that real humans are in form a distributional shift for the idealized humans, so we shouldn’t trust their judgment.
Idealized humans can be safer not just because they have better policies, but also because they have safer environments.
I was imagining that the better policies are a result of the safer environments. (Like, if you’re using something like IRL to infer values, then the safe environments cause different behavior and answers to questions which cause better inferred values.)
Changed “we change their policy in a hypothetical” to “we change their policy in a hypothetical by putting them in safer environments”.
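As a toy illustration of that point (my own sketch, not something from the thread): an IRL-style learner that infers values from choice frequencies recovers different values depending on which environment the demonstrations come from, under the assumption that behavior in a strange environment is driven by “corrupted” values rather than the person’s considered ones.

```python
# Minimal toy sketch (illustrative assumptions only): the same person behaves
# Boltzmann-rationally, but in a strange environment their choices are driven by
# corrupted values, so value inference from those demonstrations goes wrong.
import numpy as np

rng = np.random.default_rng(0)

considered_values = np.array([2.0, 1.0, 0.0])  # hypothetical utilities over 3 options
corrupted_values = np.array([0.0, 1.0, 2.0])   # what a strange environment elicits

def demonstrate(values, n=500):
    """Sample Boltzmann-rational choices given whichever values drive behavior."""
    probs = np.exp(values) / np.exp(values).sum()
    return rng.choice(len(values), size=n, p=probs)

def infer_values(choices, n_options=3):
    """Maximum-likelihood values (up to an additive constant) under the Boltzmann model."""
    freqs = np.bincount(choices, minlength=n_options) / len(choices)
    return np.log(freqs + 1e-9)

safe_demos = demonstrate(considered_values)    # behavior in the safe environment
strange_demos = demonstrate(corrupted_values)  # behavior after a distributional shift

print("inferred from safe environment:   ", np.round(infer_values(safe_demos), 2))
print("inferred from strange environment:", np.round(infer_values(strange_demos), 2))
```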
Can you explain how one might train a human not to have the second problem in this post?
Good epistemics goes some of the way, but I mostly wasn’t thinking about those sorts of situations. I’ll comment on that post as well, let’s discuss further there.
ETA: Also, it seems like you’re saying that a small group of highly trained supervisors will act as gatekeepers to all AIs, and use their judgement to, for example, deny requests for technologies that they don’t think humans are ready for yet. Is that right?
Basically yes. I was imagining that each AI has its own highly trained supervisor(s). I’m not sure that the supervisors will be denying requests for technologies, more that the AI will learn patterns of reasoning and thinking from the supervisors that cause it to actually evaluate whether humans are ready for some particular technology rather than just giving it to humans without thinking about the safety implications. (This does require that the supervisors would make a similar decision if the problem was posed to them.)
Coming at it from another angle, I think I would say that thousands of years of human intellectual effort has been put into trying to figure out what is good to do, without much consensus. It seems very hard to build an AI system that is going to outperform all of that intellectual effort and be completely correct.
This doesn’t seem to follow logically either. Thousands of years of intellectual effort amounts to a tiny fraction of the computation that the universe could do. Why is that a reason to think it very hard to outperform humans? And even if the white-box metaphilosophical approach doesn’t work out, there are likely other ways to improve upon human performance, even if the result isn’t “completely correct”. (Did you perhaps interpret “address human safety problems” as “fully solve human safety problems”? I actually picked the word “address” over “solve” so as to not imply that a full solution must be possible.)
Personally, I’m pessimistic because there probably won’t be enough time to figure out and implement these ideas in a safe way before someone builds an unaligned or aligned-but-unsafe AI.
Also there’s the simple distributional shift argument: the circumstances that real humans are in form a distributional shift for the idealized humans, so we shouldn’t trust their judgment.
But the idealized humans are free to broaden their “distribution” (e.g., experience waterboarding), when they figure out how to do that in a safe way. The difference is that unlike the real humans, they won’t be forced to deal with strange new inputs before they are ready.
I was imagining that each AI has its own highly trained supervisor(s).
And these AIs/supervisors also act as a cabal to stop anyone else from running an AI without going through the same training, right? I guess this is another approach that should be considered, but don’t you think putting humans in such positions of power in itself carries a high risk of corruption, and that it will be hard to come up with training that can reliably prevent such corruption?
This doesn’t seem to follow logically either. Thousands of years of intellectual effort amounts to a tiny fraction of the computation that the universe could do. Why is that a reason to think it very hard to outperform humans?
I meant more that there has been a lot of effort put into figuring out how to normatively talk about what “good” is, and it seems necessary for us to figure that out if you want to write down code that we can know beforehand will correctly extrapolate our values even though it cannot rely on us avoiding corrupted values. Now admittedly in the past people weren’t specifically thinking about the problem of how to use a lot of computation to do this, but many of the thought experiments they propose seem to be of a similar nature (e.g., suppose that we were logically omniscient and knew the consequences of our beliefs).
It’s possible that this is mostly me misunderstanding terminology. Previously it sounded like you were against letting humans discover their own values through experience and instead having them figure it out through deliberation and reflection from afar. Now it sounds like you actually do want humans to discover their own values through experience, but you want them to be able to control the experiences and take them at their own pace.
But the idealized humans are free to broaden their “distribution” (e.g., experience waterboarding), when they figure out how to do that in a safe way. The difference is that unlike the real humans, they won’t be forced to deal with strange new inputs before they are ready.
Thanks, that clarifies things a lot. I take back the statement about trusting real humans more.
And these AIs/supervisors also act as a cabal to stop anyone else from running an AI without going through the same training, right?
Probably? I haven’t thought about it much.
don’t you think putting humans in such positions of power in itself carries a high risk of corruption, and that it will be hard to come up with training that can reliably prevent such corruption?
I agree this seems potentially problematic, I’m not sure that it’s a deal-breaker.
I meant more that there has been a lot of effort into figuring out how to normatively talk about what “good” is, and it seems necessary for us to figure that out if you want to write down code that we can know beforehand will correctly extrapolate our values even though it cannot rely on us avoiding corrupted values.
I don’t think it’s necessary for us to figure out what “good” is; instead, my “white-box metaphilosophical approach” is to figure out what “doing philosophy” is, then program/teach an AI to “do philosophy”, which would let it figure out what “good” is on its own. I think there has not been a lot of effort invested into that problem, so there’s a somewhat reasonable chance that it might be tractable. It seems worth investigating especially given that it might be the only way to fully solve the human safety problem.
It’s possible that this is mostly me misunderstanding terminology. Previously it sounded like you were against letting humans discover their own values through experience and instead having them figure it out through deliberation and reflection from afar. Now it sounds like you actually do want humans to discover their own values through experience, but you want them to be able to control the experiences and take them at their own pace.
To clarify, this is a separate approach from the white-box metaphilosophical approach. And I’m not sure whether I want the humans to discover their values through experience or through deliberation. I guess they should first deliberate on that choice and then do whatever they decide is best.
(I’m not sure if you’re just checking your own understanding of the post, or if you’re offering suggestions for how to express the ideas more clearly, or if you’re trying to improve the ideas. If the latter two, I’d also welcome more direct feedback pointing out issues in my use of language or my ideas.)
I think the first paragraph of your rewrite is missing the “obligation” part of my post. It seems that even aligned AI could exacerbate human safety problems (and make the future worse than if we magically or technologically made humans more intelligent) so I think AI designers at least have an obligation to prevent that.
For the second paragraph, I think under the proposed approach, the AI should start inferring what the idealized humans would say (or calculate how it should optimize for the idealized humans’ values-in-reflective-equilibrium, depending on details of how the AI is designed) as soon as it can, and not wait until the real humans start contradicting each other a lot, because the real humans could all be corrupted in the same direction. Even before that, it should start taking measures to protect itself from the real humans (under the assumption that the real humans might become corrupt at any time in a way that it can’t yet detect). For example, it should resist any attempts by the real humans to change its terminal goal.
(I’m not sure if you’re just checking your own understanding of the post, or if you’re offering suggestions for how to express the ideas more clearly, or if you’re trying to improve the ideas. If the latter two, I’d also welcome more direct feedback pointing out issues in my use of language or my ideas.)
Sorry, I should have explained. With most posts, there are enough details and examples that when I summarize the post for the newsletter, I’m quite confident that I got the details mostly right. This post was short enough that I wasn’t confident this was true, so I pasted it here to make sure I wasn’t changing the meaning too much.
I suppose you could think of this as a suggestion on how to express the ideas more clearly to me/the audience of the newsletter, but I think that’s misleading. It’s more that I try to use consistent language in the newsletter to make it easier for readers to follow, and the language you use is different from the language I use. (For example, you have short terms like “human safety problems” for a large class of concepts, each of which I spell out in full sentences with examples.)
I think the first paragraph of your rewrite is missing the “obligation” part of my post. It seems that even aligned AI could exacerbate human safety problems (and make the future worse than if we magically or technologically made humans more intelligent) so I think AI designers at least have an obligation to prevent that.
Good point, added it in the newsletter summary.
For the second paragraph, I think under the proposed approach, the AI should start inferring what the idealized humans would say (or calculate how it should optimize for the idealized humans’ values-in-reflective-equilibrium, depending on details of how the AI is designed) as soon as it can, and not wait until the real humans start contradicting each other a lot, because the real humans could all be corrupted in the same direction. Even before that, it should start taking measures to protect itself from the real humans (under the assumption that the real humans might become corrupt at any time in a way that it can’t yet detect). For example, it should resist any attempts by the real humans to change its terminal goal.
Hmm, that’s what I was trying to say. I’ve changed the last sentence of that paragraph to:
But if the idealized humans begin to have different preferences from real humans, then the AI system should ignore the “corrupted” values of the real humans.
If it’s too hard to make AI systems in this way and we need to have them learn goals from humans, we could at least have them learn from idealized humans rather than real ones.
My interpretation of how the term is used here and elsewhere is that idealized humans are usually, in themselves and when we ignore costs, worse than real ones. For example, they could be based on predictions of human behavior that are not quite accurate, or they may only remain sane for an hour of continuous operation from some initial state. They are only better because they can be used in situations where real humans can’t be used, such as in an infinite HCH, an indirect normativity style definition of AI goals, or a simulation of how a human develops when exposed to a certain environment (training). Their nature as inaccurate predictions may make them much more computationally tractable and actually available in situations where real humans aren’t, and so more useful when we can compensate for the errors. So a better term might be “abstract humans” or “models of humans”.
If these artificial environments with models of humans are good enough, they may also be able to bootstrap more accurate models of humans and put them into environments that produce better decisions, so that the initial errors in prediction won’t affect the eventual outcomes.
Perhaps Wei Dai could clarify, but I thought the point of idealized humans was to avoid problems of value corruption or manipulation, which makes them better than real ones.
I agree that idealized humans have the benefit of making things like infinite HCH possible, but that doesn’t seem to be a main point of this post.
I thought the point of idealized humans was to avoid problems of value corruption or manipulation
Among other things, yes.
which makes them better than real ones
This framing loses the distinction I’m making. They are more useful when taken together with their environment, but not necessarily better in themselves. These are essentially real humans who behave better because of the environments in which they operate and the lack of direct influence from the outside world, which in some settings could also apply to the environments in which they were raised. But they share the same vulnerabilities (to outside influence or unusual situations) as real humans, which can affect them if they are taken outside their safe environments. And in themselves, when abstracted from their environment, they may be worse than real humans, in the sense that they make less aligned or correct decisions, if the idealized humans are inaccurate predictions of the hypothetical behavior of real humans.
Yeah, I agree with all of this. How would you rewrite my sentence/paragraph to be clearer, without making it too much longer?