I am weakly optimistic that humans will actually be able to coordinate and solve these problems as a result.
Why isn’t that also an argument against the urgency of solving AI motivation? I.e., we don’t need to urgently solve AI motivation because humans will be able to coordinate to stop or delay AI development long enough to solve AI motivation at leisure?
It seems to me that coordination is really hard. Yes, we have to push on that, but we also have to push on potential technical solutions because most likely coordination will fail, and there is enough uncertainty about the difficulty of technical solutions that I think we urgently need more people to investigate the problems to see how hard they really are.
Aside from that, I think it’s also really important to better predict/understand just how difficult solving those problems is (both socially and technically), because that understanding is highly relevant to strategic decisions we have to make today. For example, if those problems are so difficult to solve that in expectation we end up losing most of the potential value of the universe even if we solve AI motivation, then that greatly reduces the value of working on motivation relative to something like producing evidence of the difficulty of those problems in order to convince policymakers to try to coordinate on stopping/delaying AI progress, or trying to create a singleton AI. That’s why I was asking you for details of what you think the social solutions would look like.
so some of the claims (like “the urgent part is in the motivation subproblem”) are not meant to quantify over the problems you’re identifying
I see, in that case I would appreciate disclaimers or clearer ways of stating that, so that people who might want to work on these problems are not discouraged from doing so more strongly than you intend.
I’m quite uncertain about this area of problem-space
Why isn’t that also an argument against the urgency of solving AI motivation? I.e., we don’t need to urgently solve AI motivation because humans will be able to coordinate to stop or delay AI development long enough to solve AI motivation at leisure?
Two reasons come to mind:
Stopping or delaying AI development feels more like trying to interfere with an already-running process, whereas there are no existing norms on what we use AI for that we would have to fight against, and debates on those norms are already beginning. For new things, I expect the public to be particularly risk-averse.
Relatedly, it is a lot easier to make norms/laws/regulations now that bind our future selves. On an individual level, it seems easier to delay your chance of going to Mars if you know you’re going to get a hovercar soon. On a societal scale, it seems easier to delay space colonization if we’re going to have lives of leisure due to automation, or to delay full automation if we’re soon going to get 4-hour workdays. Looking at the things governments and corporations say, it seems like they would be likely to do things like this. I think it makes a lot of sense to try to direct these efforts at the right target.
I want to emphasize though that my method here was having an intuition and querying for reasons behind the intuition. I would be a little surprised if someone could convince me my intuition is wrong in ~half an hour of conversation. I would not be surprised if someone could convince me that my reasons are wrong in ~half an hour of conversation.
It seems to me that coordination is really hard. Yes, we have to push on that, but we also have to push on potential technical solutions because most likely coordination will fail, and there is enough uncertainty about the difficulty of technical solutions that I think we urgently need more people to investigate the problems to see how hard they really are.
I think it would help me if you suggested some ways that technical solutions could help with these problems. For example, with coordinating to prevent/delay corrupting technologies, the fundamental problem to me seems to be that with any technical solution, the thing that the AI does will be against the operator’s wishes-upon-reflection. (If your technical solution is in line with the operator’s wishes-upon-reflection, then I think you could also solve the problem by solving motivation.) This seems both hard to design (where does the AI get the information about what to do, if not from the operator’s wishes-upon-reflection?) as well as hard to implement (why would the operator use a system that’s going to do something they don’t want?).
You might argue that there are things that the operator would want if they could get them (e.g. global coordination), but they can’t achieve them now, and so we need a technical solution for that. However, it seems like we are in the same position as a well-motivated AI w.r.t. that operator. For example, if we try to cede control to FairBots that rationally cooperate with each other, a well-motivated AI could also do that.
Aside from that, I think it’s also really important to better predict/understand just how difficult solving those problems is (both socially and technically), because that understanding is highly relevant to strategic decisions we have to make today. For example, if those problems are so difficult to solve that in expectation we end up losing most of the potential value of the universe even if we solve AI motivation, then that greatly reduces the value of working on motivation relative to something like producing evidence of the difficulty of those problems in order to convince policymakers to try to coordinate on stopping/delaying AI progress, or trying to create a singleton AI. That’s why I was asking you for details of what you think the social solutions would look like.
Agreed. I view a lot of strategy research (e.g. from FHI and OpenAI) as figuring this out from the social side, and some of my optimism is based on conversations with those researchers. On the technical side, I feel quite stuck (for the reasons above), though I haven’t tried hard enough to say that it’s too difficult to do.
I see, in that case I would appreciate disclaimers or clearer ways of stating that, so that people who might want to work on these problems are not discouraged from doing so more strongly than you intend.
I’ll keep that in mind. When I wrote the original comment, I wasn’t even thinking about problems like the ones you mention, because I categorize them as “strategy” by default, and I was trying to talk about the technical problem.
Stopping or delaying AI development feels more like trying to interfere with an already-running process, whereas there are no existing norms on what we use AI for that we would have to fight against, and debates on those norms are already beginning. For new things, I expect the public to be particularly risk-averse.
Do you think that at the time when AI development wasn’t an already-running process, and AI was still a new thing that the public could be expected to be risk-averse about (when would you say that was?), the argument “working on alignment isn’t urgent because humans can probably coordinate to stop AI development” would have been a good one?
Relatedly, it is a lot easier to make norms/laws/regulations now that bind our future selves.
Same question here. Back when “don’t develop AI” was still a matter of binding our future selves, should we have expected that we would coordinate to stop AI development, and is it just bad luck that we haven’t succeeded in doing that?
Looking at the things governments and corporations say, it seems like they would be likely to do things like this.
Can you be more specific? What global agreement do you think would be reached, that is both realistic and would solve the kinds of problems that I’m worried about (e.g., unintentional corruption of humans by “aligned” AIs who give humans too much power or options that they can’t handle, and deliberate manipulation of humans by unaligned AIs or AIs aligned to other users)?
I think it would help me if you suggested some ways that technical solutions could help with these problems.
For example, create an AI that can help the user with philosophical questions at least as much as technical questions. (This could be done, for example, by figuring out how to better use Iterated Amplification to answer philosophical questions, or how to do imitation learning of human philosophers, or how to apply inverse reinforcement learning to philosophical reasoning.) Then the user could ask questions like “Am I likely to be corrupted by access to this technology? What can I do to prevent that while still taking advantage of it?” Or “Is this just an extremely persuasive attempt at manipulation or an actually good moral argument?”
As another example, solve metaethics and build that into the AI so that the AI can figure out or learn the actual terminal values of the user, which would make it easier to protect the user from manipulation and self-corruption. And even if the human user is corrupted, the AI still has the correct utility function, and when it has made enough technological progress it can uncorrupt the human.
I view a lot of strategy research (e.g. from FHI and OpenAI) as figuring this out from the social side, and some of my optimism is based on conversations with those researchers.
Can you point me to any relevant results that have been written down, or explain what you learned from those conversations?
On the technical side, I feel quite stuck (for the reasons above), though I haven’t tried hard enough to say that it’s too difficult to do.
To address this and the question (from the parallel thread) of whether you should personally work on this, I think we need people to either solve the technical problems or at least to collectively try hard enough to convincingly say that it’s too difficult to do. (Otherwise who is going to convince policymakers to adopt the very costly social solutions? Who is going to convince people to start/join a social movement to influence policymakers to consider those costly social solutions? The fact that those things tend to take a lot of time seems like sufficient reason for urgency on the technical side, even if you expect the social solutions to be feasible.) Who are these people going to be, especially the first ones to join the field and help grow it? Probably existing AI alignment researchers, right? (I can probably make stronger arguments in this direction but I don’t want to be too “pushy” so I’ll stop here.)
Ok, I appreciate that.