we need to have humanity (especially the AI’s operators) agree that metaphilosophy is hard and needs to be solved
I’m not sure I understand your proposal here. What are they agreeing to exactly? Stopping technological development at a certain level until metaphilosophy is solved?
But it should be able to predict which things a human would find crazy, for which it should probably get the human’s approval before doing them
Think of the human as a really badly designed AI: a convoluted architecture that nobody understands, spaghetti code, full of security holes, with no idea what its terminal values are and really confused even about its “interim” values, with all kinds of potential safety problems like not being robust to distributional shifts, and only “safe” in the sense of having passed certain tests on a very narrow distribution of inputs.
Clearly it’s not safe for a much more powerful outer AI to query the human about arbitrary actions that it’s considering, right? Instead, if the human is to contribute anything at all to safety in this situation, the outer AI has to figure out how to generate a bunch of smaller queries that the human can safely handle, from which it would then infer what the human would say if it could safely consider the actual choice under consideration. If the AI is bad at this “competence” problem it could send unsafe queries to the human and corrupt the human, and/or infer the wrong thing about what the human would approve of.
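To make this concrete, here is a minimal sketch (all function names are hypothetical placeholders) of the loop I have in mind, with the two ways the “competence” problem can bite marked in comments:

```python
# Hypothetical sketch only: all of the difficulty is hidden inside decompose,
# is_safe_for_human, and infer_approval, which is exactly the point.

def query_human_safely(proposed_action, human, decompose, is_safe_for_human, infer_approval):
    """Approximate what the human would say about `proposed_action`
    without ever showing them something outside their safe input range."""
    sub_queries = decompose(proposed_action)
    answers = []
    for q in sub_queries:
        # Failure mode 1: a bad safety filter lets an unsafe query through
        # and corrupts the human.
        if not is_safe_for_human(q):
            continue
        answers.append((q, human(q)))
    # Failure mode 2: even with safe queries, a bad inference step concludes
    # the wrong thing about what the human would have approved of.
    return infer_approval(proposed_action, answers)
```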
Is it clearer now why this doesn’t seem like an easy problem to me?
for example, it seems relatively easy for an AGI to figure out that power corrupts and that humanity has not liked it when that happened
I’m not sure what you think the AGI would figure out, and what it would do in response to that. Are you suggesting something like, based on historical data, it would learn a classifier to predict what kind of new technologies or choices would change human values in a way that we would not like, and restrict those technologies/choices from us? It seems far from easy to do this in a robust way. I mean this classifier would be facing lots of unpredictable distributional shifts… I guess you made a similar point when you said “On the other hand, there may be similar types of events in the future that we can’t back out by looking at the past.”
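To make the distributional-shift worry concrete, here is a hypothetical sketch of the classifier-based restriction scheme I have in mind; every name in it is made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    restrict: bool
    reason: str

def screen_choice(features, predict_corruption_risk, novelty,
                  novelty_threshold=0.9, risk_threshold=0.5):
    """predict_corruption_risk is trained on historical cases of value corruption
    (e.g. power); novelty estimates how far `features` are from that history."""
    if novelty(features) > novelty_threshold:
        # The crux: the future technologies/choices that matter most are the ones
        # least like anything in the historical record, so the classifier has no
        # reliable prediction here, whichever default we pick.
        return Verdict(True, "out of distribution; no reliable prediction")
    risk = predict_corruption_risk(features)
    return Verdict(risk > risk_threshold, f"estimated corruption risk {risk:.2f}")
```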
ETA: Do you expect that different AIs would do different things in this regard depending on how cautious their operators are? Like some AIs would learn from their operators to be really cautious, and restrict technologies/choices that they aren’t sure won’t corrupt humans, but other operators and their AIs won’t be so cautious, so a bunch of humans will be corrupted as a result, but that’s a lower priority problem because you think most AI operators will be really cautious so the percentage of value lost in the universe isn’t very high? (This is my current understanding of Paul’s position, and I wonder if you have a different position or a different way of putting it that would convince me more.) What about the problem that the corrupted humans/AIs could produce a lot of negative utility even if they are small in numbers? What about the problem of the cautious AIs being at a competitive disadvantage against other AIs who are less cautious about what they are willing to do?
I think sufficiently narrow AI systems have essentially no hope of solving or avoiding these problems in general, regardless of safety techniques we develop, and so in the short term to avoid these problems you want to intervene on the humans who are deploying AI systems.
This seems right.
We can also aim to have the world mostly run by aligned AI systems rather than unaligned ones, which would mean that there isn’t much opportunity for us to be manipulated.
Manipulation doesn’t have to come just from unaligned AIs, it could also come from AIs that are aligned to other people. For example, if an AI is aligned to Alice, and Alice sees something to be gained by manipulating Bob, the AI being aligned won’t stop Alice from using it to manipulate Bob.
ETA: I forgot to mention that I don’t understand this part, can you please explain more:
One reason it might not be urgent is because we need to aim for competitiveness anyway—our AI systems need to be competitive so that economic incentives don’t cause us to use unaligned variants.
I’m not sure I understand your proposal here. What are they agreeing to exactly? Stopping technological development at a certain level until metaphilosophy is solved?
I don’t know; I want to outsource that decision to humans + AI at the time when it is relevant. Perhaps it involves stopping technological development. Perhaps it means continuing technological development, but not doing any space colonization. My point is simply that if humans agree that metaphilosophy needs to be solved, and the AI is trying to help humans, then metaphilosophy will probably be solved, even if I don’t know exactly how it will happen.
Is it clearer now why this doesn’t seem like an easy problem to me?
Yes. It seems to me like you’re considering the case where a human has to be able to give the correct answer to any question of the form “is this action a good thing to do?” I’m claiming that we could instead grow the set of things the AI does gradually, to give time for humans to figure out what it is they want. So I was imagining that humans would answer the AI’s questions in a frame where they have a lot of risk aversion, so anything that seemed particularly impactful would require a lot of deliberation before being approved.
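A toy rendering of what I mean, with every name hypothetical; the point is only the shape of the control flow, not the details:

```python
# Toy sketch: the AI's action space starts tiny and only grows after risk-averse
# human review; anything estimated to be high-impact gets slow deliberation.

approved_kinds = set()   # grows gradually over time

def maybe_act(action, kind, estimated_impact, quick_review, deliberate,
              impact_threshold=0.1):
    if kind in approved_kinds:
        return action                       # inside the already-approved envelope
    if estimated_impact > impact_threshold:
        verdict = deliberate(action)        # lots of deliberation before approval
    else:
        verdict = quick_review(action)      # still a human in the loop
    if verdict == "approve":
        approved_kinds.add(kind)
        return action
    return None                             # default: don't act
```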
I’m not sure what you think the AGI would figure out, and what it would do in response to that. Are you suggesting something like, based on historical data, it would learn a classifier to predict what kind of new technologies or choices would change human values in a way that we would not like, and restrict those technologies/choices from us?
I was thinking more of the case where a single human amassed a lot of power. Humans don’t seem to have solved the problem of predicting how new technologies/choices would change human values, so that seems like quite a hard problem to solve (but perhaps AI could do it). I meant more that, conditional on the AI knowing how some new technology or choice would affect us, it seems not too hard to figure out whether we would view it as a good thing.
Do you expect that different AIs would do different things in this regard depending on how cautious their operators are?
Yes.
that’s a lower priority problem because you think most AI operators will be really cautious so the percentage of value lost in the universe isn’t very high?
Kind of? I’d amend that slightly to say that to the extent that I think it is a problem (I’m not sure), I want to solve it in some way that is not technical research. (Possibilities: convince everyone to be cautious, obtain a decisive strategic advantage and enforce that everyone is cautious.)
What about the problem that the corrupted humans/AIs could produce a lot of negative utility even if they are small in numbers?
Same as above.
Manipulation doesn’t have to come just from unaligned AIs, it could also come from AIs that are aligned to other people. For example, if an AI is aligned to Alice, and Alice sees something to be gained by manipulating Bob, the AI being aligned won’t stop Alice from using it to manipulate Bob.
Same as above. All of these problems that you’re talking about would also apply to technology that could make a human smarter. It seems like it would be easiest to address them on that level, rather than trying to build an AI system that can deal with these problems even though the operator would not want it to correct for them.
What about the problem of the cautious AIs being at a competitive disadvantage against other AIs who are less cautious about what they are willing to do?
This seems like an empirical fact that makes the problems listed above harder to solve.
I forgot to mention that I don’t understand this part, can you please explain more:
One reason it might not be urgent is because we need to aim for competitiveness anyway—our AI systems need to be competitive so that economic incentives don’t cause us to use unaligned variants.
So I broadly agree with Paul’s reasons for aiming for competitiveness. Given competitiveness, you might hope that we would automatically get defense against value manipulation by other AIs, since our aligned AI will defend us from value manipulation by similarly-capable unaligned AIs (or aligned AIs that other people have). Of course, defense might be a lot harder than offense, and you probably do think that, in which case this doesn’t really help us. (As I said, I haven’t really thought about this before.)
Overall view: I don’t think that the problems you’ve mentioned are obviously going to be solved as a part of AI alignment. I think that solving them will require mostly interventions on humans, not on the development of AI. I am weakly optimistic that humans will actually be able to coordinate and solve these problems as a result. If I were substantially more pessimistic, I would put more effort into strategy and governance issues. (Not sure I would change what I’m doing given my comparative advantage at technical research, but it would at least change what I advise other people do.)
Meta-view on our disagreement: I suspect that you have been talking about the problem of “making the future go well” while I’ve been talking about the problem of “getting AIs to do what we want” (which do seem like different problems to me). Most of the problems you’ve been talking about don’t even make it into the bucket of “getting AIs to do what we want” the way I think about it, so some of the claims (like “the urgent part is in the motivation subproblem”) are not meant to quantify over the problems you’re identifying. I think we do disagree on how important the problems you identify are, but not as much as you would think, since I’m quite uncertain about this area of problem-space.
I am weakly optimistic that humans will actually be able to coordinate and solve these problems as a result.
Why isn’t that also an argument against the urgency of solving AI motivation? I.e., we don’t need to urgently solve AI motivation because humans will be able to coordinate to stop or delay AI development long enough to solve AI motivation at leisure?
It seems to me that coordination is really hard. Yes, we have to push on that, but we also have to push on potential technical solutions, because most likely coordination will fail, and there is enough uncertainty about the difficulty of technical solutions that I think we urgently need more people to investigate the problems to see how hard they really are.
Aside from that, I think it’s also really important to better predict/understand just how difficult solving those problems is (both socially and technically), because that understanding is highly relevant to strategic decisions we have to make today. For example, if those problems are very difficult to solve, so that in expectation we end up losing most of the potential value of the universe even if we solve AI motivation, then that greatly reduces the value of working on motivation relative to something like producing evidence of the difficulty of those problems in order to convince policymakers to try to coordinate on stopping/delaying AI progress, or trying to create a singleton AI. That’s why I was asking you for details of what you think the social solutions would look like.
so some of the claims (like “the urgent part is in the motivation subproblem”) are not meant to quantify over the problems you’re identifying
I see, in that case I would appreciate disclaimers or clearer ways of stating that, so that people who might want to work on these problems are not discouraged from doing so more strongly than you intend.
I’m quite uncertain about this area of problem-space
Why isn’t that also an argument against the urgency of solving AI motivation? I.e., we don’t need to urgently solve AI motivation because humans will be able to coordinate to stop or delay AI development long enough to solve AI motivation at leisure?
Two reasons come to mind:
Stopping or delaying AI development feels more like trying to interfere with an already-running process, whereas there are no existing norms on what we use AI for that we would have to fight against, and debates on those norms are already beginning. For new things, I expect the public to be particularly risk-averse.
Relatedly, it is a lot easier to make norms/laws/regulations now that bind our future selves. On an individual level, it seems easier to delay your chance of going to Mars if you know you’re going to get a hovercar soon. On a societal scale, it seems easier to delay space colonization if we’re going to have lives of leisure due to automation, or to delay full automation if we’re soon going to get 4-hour workdays. Looking at the things governments and corporations say, it seems like they would be likely to do things like this. I think it makes a lot of sense to try to direct these efforts at the right target.
I want to emphasize though that my method here was having an intuition and querying for reasons behind the intuition. I would be a little surprised if someone could convince me my intuition is wrong in ~half an hour of conversation. I would not be surprised if someone could convince me that my reasons are wrong in ~half an hour of conversation.
It seems to me that coordination is really hard. Yes, we have to push on that, but we also have to push on potential technical solutions, because most likely coordination will fail, and there is enough uncertainty about the difficulty of technical solutions that I think we urgently need more people to investigate the problems to see how hard they really are.
I think it would help me if you suggested some ways that technical solutions could help with these problems. For example, with coordinating to prevent/delay corrupting technologies, the fundamental problem to me seems to be that with any technical solution, the thing that the AI does will be against the operator’s wishes-upon-reflection. (If your technical solution is in line with the operator’s wishes-upon-reflection, then I think you could also solve the problem by solving motivation.) This seems both hard to design (where does the AI get the information about what to do, if not from the operator’s wishes-upon-reflection?) and hard to implement (why would the operator use a system that’s going to do something they don’t want?).
You might argue that there are things that the operator would want if they could get them (eg. global coordination), but they can’t achieve them now, and so we need a technical solution for that. However, it seems like we are in the same position as a well-motivated AI w.r.t. that operator. For example, if we try to cede control to FairBots that rationally cooperate with each other, a well-motivated AI could also do that.
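For concreteness, here is a crude simulation-based stand-in for the FairBot idea (the FairBot in the literature is defined via provability logic; this toy version just bounds the mutual simulation with a depth limit):

```python
def fairbot(opponent, depth=3):
    """Cooperate iff a bounded simulation of the opponent cooperates with us."""
    if depth == 0:
        return "C"                          # optimistic base case to cut the regress
    me = lambda opp, d=depth - 1: fairbot(opp, d)
    return "C" if opponent(me, depth - 1) == "C" else "D"

def defectbot(opponent, depth=3):
    return "D"

print(fairbot(fairbot))    # C -- two FairBots end up cooperating
print(fairbot(defectbot))  # D -- but FairBot is not exploited by DefectBot
```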
Aside from that, I think it’s also really important to better predict/understand just how difficult solving those problems is (both socially and technically), because that understanding is highly relevant to strategic decisions we have to make today. For example, if those problems are very difficult to solve, so that in expectation we end up losing most of the potential value of the universe even if we solve AI motivation, then that greatly reduces the value of working on motivation relative to something like producing evidence of the difficulty of those problems in order to convince policymakers to try to coordinate on stopping/delaying AI progress, or trying to create a singleton AI. That’s why I was asking you for details of what you think the social solutions would look like.
Agreed. I view a lot of strategy research (eg. from FHI and OpenAI) as figuring this out from the social side, and some of my optimism is based on conversations with those researchers. On the technical side, I feel quite stuck (for the reasons above), though I haven’t tried hard enough to say that it’s too difficult to do.
I see, in that case I would appreciate disclaimers or clearer ways of stating that, so that people who might want to work on these problems are not discouraged from doing so more strongly than you intend.
I’ll keep that in mind. When I wrote the original comment, I wasn’t even thinking about problems like the ones you mention, because I categorize them as “strategy” by default, and I was trying to talk about the technical problem.
Stopping or delaying AI development feels more like trying to interfere with an already-running process, whereas there are no existing norms on what we use AI for that we would have to fight against, and debates on those norms are already beginning. For new things, I expect the public to be particularly risk-averse.
Do you think that at the time when AI development wasn’t an already-running process, and AI was still a new thing that the public could be expected to be risk-averse about (when would you say that was?), the argument “working on alignment isn’t urgent because humans can probably coordinate to stop AI development” would have been a good one?
Relatedly, it is a lot easier to make norms/laws/regulations now that bind our future selves.
Same question here. Back when “don’t develop AI” was still a norm we could have made binding on our future selves, should we have expected that we would coordinate to stop AI development, and is it just bad luck that we haven’t succeeded in doing so?
Looking at the things governments and corporations say, it seems like they would be likely to do things like this.
Can you be more specific? What global agreement do you think would be reached, that is both realistic and would solve the kinds of problems that I’m worried about (e.g., unintentional corruption of humans by “aligned” AIs who give humans too much power or options that they can’t handle, and deliberate manipulation of humans by unaligned AIs or AIs aligned to other users)?
I think it would help me if you suggested some ways that technical solutions could help with these problems.
For example, create an AI that can help the user with philosophical questions at least as much as technical questions. (This could be done for example by figuring out how to better use Iterated Amplification to answer philosophical questions, or how to do imitation learning of human philosophers, or how to apply inverse reinforcement learning to philosophical reasoning.) Then the user could ask questions like “Am I likely to be corrupted by access to this technology? What can I do to prevent that while still taking advantage of it?” Or “Is this just an extremely persuasive attempt at manipulation or an actually good moral argument?”
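Mechanically, the Iterated Amplification route might look something like the sketch below, where consult_human, decompose, and combine are placeholders for the parts that would actually need to be figured out:

```python
def amplified_answer(question, consult_human, decompose, combine, depth=2):
    """Answer a (possibly philosophical) question by recursively breaking it
    into subquestions a human can handle and combining the answers."""
    if depth == 0:
        return consult_human(question)
    subquestions = decompose(question)
    subanswers = [amplified_answer(q, consult_human, decompose, combine, depth - 1)
                  for q in subquestions]
    return combine(question, subanswers)

# e.g. amplified_answer("Is this argument manipulation or genuine moral insight?", ...)
```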
As another example, solve metaethics and build that into the AI so that the AI can figure out or learn the actual terminal values of the user, which would make it easier to protect the user from manipulation and self-corruption. And even if the human user is corrupted, the AI still has the correct utility function, and when it has made enough technological progress it can uncorrupt the human.
I view a lot of strategy research (eg. from FHI and OpenAI) as figuring this out from the social side, and some of my optimism is based on conversations with those researchers.
Can you point me to any relevant results that have been written down, or explain what you learned from those conversations?
On the technical side, I feel quite stuck (for the reasons above), though I haven’t tried hard enough to say that it’s too difficult to do.
To address this and the question (from the parallel thread) of whether you should personally work on this, I think we need people to either solve the technical problems or at least to collectively try hard enough to convincingly say that it’s too difficult to do. (Otherwise who is going to convince policymakers to adopt the very costly social solutions? Who is going to convince people to start/join a social movement to influence policymakers to consider those costly social solutions? The fact that those things tend to take a lot of time seems like sufficient reason for urgency on the technical side, even if you expect the social solutions to be feasible.) Who are these people going to be, especially the first ones to join the field and help grow it? Probably existing AI alignment researchers, right? (I can probably make stronger arguments in this direction but I don’t want to be too “pushy” so I’ll stop here.)
I forgot to follow up on this important part of our discussion:
All of these problems that you’re talking about would also apply to technology that could make a human smarter.
It seems to me that a technology that could make a human smarter is much more likely (compared to AI) to accelerate all forms of intellectual progress (e.g., technological progress and philosophical/moral progress) about equally, and therefore would have a less significant effect on the kinds of problems that I’m talking about (which are largely caused by technological progress outpacing philosophical/moral progress). I could make some arguments about this, but I’m curious if this doesn’t seem obvious to you.
Assuming the above, and assuming that one has moral uncertainty that gives some weight to the concept of moral responsibility, it seems to me that an additional argument for AI researchers to work on these problems is that it’s a moral responsibility of AI researchers/companies to try to solve problems that they create, for example via technological solutions, or by coordinating amongst themselves, or by convincing policymakers to coordinate, or by funding others to work on these problems, etc., and they are currently neglecting to do this (especially with regard to the particular problems that I’m pointing out).
It seems to me that a technology that could make a human smarter is much more likely (compared to AI) to accelerate all forms of intellectual progress (e.g., technological progress and philosophical/moral progress) about equally, and therefore would have a less significant effect on the kinds of problems that I’m talking about (which are largely caused by technological progress outpacing philosophical/moral progress).
Yes, I agree with this. The reason I mentioned that was to make the point that the problems are a function of progress in general and aren’t specific to AI—they are just exacerbated by AI. I think this is a weak reason to expect that solutions are likely to come from outside of AI.
Assuming the above, and assuming that one has moral uncertainty that gives some weight to the concept of moral responsibility, it seems to me that an additional argument for AI researchers to work on these problems is that it’s a moral responsibility of AI researchers/companies to try to solve problems that they create, for example via technological solutions, or by coordinating amongst themselves, or by convincing policymakers to coordinate, or by funding others to work on these problems, etc., and they are currently neglecting to do this.
This seems true. Just to make sure I’m not misunderstanding, this was meant to be an observation, and not meant to argue that I personally should prioritize this, right?
The reason I mentioned that was to make the point that the problems are a function of progress in general and aren’t specific to AI—they are just exacerbated by AI. I think this is a weak reason to expect that solutions are likely to come from outside of AI.
This doesn’t make much sense to me. Why is this any kind of reason to expect that solutions are likely to come from outside of AI? Can you give me an analogy where this kind of reasoning more obviously makes sense?
Just to make sure I’m not misunderstanding, this was meant to be an observation, and not meant to argue that I personally should prioritize this, right?
Right, this argument wasn’t targeted to you, but I think there are other reasons for you to personally prioritize this. See my comment in the parallel thread.