Re 1:

For a working scheme, I would expect it to be usable by a significant fraction of humans (say, comparable to the fraction that can learn to write a compiler).
That said, I would expect almost no one to actually play the role of the overseer, even if a scheme like this one ended up being used widely. An existing analogy would be the human trainers who drive Facebook’s M (at least in theory; I don’t know how that actually plays out). The trainers are responsible for getting M to do what the trainers want, and the user trusts the trainers to do what the user wants. From the user’s perspective, this is no different from delegating to the trainers directly and allowing them to use whatever tools they like.
I don’t yet see why “defer to human judgments and handle uncertainty in a way that they would endorse” requires evaluating complex philosophical arguments or having a correct understanding of metaphilosophy. If the case is unclear, you can punt it to the actual humans.
If I imagine an employee who sucks at philosophy but thinks 100x faster than me, I don’t feel like they are going to fail to understand how to defer to me on philosophical questions. I might run into trouble because it is now comparatively much harder to answer philosophical questions, so to save costs I will often have to do things based on rough guesses about my philosophical views. But the damage from using such guesses depends on the importance of having answers to philosophical questions in the short term.
It really feels to me like there are two distinct issues:
1. Philosophical understanding may help us make good decisions in the short term, for example about how to trade off extinction risk vs faster development, or how to prioritize the suffering of non-human animals. So having better philosophical understanding (and machines that can help us build more understanding) is good.
2. Handing off control of civilization to AI systems might permanently distort society’s values. Understanding how to avoid this problem is good.
These seem like separate issues to me. I am convinced that #2 is very important, since it seems like the largest existential risk by a fair margin and also relatively tractable. I think that #1 does add some value, but am not at all convinced that it is a maximally important problem to work on. As I see it, the value of #1 depends on the importance of the ethical questions we face in the short term (and on how long-lasting are the effects of differential technological progress that accelerates our philosophical ability).
Moreover, it seems like we should evaluate solutions to these two problems separately. You seem to be making an implicit argument that they are linked, such that a solution to #2 should only be considered satisfactory if it also substantially addresses #1. But from my perspective, that seems like a relatively minor consideration when evaluating the goodness of a solution to #2. In my view, solving both problems at once would be at most 2x as good as solving the more important of the two problems. (Neither of them is necessarily a crisp problem rather than an axis along which to measure differential technological development.)
I can see several ways in which #1 and #2 are linked, but none of them seem very compelling to me. Do you have something in particular in mind? Does my position seem somehow more fundamentally mistaken to you?
(This comment was in response to point 1, but it feels like the same underlying disagreement is central to points 2 and 3. Point 4 seems like a different concern, about how the availability of AI would itself change philosophical deliberation. I don’t really see much reason to think that the availability of powerful AI would make the endpoint of deliberation worse rather than better, but probably this is a separate discussion.)
The trainers are responsible for getting M to do what the trainers want, and the user trusts the trainers to do what the user wants.
In that case, there would be severe principal-agent problems, given the disparity in power/intelligence between the trainer/AI systems and the users. If I were someone who couldn’t directly control an AI using your scheme, I’d be very concerned about getting uneven trades or having my property expropriated outright by individual AIs or AI conspiracies, or just being ignored and left behind in the race to capture the cosmic commons. I would be really tempted to try another AI design that does purport to have the AI serve my interests directly, even if that scheme is not as “safe”.
If I imagine an employee who sucks at philosophy but thinks 100x faster than me, I don’t feel like they are going to fail to understand how to defer to me on philosophical questions.
If an employee sucks at philosophy, how does he even recognize philosophical problems as problems that he needs to consult you for? Most people have little idea that they should feel confused and uncertain about things like epistemology, decision theory, and ethics. I suppose it might be relatively easy to teach an AI to recognize the specific problems that we currently consider to be philosophical, but what about new problems that we don’t yet recognize as problems today?
Aside from that, a bigger concern for me is that if I was supervising your AI, I would be constantly bombarded with philosophical questions that I’d have to answer under time pressure, and afraid that one wrong move would cause me to lose control, or lock in some wrong idea.
Consider this scenario. Your AI prompts you for guidance because it has received a message from a trading partner with a proposal to merge your AI systems and share resources for greater efficiency and economy of scale. The proposal contains a new AI design and control scheme and arguments that the new design is safer, more efficient, and divides control of the joint AI fairly between the human owners according to your current bargaining power. The message also claims that every second you take to consider the issue has large costs to you because your AI is falling behind the state of the art in both technology and scale, becoming uncompetitive, so your bargaining power for joining the merger is dropping (slowly in the AI’s time-frame, but quickly in yours). Your AI says it can’t find any obvious flaws in the proposal, but it’s not sure that you’d consider the proposal to really be fair under reflective equilibrium or that the new design would preserve your real values in the long run. There are several arguments in the proposal that it doesn’t know how to evaluate, hence the request for guidance. But it also reminds you not to read those arguments directly since they were written by a superintelligent AI and you risk getting mind-hacked if you do.
What do you do? This story ignores the recursive structure in ALBA. I think that would only make the problem even harder, but I could be wrong. If you don’t think it would go like this, let me know how you think this kind of scenario would go.
In terms of your #1, I would divide the decisions requiring philosophical understanding into two main categories. One is decisions involved in designing/improving AI systems, like in the scenario above. The other, which I talked about in an earlier comment, is ethical disasters directly caused by people who are not uncertain, but just wrong. You didn’t reply to that comment, so I’m not sure why you’re unconcerned about this category either.
A general note: I’m not really taking a stand on the importance of a singleton, and I’m open to the possibility that the only way to achieve a good outcome even in the medium-term is to have very good coordination.
A would-be singleton will also need to solve the AI control problem, and I am just as happy to help with that problem as with the version of the AI control problem faced by a whole economy of actors each using their own AI systems.
The main way in which this affects my work is that I don’t want to count on the formation of a singleton to solve the control problem itself.
You could try to work on AI in a way that helps facilitate the formation of a singleton. I don’t think that is really helpful, but moreover it again seems like a separate problem from AI control. (I also don’t think that e.g. MIRI is doing this with their current research, although they are open to solving AI control in a way that only works if there is a singleton.)
every second you take to consider the issue has large costs to you because your AI is falling behind the state of the art in both technology and scale, becoming uncompetitive, so your bargaining power for joining the merger is dropping
In general I think that counterfactual oversight has problems in really low-latency environments. I think the most natural way to avoid them is to synthesize training data in advance. It’s not clear whether that proposal will work.

If your most powerful learners are strong enough to learn good-enough answers to these kinds of philosophical questions, then you only need to provide philosophical input during training, and so synthesizing training data can take the time pressure off. If your most powerful AI is not able to learn how to answer these philosophical questions, then the time pressure seems harder to avoid. In that case, though, it seems quite hard to avoid the time pressure by any mechanism. (Especially if we are better at learning than we would be at hand-coding an algorithm for philosophical deliberation—if we are better at learning and our learner can’t handle philosophy, then we simply aren’t going to be able to build an AI that can handle philosophy.)
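To make the time-pressure point concrete, here is a minimal toy sketch (my own illustration, not a description of ALBA or of any actual system; all class and function names are invented) of the difference between querying the overseer at decision time and querying them only while labeling synthesized training data:

```python
# Toy sketch: oversight on synthesized training data vs. oversight at run time.
# Everything here is a stand-in; a real learner would generalize rather than
# look answers up in a table.


class Overseer:
    """Stand-in for the human overseer; judging a query is slow and costly."""

    def judge(self, query: str) -> str:
        # In reality this is deliberate human judgment, possibly taking days.
        return f"considered answer to: {query}"


class LearnedPolicy:
    """Stand-in for a learner trained to predict the overseer's judgments."""

    def __init__(self) -> None:
        self.examples: dict[str, str] = {}

    def train(self, query: str, label: str) -> None:
        self.examples[query] = label

    def answer(self, query: str) -> str:
        # A real learner would generalize; this toy version looks up or guesses.
        return self.examples.get(query, f"best guess about: {query}")


def train_in_advance(overseer: Overseer, policy: LearnedPolicy,
                     synthesized_queries: list[str]) -> None:
    # All of the human's philosophical input happens here, off the critical
    # path: the overseer can take as long as needed on each synthesized case.
    for q in synthesized_queries:
        policy.train(q, overseer.judge(q))


def act_at_runtime(policy: LearnedPolicy, live_query: str) -> str:
    # No human in the loop at decision time, so no time pressure on them.
    # Whether this is good enough depends entirely on how well the learner
    # generalized from the synthesized training data.
    return policy.answer(live_query)


if __name__ == "__main__":
    overseer, policy = Overseer(), LearnedPolicy()
    train_in_advance(overseer, policy, [
        "is this merger proposal fair under reflective equilibrium?",
        "how much weight should non-human animals get?",
    ])
    print(act_at_runtime(policy, "is this merger proposal fair under reflective equilibrium?"))
```

The point of the contrast is just that, if the learner can in fact learn good-enough answers, the human’s slow deliberation happens during training rather than while a trading partner’s deadline is ticking; if it can’t, no oversight mechanism obviously removes the time pressure.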
One is decisions involved in designing/improving AI systems, like in the scenario above. The other, which I talked about in an earlier comment, is ethical disasters directly caused by people who are not uncertain, but just wrong. You didn’t reply to that comment, so I’m not sure why you’re unconcerned about this category either.
I replied to your earlier comment.
My overall feeling is still that these are separate problems. We can evaluate a solution to AI control, and we can evaluate philosophical work that improves our understanding of potentially-relevant issues (or metaphilosophical work to automate philosophy).
I am both less pessimistic about philosophical errors doing damage, and more optimistic about my scheme’s ability to do philosophy, but it’s not clear to me that either of those is the real disagreement (since if I imagine caring a lot about philosophy and thinking this scheme didn’t help automate philosophy, I would still feel like we were facing two distinct problems).
If an employee sucks at philosophy, how does he even recognize philosophical problems as problems that he needs to consult you for? Most people have little idea that they should feel confused and uncertain about things like epistemology, decision theory, and ethics. I suppose it might be relatively easy to teach an AI to recognize the specific problems that we currently consider to be philosophical, but what about new problems that we don’t yet recognize as problems today?
Is this your reaction if you imagine delegating your affairs to an employee today? Are you making some claim about the projected increase in the importance of these philosophical decisions? Or do you think that a brilliant employee’s lack of metaphilosophical understanding would in fact cause great damage right now?
I would divide the decisions requiring philosophical understanding into two main categories. One is decisions involved in designing/improving AI systems, like in the scenario above...
I agree that AI may increase the stakes for philosophical decisions. One of my points is that a natural argument that it might increase the stakes—by forcing us to lock in an answer to philosophical questions—doesn’t seem to go through if you pursue this approach to AI control. There might be other arguments that building AI systems forces us to lock in important philosophical views, but I am not familiar with those arguments.
I agree there may be other ways in which AI systems increase the stakes for philosophical decisions.
I like the bargaining example. I hadn’t thought about bargaining as a competitive advantage before, and instead had just been thinking about the possible upside (so that the cost of philosophical error was bounded by the damage of using a weaker bargaining scheme). I still don’t feel like this is a big cost, but it’s something I want to think about somewhat more.
If you think there are other examples like this, they might help move my view. On my current model, though, these are just facts that increase my estimate of the importance of philosophical work; I don’t really see them as relevant to AI control per se. (See the sibling comment, which is the better place to discuss that.)
one wrong move would cause me to lose control
I don’t see cases where a philosophical error causes you to lose control, unless you would have some reason to cede control based on philosophical arguments (e.g. in the bargaining case). Failing that, it seems like there is a philosophically simple, apparently adequate notion of “remaining in control” and I would expect to remain in control at least in that sense.
In that case, there would be severe principal-agent problems, given the disparity in power/intelligence between the trainer/AI systems and the users. If I were someone who couldn’t directly control an AI using your scheme, I’d be very concerned about getting uneven trades or having my property expropriated outright by individual AIs or AI conspiracies, or just being ignored and left behind in the race to capture the cosmic commons. I would be really tempted to try another AI design that does purport to have the AI serve my interests directly, even if that scheme is not as “safe”.
Are these worse than the principal-agent problems that exist in any industrialized society? Most humans lack effective control over many important technologies, both in terms of economic productivity and especially military might. (They can’t understand the design of a car they use, they can’t understand the programs they use, they don’t understand what is actually going on with their investments...) It seems like the situation is quite analogous.
Moreover, even if we could build AI in a different way, that doesn’t seem to do anything to address the problem, since the resulting AI would be equally opaque to an end user who isn’t involved in the AI development process. Either way, end users are in some sense at the mercy of the AI developer. I guess this is probably the key point—I don’t understand the qualitative difference between being at the mercy of the software developer on the one hand, and being at the mercy of the software developer + the engineers who help the software run day-to-day on the other. There is a slightly different set of issues for monitoring/law enforcement/compliance/etc., but it doesn’t seem like a huge change.
(Probably the rest of this comment is irrelevant.)
To talk more concretely about mechanisms in a simple example, you might imagine a handful of companies that provide AI software. The people who use this software are essentially at the mercy of the software providers (since for all they know the software they are using will subvert their interests in arbitrary ways, whether or not there is a human involved in the process). In the most extreme case an AI provider could effectively steal all of their users’ wealth. The provider would presumably then face legal consequences, which are not qualitatively changed by the development of AI if the AI control problem is solved. If anything, we would expect the legal system and government to better serve human interests.
We could talk about monitoring/enforcement/etc., but again I don’t see these issues as interestingly different from the current set of issues, or as interestingly dependent on the nature of our AI control techniques. The most interesting change is probably the irrelevance of human labor, which I think is a very interesting issue economically/politically/legally/etc.
I agree with the general point that as technology improves a singleton becomes more likely. I’m agnostic on whether the control mechanisms I describe would be used by a singleton or by a bunch of actors, and as far as I can tell the character of the control problem is essentially the same in either case.
I do think that a singleton is likely eventually. From the perspective of human observers, a singleton will probably be established relatively shortly after wages fall below subsistence (at the latest). This prediction is mostly based on my expectation that political change will accelerate alongside technological change.
I agree with the general point that as technology improves a singleton becomes more likely. I’m agnostic on whether the control mechanisms I describe would be used by a singleton or by a bunch of actors, and as far as I can tell the character of the control problem is essentially the same in either case.
I wonder—are you also relatively indifferent between a hard and a slow takeoff, given sufficient time before the takeoff to develop AI control theory?
(One of the reasons a hard takeoff seems scarier to me is that it is more likely to lead to a singleton, with a higher probability of locking in bad values.)