The trainers are responsible for getting M to do what the trainers want, and the user trusts the trainers to do what the user wants.
In that case, there would be severe principal-agent problems, given the disparity between the power/intelligence of the trainer/AI systems and that of the users. If I were someone who couldn’t directly control an AI using your scheme, I’d be very concerned about getting uneven trades or having my property expropriated outright by individual AIs or AI conspiracies, or just being ignored and left behind in the race to capture the cosmic commons. I would be really tempted to try another AI design that does purport to have the AI serve my interests directly, even if that scheme is not as “safe”.
If I imagine an employee who sucks at philosophy but thinks 100x faster than me, I don’t feel like they are going to fail to understand how to defer to me on philosophical questions.
If an employee sucks at philosophy, how does he even recognize philosophical problems as problems that he needs to consult you for? Most people have little idea that they should feel confused and uncertain about things like epistemology, decision theory, and ethics. I suppose it might be relatively easy to teach an AI to recognize the specific problems that we currently consider to be philosophical, but what about new problems that we don’t yet recognize as problems today?
Aside from that, a bigger concern for me is that if I were supervising your AI, I would be constantly bombarded with philosophical questions that I’d have to answer under time pressure, and I’d be afraid that one wrong move would cause me to lose control, or lock in some wrong idea.
Consider this scenario. Your AI prompts you for guidance because it has received a message from a trading partner with a proposal to merge your AI systems and share resources for greater efficiency and economy of scale. The proposal contains a new AI design and control scheme and arguments that the new design is safer, more efficient, and divides control of the joint AI fairly between the human owners according to your current bargaining power. The message also claims that every second you take to consider the issue has large costs to you because your AI is falling behind the state of the art in both technology and scale, becoming uncompetitive, so your bargaining power for joining the merger is dropping (slowly in the AI’s time-frame, but quickly in yours). Your AI says it can’t find any obvious flaws in the proposal, but it’s not sure that you’d consider the proposal to really be fair under reflective equilibrium or that the new design would preserve your real values in the long run. There are several arguments in the proposal that it doesn’t know how to evaluate, hence the request for guidance. But it also reminds you not to read those arguments directly since they were written by a superintelligent AI and you risk getting mind-hacked if you do.
What do you do? This story ignores the recursive structure in ALBA. I think that would only make the problem even harder, but I could be wrong. If you don’t think it would go like this, let me know how you think this kind of scenario would go.
In terms of your #1, I would divide the decisions requiring philosophical understanding into two main categories. One is decisions involved in designing/improving AI systems, like in the scenario above. The other, which I talked about in an earlier comment, is ethical disasters directly caused by people who are not uncertain, but just wrong. You didn’t reply to that comment, so I’m not sure why you’re unconcerned about this category either.
A general note: I’m not really taking a stand on the importance of a singleton, and I’m open to the possibility that the only way to achieve a good outcome even in the medium-term is to have very good coordination.
A would-be singleton will also need to solve the AI control problem, and I am just as happy to help with that problem as with the version of the AI control problem faced by a whole economy of actors each using their own AI systems.
The main way in which this affects my work is that I don’t want to count on the formation of a singleton to solve the control problem itself.
You could try to work on AI in a way that helps facilitate the formation of a singleton. I don’t think that is really helpful, but moreover it again seems like a separate problem from AI control. (I also don’t think that, e.g., MIRI is doing this with their current research, although they are open to solving AI control in a way that only works if there is a singleton.)
every second you take to consider the issue has large costs to you because your AI is falling behind the state of the art in both technology and scale, becoming uncompetitive, so your bargaining power for joining the merger is dropping
If your most powerful learners are strong enough to learn good-enough answers to these kinds of philosophical questions, then you only need to provide philosophical input during training, and so synthesizing training data can take off the time pressure. If your most powerful AI is not able to learn how to answer these philosophical questions, then the time pressure seems harder to avoid. In that case, though, it seems quite hard to avoid the time pressure by any mechanism. (Especially if we are better at learning than we would be at hand-coding an algorithm for philosophical deliberation: if we are better at learning and our learner can’t handle philosophy, then we simply aren’t going to be able to build an AI that can handle philosophy.)
One is decisions involved in designing/improving AI systems, like in the scenario above. The other, which I talked about in an earlier comment, is ethical disasters directly caused by people who are not uncertain, but just wrong. You didn’t reply to that comment, so I’m not sure why you’re unconcerned about this category either.
I replied to your earlier comment.
My overall feeling is still that these are separate problems. We can evaluate a solution to AI control, and we can evaluate philosophical work that improves our understanding of potentially-relevant issues (or metaphilosophical work to automate philosophy).
I am both less pessimistic about philosophical errors doing damage, and more optimistic about my scheme’s ability to do philosophy, but it’s not clear to me that either of those is the real disagreement (since if I imagine caring a lot about philosophy and thinking this scheme didn’t help automate philosophy, I would still feel like we were facing two distinct problems).
If an employee sucks at philosophy, how does he even recognize philosophical problems as problems that he needs to consult you for? Most people have little idea that they should feel confused and uncertain about things like epistemology, decision theory, and ethics. I suppose it might be relatively easy to teach an AI to recognize the specific problems that we currently consider to be philosophical, but what about new problems that we don’t yet recognize as problems today?
Is this your reaction if you imagine delegating your affairs to an employee today? Are you making some claim about the projected increase in the importance of these philosophical decisions? Or do you think that a brilliant employee’s lack of metaphilosophical understanding would in fact cause great damage right now?
I would divide the decisions requiring philosophical understanding into two main categories. One is decisions involved in designing/improving AI systems, like in the scenario above...
I agree that AI may increase the stakes for philosophical decisions. One of my points is that a natural argument that it might increase the stakes (by forcing us to lock in an answer to philosophical questions) doesn’t seem to go through if you pursue this approach to AI control. There might be other arguments that building AI systems forces us to lock in important philosophical views, but I am not familiar with those arguments.
I agree there may be other ways in which AI systems increase the stakes for philosophical decisions.
I like the bargaining example. I hadn’t thought about bargaining as competitive advantage before, and instead had just been thinking about the possible upside (so that the cost of philosophical error was bounded by the damage of using a weaker bargaining scheme). I still don’t feel like this is a big cost, but it’s something I want to think about somewhat more.
If you think there are other examples like this, that might help move my view. On my current model, these are just facts that increase my estimates for the importance of philosophical work; I don’t really see them as relevant to AI control per se. (See the sibling, which is the better place to discuss that.)
one wrong move would cause me to lose control
I don’t see cases where a philosophical error causes you to lose control, unless you would have some reason to cede control based on philosophical arguments (e.g. in the bargaining case). Failing that, it seems like there is a philosophically simple, apparently adequate notion of “remaining in control” and I would expect to remain in control at least in that sense.
In that case, there would be severe principal-agent problems, given the disparity between the power/intelligence of the trainer/AI systems and that of the users. If I were someone who couldn’t directly control an AI using your scheme, I’d be very concerned about getting uneven trades or having my property expropriated outright by individual AIs or AI conspiracies, or just being ignored and left behind in the race to capture the cosmic commons. I would be really tempted to try another AI design that does purport to have the AI serve my interests directly, even if that scheme is not as “safe”.
Are these worse than the principal-agent problems that exist in any industrialized society? Most humans lack effective control over many important technologies, both in terms of economic productivity and especially military might. (They can’t understand the design of a car they use, they can’t understand the programs they use, they don’t understand what is actually going on with their investments...) It seems like the situation is quite analogous.
Moreover, even if we could build AI in a different way, it doesn’t seem to do anything to address the problem, since it is equally opaque to an end user who isn’t involved in the AI development process. In any case, they are in some sense at the mercy of the AI developer. I guess this is probably the key point—I don’t understand the qualitative difference between being at the mercy of the software developer on the one hand, and being at the mercy of the software developer + the engineers who help the software run day-to-day on the other. There is a slightly different set of issues for monitoring/law enforcement/compliance/etc., but it doesn’t seem like a huge change.
(Probably the rest of this comment is irrelevant.)
To talk more concretely about mechanisms in a simple example, you might imagine a handful of companies who provide AI software. The people who use this software are essentially at the mercy of the software providers (since for all they know the software they are using will subvert their interests in arbitrary ways, whether or not there is a human involved in the process). In the most extreme case an AI provider could effectively steal all of their users’ wealth. They would presumably then face legal consequences, which are not qualitatively changed by the development of AI if the AI control problem is solved. If anything we expect the legal system and government to better serve human interests.
We could talk about monitoring/enforcement/etc., but again I don’t see these issues as interestingly different from the current set of issues, or as interestingly dependent on the nature of our AI control techniques. The most interesting change is probably the irrelevance of human labor, which I think is a very interesting issue economically/politically/legally/etc.
I agree with the general point that as technology improves a singleton becomes more likely. I’m agnostic on whether the control mechanisms I describe would be used by a singleton or by a bunch of actors, and as far as I can tell the character of the control problem is essentially the same in either case.
I do think that a singleton is likely eventually. From the perspective of human observers, a singleton will probably be established relatively shortly after wages fall below subsistence (at the latest). This prediction is mostly based on my expectation that political change will accelerate alongside technological change.
I agree with the general point that as technology improves a singleton becomes more likely. I’m agnostic on whether the control mechanisms I describe would be used by a singleton or by a bunch of actors, and as far as I can tell the character of the control problem is essentially the same in either case.
I wonder: are you also relatively indifferent between a hard and slow takeoff, given sufficient time before the takeoff to develop AI control theory?
(One of the reasons a hard takeoff seems scarier to me is that it is more likely to lead to a singleton, with a higher probability of locking in bad values.)
In general I think that counterfactual oversight has problems in really low-latency environments. I think the most natural way to avoid them is synthesizing training data in advance. It’s not clear whether that proposal will work.