A cautious overseer might demand an explanation of the improvement and why it’s safe, in terms that he can understand
In this proposal:
A cautious overseer demands such an argument with very small probability. I’ll write soon about just how small I think this probability can safely be; in the original post I suggested 1%, but I think it could be much lower. If the check is only done with probability 1/10,000, then it’s fine if the check costs 10,000% of the cost of implementing the project, since the expected overhead is still only 1% (see the worked arithmetic sketched below).
The human overseer plays very little role in the early stages of the evaluation, and she certainly doesn’t have to understand the whole proposal. In my proposal as written she has to relay questions between different AIs, but of course you could automate that completely, so that the first stages of evaluation are just done by other AIs (whose evaluations are accountable to other AIs, whose evaluations are accountable to other AIs… whose evaluations are accountable to the overseer). At worst, the overseer’s role is similar to the arbitrator in this scheme, though she has many additional techniques at her disposal.
If the world is moving fast, the main problem is probably the latency of the human evaluation. But there are a bunch of plausible-seeming techniques for getting to low latency. I hope to write about this soon as well.
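To make the arithmetic in the first point concrete, here is a minimal sketch in Python. The check probability and the 10,000% check cost are the illustrative figures from that point; normalizing the project cost to 1 is my own assumption.

    # Expected overhead of a rare but expensive evaluation (illustrative figures only).
    project_cost = 1.0        # cost of implementing the project, normalized to 1
    check_probability = 1e-4  # the overseer demands a full argument with probability 1/10,000
    check_cost = 100.0        # a full evaluation costs 10,000% of the project cost

    # The expected overhead is how often the check happens times how much it costs.
    expected_overhead = check_probability * check_cost
    print(f"expected overhead: {expected_overhead:.2%} of the project cost")  # 1.00%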
ETA: What I should have said is that an overseer requests explanations very rarely during normal, intended operation. But this relies on the AI’s ability to find a policy which the overseer will predictably approve of.
Perhaps your point is that a more cautious overseer should request explanations more often, owing to the AI’s limited ability to predict the overseer’s reaction. But if so, we are going around in circles. What is the nature of these improvements, such that whether they are a good idea or not depends in such a detailed way on the values or philosophical views of the overseer? Why can’t one AI build an improved AI which also follows an innocuous policy like “don’t do anything terrible; acquire resources; let the user control those resources”?
Even if the actual biological overseer doesn’t evaluate a proposal, the AI still has to predict what the overseer’s judgement would be if he or she were to evaluate it, right? That involves modeling the overseer and the other sub-AIs (each of which is also modeling the overseer and other sub-AIs) and coming up with arguments that can convince the modeled overseer/AIs, etc. I don’t see how that could be only 1% less efficient, compared to an incautious overseer who the AI knows probably won’t demand such arguments or consult lots of sub-AIs and would be willing to try out any improvements that aren’t obviously dangerous.
Separately, the “innocuous” AI you talked about suffers from two efficiency-sapping problems: the AI isn’t sure what the overseer’s values are, and the overseer isn’t sure what his own values are. This forces the AI to avoid certain actions that may be highly efficient but potentially ethically disastrous. For example, suppose the overseer isn’t sure whether negative feedback received by reinforcement-based agents has negative moral value; then the AI has to avoid building lots of reinforcement-based subagents. Another example might be that the AI faces a lot of everyday ethical questions in the course of acquiring resources, and has to take the latency hit of asking the overseer about them every time. These efficiency-sapping problems are worse for overseers with more complex ethical views, and more uncertain ethical views.
If there is competition, everyone has strong incentives to quickly build “full-fledged FAIs” which can solve these ethical problems and know exactly what they should and shouldn’t do. People who are less cautious will again have an efficiency advantage while doing this. E.g., they might be fine with building a standard utility-maximizing AI based on a crude model of their current understanding of ethics. I do not see how mandatory oversights or other social techniques can prevent this outcome, if you’re imagining a world where your AI design is being used widely. Someone could make a copy of an existing AI based on your design, change the code or configuration files to make themselves the overseer and remove the mandatory oversights, and then ask the AI to make a “full-fledged FAI” for them, and if they happen to be of the incautious type, this will probably result in the kind of crude normative AI mentioned above (or worse, if they approve a bunch of “improvements” that end up subverting their intentions altogether).
Re paragraph 3: it seems like these are mostly considerations that might strengthen your conclusions if we granted that there was a big productivity difference between my design and “a standard utility-maximizing AI based on a crude model of their current understanding of ethics.” But I would already be happy to classify a large productivity loss as a failure, so let’s just concentrate on the claimed productivity loss.
If there is competition, everyone has strong incentives to quickly build “full-fledged FAIs” which can solve these ethical problems and know exactly what they should and shouldn’t do
These incentives only operate if there is a big productivity difference.
Beyond that, if the kinds of issues people run into are “the AI faces a lot of everyday ethical questions in the course of acquiring resources,” then it really seems like what you need is a not-catastrophically-wrong model of human morality, which would probably just be built in mundane ways. I don’t see a strong argument that this would require lots of impressive conceptual progress, rather than being similar-in-kind to building a predictive model of anything else. But I suspect this is just a reflection of the disagreement about paragraph 2, which we should flesh out in the sibling.
Someone could make a copy of an existing AI based on your design, change the code or configuration files to make themselves the overseer and remove the mandatory oversights, and then ask the AI to make a “full-fledged FAI” for them, and if they happen to be of the incautious type, this will probably result in the kind of crude normative AI mentioned above (or worse, if they approve a bunch of “improvements” that end up subverting their intentions altogether).
This doesn’t seem like a very general argument against the possibility of mandatory oversight or technological handicapping, and the measures you describe seem like strawmen. I agree that whatever kind of oversight you employ, it will be possible to subvert it, whatever tax you charge it will be possible to evade it, and so on. But doing so will often come with a cost (as it does today), and it just doesn’t seem that hard to push the cost of evasion up to a 1% loss (say). We could talk more about the particular measures that could be taken for oversight; I’m sure we can both imagine many regulatory and technological approaches that would be more annoying to sidestep than an entry in a configuration file, but I suspect our disagreement comes from us imagining different productivity gaps.
The prospect of someone designing their own AI, which is very architecturally different from the rest of the world, just doesn’t seem especially troubling, unless you imagine that the rest of the world is using a significantly handicapped design. (See the first sentence of this reply.)
Re paragraph 1:
that involves modeling the overseer and the other sub-AIs (each of which is also modeling the overseer and other sub-AIs) and coming up with arguments that can convince the modeled overseer/AIs, etc
I don’t think this is necessary. Let’s go back to the alien analogy. Let’s assume that you have an action that effectively implements U = “acquire resources; don’t be evil; leave the user in control.” I claimed that pursuing U is not much harder than acquiring resources on its own; for the moment can we grant that, since it seems orthogonal to the issues in this paragraph?
So you have this good action, and you are deciding what to do. I claim that you should just do it, with no extra computation. Then, if it gets evaluated, you will have to think about rhetorical strategies to use when convincing the overseer it’s a good idea. Also, a copy of you will have to think about how to convince the overseer that it’s a bad idea. And the overseer will need to actually think about the issue. And so on.
There are some issues that you’ll need to think about before you commit to the action. For example, if there is a knockdown argument that this action is a bad idea, then you shouldn’t do it. But that thinking seems to be a necessary part of maximizing U, not an extra step (unless the arguing system is broken).
You might be able to do even better by thinking through all of the arguments in advance, and taking an action which is not merely good but which will fare especially well during an evaluation. But if the extra computational costs of doing so are significant, this has to be weighed against the benefits of actually finding a better plan. It seems like this basically comes down to the question “If you take the very most efficient plan, and we perform some elaborate procedure for arguing, can you actually ‘convince’ the overseer that it’s a good plan?”
If the answer is “no” then you are screwed anyway, before we even start thinking about computational overhead: you simply can’t take the best plan, because you wouldn’t be able to receive a good rating if you do. And if the answer is “yes” then the problem seems to go away. The justifiability of good plans does seem to be a key question about the feasibility of this scheme, which I am happy to discuss if it’s the underlying objection you have in mind.
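As a minimal sketch of the expected-cost reasoning above (Python; the cost figures and the check probability below are made-up assumptions, not numbers from the proposal), the agent just takes the action, and the adversarial evaluation machinery only runs on the rare occasions when a check is triggered:

    import random

    # Made-up costs in arbitrary compute units, for illustration only.
    ACTION_COST = 1.0         # just taking the good action, with no extra computation
    ADVOCATE_COST = 5.0       # the agent arguing that its action was a good idea
    CRITIC_COST = 5.0         # a copy of the agent arguing that it was a bad idea
    OVERSEER_COST = 10.0      # the overseer actually thinking about the issue
    CHECK_PROBABILITY = 1e-4  # evaluations are triggered only rarely

    def act_and_maybe_get_evaluated() -> float:
        """Take the action; only rarely pay for the full adversarial evaluation."""
        cost = ACTION_COST
        if random.random() < CHECK_PROBABILITY:
            cost += ADVOCATE_COST + CRITIC_COST + OVERSEER_COST
        return cost

    # The expected cost is the action itself plus a small evaluation term.
    expected_cost = ACTION_COST + CHECK_PROBABILITY * (ADVOCATE_COST + CRITIC_COST + OVERSEER_COST)
    print(expected_cost)  # 1.002, i.e. about a 0.2% overhead under these assumptions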
Re paragraph 2:
This forces the AI to avoid certain actions that may be highly efficient but potentially ethically disastrous.
If the most efficient actions are ethically disastrous, then we have a fundamental problem which seems to me to be unrelated to AI safety, and which no AI control proposal will address. For example, if the most efficient strategy is to build a virus that kills everyone but you, and an AI is considering this strategy but has to reject it because it is unethical, then we are just out of luck. We could call this a problem with “AI,” but it’s really a problem with biotechnology.
If a certain kind of reinforcement learning is especially efficient but morally unacceptable, then that seems to be the same situation. What are we supposed to do, other than either accept the moral cost or adopt a good enough social solution to overcome the efficiency gap? What kind of solution might you hope to find that would make this kind of problem go away?
If the efficient actions merely might be ethically disastrous, then I guess the cost is supposed to be the time required to clarify the overseer’s values. Which brings us to:
Another example might be that the AI faces a lot of everyday ethical questions in the course of acquiring resources, and has to take the latency hit of asking the overseer about them every time.
The question is just how many distinct questions of this form there are, and how important they are to the AI’s plans. If there were merely a billion such questions, it wouldn’t seem like a big deal at all (though then a significant occupation of humans would be answering moral questions).
Even that strikes me as completely implausible given our experience so far (combined with my inability to see many future examples). If I were the user, and people were trying to optimize values using the range of policies available today, then it seems like they would have had to ask me no more than a dozen or so questions to get things basically right (i.e. realizing much more than 99% of the potential value from my perspective). So this seems to require moral problems to proliferate at a much faster rate than technological problems.
Do you disagree about the importance of hard ethical questions in the situation today (e.g. I am implicitly overlooking many important issues because I’m not used to dealing with an AI), or do you just expect more proliferation in the future?
Also, the problem of predicting human moral judgments doesn’t seem to be radically harder than the problem of e.g. negotiating with humans. I guess this is just another angle on “how many distinct moral questions do you have to answer?” since the real question is how much you can generalize from each answer. I don’t feel like there are that many hard-to-predict parameters before everything reduces to easy-to-predict consequences.