But how will the safe projects exclude the unsafe projects from economies of scale and favorable terms of trade, if the unsafe projects are using the same basic design but just have overseers who care more about capability than safety?
Controlling the distribution of AI technology is one way to make someone’s life harder, but it’s not the only way. If we imagine a productivity gap as small as 1%, it seems like it doesn’t take much to close it.
(Disclaimer: this is unusually wild speculation; nothing I say is likely to be true, but hopefully it gives the flavor.)
If unsafe projects perfectly pretend to be safe projects, then they aren’t being more efficient. So it seems like we can assume that they are observably different from safe projects. (For example, there can’t just be complexity-loving humans who oversee projects exactly as if they had normal values; they need to skimp on oversight in order to actually be more efficient. Or else they need to differ in some other way...) If they are observably different, then possible measures include:
Even very small tax rates, coupled with redistribution that is even marginally better-directed at safe projects (e.g. redistribution that goes to humans).
Regulatory measures to force everyone to incur the overhead, or most of the overhead, of being safe, e.g. lower bounds on human involvement.
Today many trades involve trust and understanding between the parties (e.g. if I go work for you). Probably some trades will retain this character. Honest people may be less happy to trade with those they expect to be malicious. I doubt this would be a huge factor, but 1% seems tiny.
Even in this scenario it may be easy to make technology which is architecturally harder for unsafe projects to use. E.g., it’s not clear whether the end user is the only overseer, or whether some oversight can be retained by law enforcement, the designers, or someone else.
Of course unsafe projects can go to greater lengths in order to avoid these issues, for example by moving to friendlier jurisdictions or operating a black market in unsafe technology. But as these measures become more extreme they become increasingly easy to identify. If unsafe jurisdictions and black markets have only a few percent of the population of the world, then it’s easy to see how they could be less efficient.
(I’d also expect e.g. unsafe jurisdictions to quickly cave under international pressure, if the rents they could extract were a fraction of a percent of total productivity. They could easily be paid off, and if they didn’t want to be paid off, they would not be militarily competitive.)
All of these measures become increasingly implausible at large productivity differentials. And I doubt that any of these particular foreseeable measures will be important. But overall, given that there are economies of scale, I find it very likely that the majority can win. The main question is whether they care enough to.
Normally I am on the other side of a discussion similar to this one, but involving much larger posited productivity gaps and a more confident claim (things are so likely to be OK that it’s not worth worrying about safety). Sorry if you were imagining a very much larger gap, so that this discussion isn’t helpful. And I do agree that there is a real possibility that things won’t be OK, even for small productivity gaps, but I feel like it’s more likely than not to be OK.
Also note that at a 1% gap, we can basically wait it out. If 10% of the world starts out malicious, then by the time the economy has grown 1000x, only about 11% of the world is malicious, and it seems implausible that the AI situation won’t change during that time—certainly contemporary thinking about AI will be obsoleted, over an economic period as long as 0–2015 AD. (The discussion of social coordination is more important in the case where there are larger efficiency gaps, and hence probably larger differences in how the projects look and what technology they need.)
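As a sanity check on that arithmetic, here is a minimal numeric sketch in Python. The modeling choice is mine, not spelled out above: I treat a 1% productivity gap as the malicious share of the economy compounding at a 1% higher exponential growth rate.

```python
# Minimal sketch of the "wait it out" arithmetic (assumptions noted above).
initial_malicious_share = 0.10   # 10% of the world starts out malicious
safe_growth_factor = 1000.0      # the safe economy grows 1000x over the period
gap = 0.01                       # 1% productivity advantage for unsafe projects

# With exponential growth, a 1% higher growth rate over the same period
# means growing by safe_growth_factor ** (1 + gap) instead, roughly 1071x.
unsafe_growth_factor = safe_growth_factor ** (1 + gap)

malicious = initial_malicious_share * unsafe_growth_factor
rest = (1 - initial_malicious_share) * safe_growth_factor
print(malicious / (malicious + rest))   # ~0.106, i.e. roughly the 11% above
```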
ETA: Really the situation is not so straightforward, since 1% more productivity leads to more than 1% more profit; overall this issue really seems too complicated for this kind of vague theoretical speculation to be meaningfully accurate, but I hope I’ve given the basic flavor of my thinking.
And finally, I intended 1% as a relatively conservative estimate. I don’t see any particular reason you need to have so much waste, and I wouldn’t be surprised if it ends up much lower, if future people end up pursuing some strategy along these lines.
1% seems really low to me. Suppose for example that the AI invents a modification to itself, which is meant to improve its performance. A cautious overseer might demand an explanation of the improvement and why it’s safe, in terms that he can understand, while an incautious overseer might be willing to just approve the modification right away and start using it. It seems to me that the cost of developing an understandable and convincing explanation of the improvement and its safety, and then waiting for the overseer to process it, could easily be greater than 1% (or even 100%) of the cost of inventing the improvement itself.
Also, caution/safety is a matter of degree, and it seems hard to define what “unsafe” means, for the purpose of imposing a penalty on all unsafe projects. (As you said, it would be even safer to not use any machine aid at all until much later. Where/how do you draw the line between “safe enough” and “unsafe”?) It also seems hard to tell which projects are skimping on safety from the outside (you can observe that they are advancing faster than expected given their resource base, but how do you rule out other explanations for that?), while from the inside, the workers “in the know” can probably find all kinds of justifications for doing what they are doing.
Really the situation is not so straightforward, since 1% more productivity leads to more than 1% more profit
Yes, I was about to point this out. In many areas the difference between the best and the second best is already the difference between making a profit and making a loss. This will be even more true for AI, since if AIs can be copied, there won’t be many fields where it makes economic sense to hire a 99% efficient AI instead of a copy of a 100% efficient AI.
In many areas the difference between the best and the second best is already the difference between making a profit and making a loss.
Without this issue the “wait it out” solution would seem almost completely satisfactory to me for a gap as small as 1% (or even 5%). Most of my comment was about making unsafe projects less efficient, which (I think) addresses this point.
Also note that a 1% productivity benefit only leads to >>1% more profit if you are trading extensively with the rest of the world, so that most of your revenue is from sales (and the rest of the world can tell you are the most efficient, or at least that you are selling a lot of stuff cheaply). It’s much easier to regulate or make life harder for someone who is trading extensively with the rest of the world than for someone who is operating in secret, so this seems to strengthen the arguments given in the body of my comment.
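To make the “more than 1% more profit” point concrete, here is a toy margin calculation; the price and cost figures are hypothetical numbers of my own choosing, not anything from the discussion above.

```python
# Toy illustration: a 1% productivity edge can mean far more than 1% extra
# profit for a firm selling at a competitive market price, because the edge
# is compared against a thin margin rather than against total revenue.
market_price = 1.00          # price per unit of output
cost_safe = 0.97             # unit cost for a safe project (3% margin)
cost_unsafe = 0.97 / 1.01    # ~0.9604: 1% more output per unit of input

profit_safe = market_price - cost_safe       # 0.03 per unit
profit_unsafe = market_price - cost_unsafe   # ~0.0396 per unit
print(profit_unsafe / profit_safe - 1)       # ~0.32, i.e. ~32% more profit
```

If a project instead mostly consumes its own output rather than selling it, the 1% edge stays roughly a 1% edge, which is the point about trading extensively made above.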
Also, caution/safety is a matter of degree, and it seems hard to define what “unsafe” means, for the purpose of imposing a penalty on all unsafe projects
The idea is just to make the relative cost of safety as low as possible. In the linked post I gave a quantification of safety. So let’s consider some 99.9% safe project, and the level of oversight it requires.
If this oversight is expensive because it involves using some resource (like involving human overseers, or periodically pausing as you wait on an overseer, or whatever), then it would be sufficient to require each project to use that resource, or to provide that resource for free (using tax dollars) to any project, and so on.
Alternatively, if there is some kind of oversight (e.g. periodic high-intensity audits, or AI oversight of AI projects, or mandated close involvement of human auditors) then the goal would just be to ensure that the price of evading detection eats up the efficiency benefits of unsafety. This looks pretty plausible to me, but you may be more skeptical about the feasibility of oversight.
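A rough sketch of the condition being described, under my own formalization and with hypothetical numbers: unsafety stops paying once the expected penalty plus the overhead of hiding exceeds the efficiency gain.

```python
# Unsafety stops paying once the expected penalty plus the overhead of hiding
# exceeds the efficiency gain from skimping on safety. All names and values
# here are hypothetical placeholders, not part of any concrete proposal.
def evasion_is_unprofitable(p_detect, penalty, evasion_overhead, efficiency_gain):
    """All quantities are expressed as fractions of total productivity."""
    return p_detect * penalty + evasion_overhead >= efficiency_gain

# E.g. a 1% efficiency gain from unsafety is not worth it if audits catch
# evasion 10% of the time and cost the evader 10% of productivity when they
# do, even before counting the 0.2% overhead of hiding.
print(evasion_is_unprofitable(p_detect=0.10, penalty=0.10,
                              evasion_overhead=0.002, efficiency_gain=0.01))
# True
```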
A cautious overseer might demand an explanation of the improvement and why it’s safe, in terms that he can understand
In this proposal:
A cautious overseer demands such an argument with very small probability. I’ll write soon about just how small I think this probability can safely be; in the original post I suggested 1%, but I think it could be much lower. If the check is only done with probability 1/10000, then it’s fine if the cost is 10,000% of the cost of implementing the project (see the sketch below).
The human overseer has very little role in the early stages of the evaluation, and certainly she doesn’t have to understand the whole proposal. In my proposal as written she has to relay questions between different AIs, but of course you could automate that completely, so that the first stages of evaluation are just done by other AIs (whose evaluations are accountable to other AIs, whose evaluations are accountable to other AIs… whose evaluations are accountable to the overseer). At worst, the overseer’s role is similar to the arbitrator in this scheme, though she has many additional techniques at her disposal.
If the world is moving fast, the main problem is probably the latency of the human evaluation. But there are a bunch of plausible-seeming techniques for getting to low latency. I hope to write about this soon as well.
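Here is the promised sketch of that arithmetic and of the delegated-evaluation pattern just described. It is schematic only: the names, the chain depth, and the control flow are hypothetical illustrations rather than part of the actual proposal.

```python
import random

CHECK_PROB = 1 / 10_000   # probability that a given action is audited at all
CHAIN_DEPTH = 3           # AI evaluators between the action and the human

def maybe_audit(action, ai_evaluators, human_overseer):
    """Almost always approve directly; rarely escalate through AI evaluators,
    each accountable to the next, with the human only at the very end."""
    if random.random() >= CHECK_PROB:
        return "approved"                 # the overwhelmingly common case
    verdict = action
    for evaluate in ai_evaluators[:CHAIN_DEPTH]:
        verdict = evaluate(verdict)       # AI evaluations of AI evaluations...
    return human_overseer(verdict)        # ...accountable, rarely, to the human

# Expected overhead: even if a full audit costs 100x (i.e. 10,000% of) the
# action itself, CHECK_PROB * 100 == 0.01, i.e. about 1% on average.
```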
ETA: What I should have said is that an overseer requests explanations very rarely during normal, intended operation. But this relies on the AI’s ability to find a policy which the overseer will predictably approve of.
Perhaps your point is that a more cautious overseer should request explanations more often, owing to the AI’s limited ability to predict the overseer’s reaction. But if so, we are going around in circles. What is the nature of these improvements, such that whether they are a good idea or not depends in such a detailed way on the values or philosophical views of the overseer? Why can’t one AI build an improved AI which also follows an innocuous policy like “don’t do anything terrible; acquire resources; let the user control those resources”?
Even if the actual biological overseer doesn’t evaluate a proposal, the AI still has to predict what the overseer’s judgement would be if he or she were to evaluate it, right? That involves modeling the overseer and the other sub-AIs (each of which is also modeling the overseer and other sub-AIs) and coming up with arguments that can convince the modeled overseer/AIs, etc. I don’t see how that could be only 1% less efficient, compared to an incautious overseer who the AI knows probably won’t demand such arguments or consult lots of sub-AIs and would be willing to try out any improvements that aren’t obviously dangerous.
Separately, the “innocuous” AI you talked about suffers from two efficiency-sapping problems: the AI isn’t sure what the overseer’s values are, and the overseer isn’t sure what his own values are. This forces the AI to avoid certain actions that may be highly efficient but potentially ethically disastrous. For example suppose the overseer isn’t sure whether negative feedback received by reinforcement-based agents has negative moral value, so the AI has to avoid building lots of reinforcement-based subagents. Another example might be that the AI faces a lot of everyday ethical questions in the course of acquiring resources, and has to take the latency hit of asking the overseer about them every time. These efficiency-sapping problems are worse for overseers with more complex ethical views, and more uncertain ethical views.
If there is competition, everyone has strong incentives to quickly build “full-fledged FAIs” which can solve these ethical problems and know exactly what they should and shouldn’t do. People who are less cautious will again have an efficiency advantage while doing this. E.g., they might be fine with building a standard utility-maximizing AI based on a crude model of their current understanding of ethics. I do not see how mandatory oversights or other social techniques can prevent this outcome, if you’re imagining a world where your AI design is being used widely. Someone could make a copy of an existing AI based on your design, change the code or configuration files to make themselves the overseer and remove the mandatory oversights, and then ask the AI to make a “full-fledged FAI” for them, and if they happen to be of the incautious type, this will probably result in the kind of crude normative AI mentioned above (or worse, if they approve a bunch of “improvements” that end up subverting their intentions altogether).
Re paragraph 3: it seems like these are mostly considerations that might strengthen your conclusions if we granted that there was a big productivity difference between my design and “a standard utility-maximizing AI based on a crude model of their current understanding of ethics.” But I would already be happy to classify a large productivity loss as a failure, so let’s just concentrate on the claimed productivity loss.
If there is competition, everyone has strong incentives to quickly build “full-fledged FAIs” which can solve these ethical problems and know exactly what they should and shouldn’t do
These incentives only operate if there is a big productivity difference.
Beyond that, if the kinds of issues people run into are “the AI faces a lot of everyday ethical questions in the course of acquiring resources,” then it really seems like what you need is a not-catastrophically-wrong model of human morality, which would probably just be built in mundane ways. I don’t see a strong argument that this would require lots of impressive conceptual progress, rather than being similar-in-kind to building a predictive model of anything else. But I suspect this is just a reflection of the disagreement about paragraph 2, which we should flesh out in the sibling.
Someone could make a copy of an existing AI based on your design, change the code or configuration files to make themselves the overseer and remove the mandatory oversights, and then ask the AI to make a “full-fledged FAI” for them, and if they happen to be of the incautious type, this will probably result in the kind of crude normative AI mentioned above (or worse, if they approve a bunch of “improvements” that end up subverting their intentions altogether).
This doesn’t seem like a very general argument against the possibility of mandatory oversight or technological handicapping, and the measures you describe seem like strawmen. I agree that whatever kind of oversight you employ, it will be possible to subvert it, whatever tax you charge it will be possible to evade it, and so on. But doing so will often come with a cost (as it does today), and it just doesn’t seem that hard to get that cost up to a 1% loss (say). We could talk more about the particular measures that could be taken for oversight; I’m sure we can both imagine many regulatory and technological approaches that would be more annoying to sidestep than an entry in a configuration file, but I suspect our disagreement comes from us imagining different productivity gaps.
The prospect of someone designing their own AI, which is very architecturally different from the rest of the world, just doesn’t seem especially troubling, unless you imagine that the rest of the world is using a significantly handicapped design. (See the first sentence of this reply.)
that involves modeling the overseer and the other sub-AIs (each of which is also modeling the overseer and other sub-AIs) and coming up with arguments that can convince the modeled overseer/AIs, etc
I don’t think this is necessary. Let’s go back to the alien analogy. Let’s assume that you have an action that effectively implements U = “acquire resources; don’t be evil; leave the user in control.” I claimed that pursuing U is not much harder than acquiring resources on its own; for the moment can we grant that, since it seems orthogonal to the issues in this paragraph?
So you have this good action, and you are deciding what to do. I claim that you should just do it, with no extra computation. Then, if it gets evaluated, you will have to think about rhetorical strategies to use when convincing the overseer it’s a good idea. Also, a copy of you will have to think about how to convince the overseer that it’s a bad idea. And the overseer will need to actually think about the issue. And so on.
There are some issues that you’ll need to think about before you commit to the action. For example, if there is a knockdown argument that this action is a bad idea, then you shouldn’t do it. But that thinking seems to be a necessary part of maximizing U, not an extra step (unless the arguing system is broken).
You might be able to do even better by thinking through all of the arguments in advance, and taking an action which is not merely good but which will fare especially well during an evaluation. But if the extra computational costs of doing so are significant, this has to be weighed against the benefits of actually finding a better plan. It seems like this basically comes down to the question “If you take the very most efficient plan, and we perform some elaborate procedure for arguing, can you actually ‘convince’ the overseer that it’s a good plan?”
If the answer is “no” then you are screwed anyway, before we even start thinking about computational overhead—you simply can’t take the best plan, because you wouldn’t be able to receive a good rating if you do. And if the answer is “yes” then the problem seems to go away. The justifiability of good plans does seem to be a key question about the feasibility of this scheme, which I am happy to discuss if it’s the underlying objection you have in mind.
This forces the AI to avoid certain actions that may be highly efficient but potentially ethically disastrous.
If the most efficient actions are ethically disastrous, then we have a fundamental problem which seems to me to be unrelated to AI safety, and which no AI control proposal will address. For example, if the most efficient strategy is to build a virus that kills everyone but you, and an AI is considering this strategy but has to reject it because it is unethical, then we are just out of luck. We could call this a problem with “AI,” but it’s really a problem with biotechnology.
If a certain kind of reinforcement learning is especially efficient but morally unacceptable, then that seems to be the same situation. What are we supposed to do, other than either accept the moral cost or adopt a good enough social solution to overcome the efficiency gap? What kind of solution might you hope to find that would make this kind of problem go away?
If the efficient actions merely might be ethically disastrous, then I guess the cost is supposed to be the time required to clarify the overseer’s values. Which brings us to:
Another example might be that the AI faces a lot of everyday ethical questions in the course of acquiring resources, and has to take the latency hit of asking the overseer about them every time.
The question is just how many distinct questions of this form there are, and how important they are to the AI’s plans. If there were merely a billion such questions, it wouldn’t seem like a big deal at all (though then a significant occupation of humans would be answering moral questions).
Even that strikes me as completely implausible given our experience so far (combined with my inability to see many future examples). If I were the user, and people were trying to optimize values using the range of policies available today, then it seems like they would have had to ask me no more than a dozen or so questions to get things basically right (i.e. realizing much more than 99% of the potential value from my perspective). So this seems to require moral problems to proliferate at a much faster rate than technological problems.
Do you disagree about the importance of hard ethical questions in the situation today (e.g. I am implicitly overlooking many important issues because I’m not used to dealing with an AI), or do you just expect more proliferation in the future?
Also, the problem of predicting human moral judgments doesn’t seem to be radically harder than the problem of e.g. negotiating with humans. I guess this is just another angle on “how many distinct moral questions do you have to answer?” since the real question is how much you can generalize from each answer. I don’t feel like there are that many hard-to-predict parameters before everything reduces to easy-to-predict consequences.