Suppose that you created an entity which is superhuman in some respects (a task that has already been done many times over) and asked it to find third alternatives.
As far as software is concerned, this flavor of superhumanity does not remotely resemble anything that has already been done. You’re talking about assembling an “entity” capable of answering complex questions at the intersection of physics, philosophy, and human psychology. This is a far cry from the automation of relatively simple, isolated tasks like playing chess or decoding speech—I seriously doubt that any sub-AGI would be up to the task.
The non-software alternatives you mention are even less predictable/controllable than AI, so I don’t see how pursuing those strategies could be any safer than a strictly FAI-based approach. Granted sufficient superhumanity (we can’t precisely anticipate how much we’re granting them), the human components of your “team” would face an enormous temptation to use their power to acquire more power. That risk would need to be weighed against the strategy’s benefits, but the original aim was just the prevention of a sub-optimal singleton! So all we’ve done is step closer to the endgame without knowably improving our board position.
Human teams, or human/software amalgams (like the LessWrong moderation system that we’re part of right now) are routinely superhuman in many ways. LessWrong, considered as a single entity, has superhumanly broad knowledge. It has a fairly short turn-around time for getting somewhat thoughtful answers—possibly more consistently short turn-around time than any of us could manage alone.
An entity such as this one might be highly capable in some narrowly focused ways (indeed, it could be purpose-built for one goal—the goal of reducing the risk of the earth being paperclipped) while being utterly incapable in many other ways, and posing almost no threat to the earth or wider society.
Building a purportedly-Friendly general-purpose recursive self-improving process, on the other hand, means a risk that you’ve created something that will diverge on the Nth self-improvement cycle and become unfriendly. By explicitly going for general-purpose, no-human-dependencies, and indefinitely self-improvable, you’re building in exactly the same elements that you suspect are dangerous.
This is a fairly obvious point that becomes more complicated in a larger scope. Being charitable, you seem to be implying
P(Fail | FAI-attempt) > P(Fail | ~FAI-attempt)
where FAI-attempt means “we built and deployed an AGI that we thought was Friendly”. If our FAI efforts were the only thing that causally affected Fail, then your implication might be correct. But if we take into account all other AGI research, we need more detail. Assume FAI researchers have no inherent advantage over AGI researchers (questionable). Then we basically have

P(Fail | FAI) ≈ P(our Friendliness attempt fails)
P(Fail | ~FAI) ≈ P(AGI | ~FAI) * P(Fail | AGI & ~FAI)

So in these terms, what would it mean for an FAI attempt to be riskier?

P(Fail | FAI) > P(AGI | ~FAI) * P(Fail | AGI & ~FAI)

Eliezer has argued at considerable length that P(Fail | AGI & ~FAI) is very close to 1. So under these assumptions, the odds of a FAI failure must be higher than the odds of non-FAI AGI being created in order to successfully argue that FAI is more dangerous than the alternative. Do you have any objection to my assumptions and derivation, or do you believe that P(Fail | FAI) > P(AGI | ~FAI)?
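To put rough numbers on that comparison, here is a minimal sketch of the two branches; all probabilities are purely illustrative placeholders, not estimates drawn from this exchange:

```python
# Compare the two branches of the argument above.
# Every number here is a made-up placeholder, not an estimate.

p_fai_failure = 0.10      # P(Fail | FAI-attempt): our Friendliness attempt diverges
p_agi_elsewhere = 0.50    # P(AGI | ~FAI): someone else builds AGI if we abstain
p_fail_given_agi = 0.95   # P(Fail | AGI & ~FAI): argued above to be close to 1

risk_if_we_attempt = p_fai_failure
risk_if_we_abstain = p_agi_elsewhere * p_fail_given_agi

print(f"P(Fail | FAI-attempt)  ~ {risk_if_we_attempt:.2f}")
print(f"P(Fail | ~FAI-attempt) ~ {risk_if_we_abstain:.2f}")
if risk_if_we_attempt > risk_if_we_abstain:
    print("Under these numbers, attempting FAI is the riskier option.")
else:
    print("Under these numbers, attempting FAI is the safer option.")
```

Swap in your own estimates; the question is simply which side of the inequality they land on.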
Can you send me a model? I think my objection is to the binariness of the possible strategies node, but I’m not sure how to express that best in your model.
Suppose there are N AGI projects in the world, each of which might almost-succeed and so each of which is an existential risk.
The variable that I can counterfactually control is my actions; the variable that we can counterfactually control is our actions. Since we’re conversing in persuasive dialog, it is reasonable to discuss what strategies we might take to best reduce existential risk.
Suppose that we distinguish between “safety strategies” and “singleton strategies”.
Singleton strategies are explicitly going for fast, general-purpose power and capability, with as many stacks of iterated exponential growth in capability as the recursive self-improvement engineers can manage. It seems obvious to me that if we embarked on a singleton strategy, even with the best of intentions, there are now N+1 AGI projects, each increasing existential risk, and our best intentions might not outweigh that increase.
Safety strategies would involve attempting to create entities (e.g. human teams, human/software amalgams, special-purpose software) which are explicitly limited and very unlikely to be generally powerful compared to the world at large. They would try to decrease existential risk both directly (e.g. build tools for the AGI projects that reduce the chance of the AGI projects going wrong) and indirectly, by not contributing to the problem.
No, sorry, the above comment was just my attempt to explain my objection as unambiguously as possible.
It seems obvious to me that if we embarked on a singleton strategy, even with the best of intentions, there are now N+1 AGI projects, each increasing existential risk, and our best intentions might not outweigh that increase.
Yes, but your “N+1” hides some important detail: Our effective contribution to existential risk diminishes as N grows, while our contribution to safer outcomes stays constant or even grows (in the case that our work has a positive impact on someone else’s “winning” project).
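To illustrate that claim, here is a minimal sketch under a toy assumption that the N projects are independent and equally risky; the per-project probability is a made-up placeholder:

```python
# Toy model: N independent projects, each with the same per-project chance p of a
# catastrophic almost-success. p is purely illustrative, not an estimate.

p = 0.05

def total_risk(n: int, p: float) -> float:
    """Chance that at least one of n independent projects ends catastrophically."""
    return 1.0 - (1.0 - p) ** n

for n in (1, 5, 20, 100):
    marginal = total_risk(n + 1, p) - total_risk(n, p)
    print(f"N={n:3d}: total risk {total_risk(n, p):.3f}, "
          f"extra risk from an (N+1)th project {marginal:.4f}")
```

In this toy model the marginal term is (1-p)^N * p, which shrinks as N grows; that is the sense in which one more project matters less when many already exist.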
I think my objection is to the binariness of the possible strategies node, but I’m not sure how to express that best in your model. [...] They would try to decrease existential risk both directly (e.g. build tools for the AGI projects that reduce the chance of the AGI projects going wrong) and indirectly, by not contributing to the problem.
Since you were making the point that attempting to build Friendly AGI contributes to existential risk, I thought it fair to factor out other actions. The two strategies you outline above are entirely independent, so they should be evaluated separately. I read you as promoting the latter strategy independently when you say:
By explicitly going for general-purpose, no-human-dependencies, and indefinitely self-improvable, you’re building in exactly the same elements that you suspect are dangerous.
The choice under consideration is binary: Attempt a singleton or don’t. Safety strategies may also be worthwhile, but I need a better reason than “they’re working toward the same goal” to view them as relevant to the singleton question.