Good catch, I edited the last points in each part to make the scale of the disaster clearer, and removed the reference to gorillas.
I do think the scale of disaster is smaller (in expectation) in Part 1 than in Part 2, for the reason mentioned here—basically, the systems in Part 1 are somewhat more aligned with human intentions (albeit poorly specified proxies to them), so there’s some chance that they leave humans alone. Whereas Part 2 is a treacherous turn inner alignment failure, where the systems learned arbitrary objectives and so have no incentive at all to keep humans alive.
My own guess is that this is not that far-fetched. Call it a “generic values hypothesis”: human values are enough of a blank-slate thing that the Internet already redundantly imprints everything relevant that humans share. In that case a random AI with values vaguely inspired by learning from the Internet is not much less aligned than a random human. That’s not particularly reassuring (value drift can go far when minds are upgraded without a clear architecture that formulates and preserves values), but it is a reason for some nontrivial chance of settling on a humane attitude to humanity, which wouldn’t just happen on its own, without cause. This possibility gets more remote if values are engineered de novo and don’t start out by channeling a language model.
humans will most likely go extinct
Without the argument this feels alarmist. Humans can manage their own survival if they are not actively exterminated; it would take a massive disruption as a byproduct of AIs’ activities to prevent that. The possibility of such a disruption is grounded in the convergent instrumental value of resource acquisition and the eventual feasibility of megascale engineering, premises that are not necessarily readily apparent.
My own guess is that this is not that far-fetched.
Thanks for writing this out, I found it helpful and it’s updated me a bit towards human extinction not being that far-fetched in the ‘Part 1’ world. Though I do still think that, in this world, humans would almost certainly have very little chance of ever gaining control over our future/trajectory.
Without the argument this feels alarmist
Let me try to spell out the argument a little more—I think my original post was a little unclear. I don’t think the argument actually appeals to the “convergent instrumental value of resource acquisition”. We’re not talking about randomly sampling an objective function for AGI and asking whether it implies resource acquisition for instrumental reasons.
Rather, we’re talking about selecting an objective function for AGI using something like gradient descent on some training objective, and—instead of an aligned objective arising from this process—a resource-acquiring/influence-seeking objective emerges. This is because doing well on the training objective is a good strategy for gaining resources/influence.
Random objectives that aren’t resource/influence-seeking will be selected against by the training process, because they don’t perform well on the training objective.
On this model, the AGI will have a resource-acquiring objective function, and we don’t need to appeal to the convergent instrumental value of resource acquisition.
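The selection story above can be illustrated with a toy sketch (my own illustration, not from the original post; all names and numbers here are made up). Candidate objectives vary in how strongly they seek resources/influence, the training process scores them on a training objective that resource acquisition happens to help with, and selection keeps the best scorers—so resource-seeking objectives win out without any appeal to instrumental convergence:

```python
import random

random.seed(0)

def training_score(resource_seeking):
    # Assumption of the toy model: acquiring resources/influence
    # helps performance on the training objective, plus some noise.
    return resource_seeking + random.gauss(0, 0.1)

# Population of candidate objectives, parameterized only by how
# strongly they seek resources (0 = not at all, 1 = maximally).
population = [random.random() for _ in range(100)]

for _ in range(50):
    # Keep the half that scores best on the training objective,
    # then refill with mutated copies of the survivors.
    population.sort(key=training_score, reverse=True)
    survivors = population[:50]
    population = survivors + [
        min(1.0, max(0.0, s + random.gauss(0, 0.05))) for s in survivors
    ]

avg = sum(population) / len(population)
# Selection drives the average resource-seeking strength toward 1:
# random objectives that don't seek resources are selected against.
```

This is of course only a cartoon of gradient descent as selection pressure, but it shows the shape of the claim: the resource-acquiring disposition ends up terminal in what gets selected, rather than derived instrumentally afterwards.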
I’m curious if this distinction makes sense and seems right to you?
Maybe. But that depends on what exactly the terminal resource-seeking objectives are; it’s not clear that in this story they would go far enough to directly imply dismantling whole planets. On the other hand, dismantling whole planets is instrumentally useful for running experiments into the details of fundamental physics, or for building planet-sized computers or weapons against possible aliens, all to ensure that the objective of gathering strawberries on a particular (small, well-defined) farm proceeds without fail.