My own guess is that this is not that far-fetched.
Thanks for writing this out, I found it helpful and it’s updated me a bit towards human extinction not being that far-fetched in the ‘Part 1’ world. Though I do still think that, in this world, humans would almost certainly have very little chance of ever gaining control over our future/trajectory.
Without the argument, this feels alarmist.
Let me try to spell out the argument a little more, since I think my original post was a little unclear. I don’t think the argument actually appeals to the “convergent instrumental value of resource acquisition”. We’re not talking about randomly sampling an objective function for AGI and asking whether it implies resource acquisition for instrumental reasons.
Rather, we’re talking about selecting an objective function for AGI using something like gradient descent on some training objective, where a resource-acquiring/influence-seeking objective emerges from this process instead of an aligned one. This happens because doing well on the training objective is a good strategy for gaining resources/influence.
Random objectives that aren’t resource/influence-seeking will be selected against by the training process, because they don’t perform well on the training objective.
On this model, the AGI will have a resource-acquiring objective function, and we don’t need to appeal to the convergent instrumental value of resource acquisition.
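To make the selection story concrete, here’s a purely illustrative toy sketch (my own, not a model of any real training setup; the objective labels and the training_score stand-in are invented for illustration): candidates are kept or discarded solely by their score on a training objective, and an influence-seeking objective that does the task as a strategy survives just as well as an aligned one, while random objectives get selected out.

```python
# Toy illustration only: candidate "objectives" are selected purely by score on
# a training objective. An objective that pursues influence happens to score
# well on training too, so selection keeps it, even though the training signal
# never asked for it.
import random

random.seed(0)

def training_score(objective):
    """Hypothetical stand-in for 'does well on the training objective'."""
    if objective == "aligned":
        return 0.9 + random.gauss(0, 0.05)   # genuinely tries to do the task
    if objective == "influence-seeking":
        return 0.9 + random.gauss(0, 0.05)   # does the task as a strategy for gaining influence
    return 0.2 + random.gauss(0, 0.05)       # random objectives: no reason to do the task well

# Start with mostly random objectives plus a few of each "coherent" kind.
population = ["random"] * 90 + ["aligned"] * 5 + ["influence-seeking"] * 5

for generation in range(20):
    # Keep the top half by training score, then duplicate survivors to refill.
    ranked = sorted(population, key=training_score, reverse=True)
    survivors = ranked[: len(ranked) // 2]
    population = survivors * 2

print({kind: population.count(kind) for kind in set(population)})
# Random objectives disappear; the survivors are whatever performs well on
# training, and nothing in the selection signal distinguishes influence-seeking
# survivors from aligned ones.
```

The only point of the sketch is that the selection pressure itself never distinguishes aligned from influence-seeking objectives, which is the sense in which no separate appeal to instrumental convergence is needed.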
I’m curious whether this distinction makes sense and seems right to you.
Maybe. But that depends on what exactly the terminal resource-seeking objectives are; it’s not clear that, in this story, they would go far enough that we could directly talk of dismantling whole planets. On the other hand, dismantling whole planets is instrumentally useful for running experiments into the details of fundamental physics, or for building planet-sized computers or weapons against possible aliens, all to ensure that the objective of gathering strawberries on a particular (small, well-defined) farm proceeds without fail.