Good catch, I edited the last points in each part to make the scale of the disaster clearer, and removed the reference to gorillas.
I do think the scale of disaster is smaller (in expectation) in Part 1 than in Part 2, for the reason mentioned here—basically, the systems in Part 1 are somewhat more aligned with human intentions (albeit poorly specified proxies to them), so there’s some chance that they leave humans alone. Whereas Part 2 is a treacherous turn inner alignment failure, where the systems learned arbitrary objectives and so have no incentive at all to keep humans alive.
My own guess is that this is not that far-fetched. Call it a “generic values hypothesis”: human values are enough of a blank-slate thing that the Internet already redundantly imprints everything relevant that humans share. In that case a random AI with values vaguely inspired by learning from the Internet is not much less aligned than a random human. That’s not particularly reassuring (value drift can go far when minds are upgraded without a clear architecture that formulates and preserves values), but it is a reason for some nontrivial chance of settling on a humane attitude to humanity, which wouldn’t just happen on its own, without cause. This possibility gets more remote if values are engineered de novo and don’t start out by channeling a language model.
humans will most likely go extinct
Without the argument this feels alarmist. Humans can manage their own survival if they are not actively exterminated; it would take a massive disruption as a byproduct of AIs’ activities to prevent that. The possibility of such a disruption is grounded in the convergent instrumental value of resource acquisition and the eventual feasibility of megascale engineering, premises that are not necessarily readily apparent.
My own guess is that this is not that far-fetched.
Thanks for writing this out, I found it helpful and it’s updated me a bit towards human extinction not being that far-fetched in the ‘Part 1’ world. Though I do still think that, in this world, humans would almost certainly have very little chance of ever gaining control over our future/trajectory.
Without the argument this feels alarmist
Let me try to spell out the argument a little more—I think my original post was a little unclear. I don’t think the argument actually appeals to the “convergent instrumental value of resource acquisition”. We’re not talking about randomly sampling an objective function for AGI and asking whether it implies resource acquisition for instrumental reasons.
Rather, we’re talking about selecting an objective function for AGI using something like gradient descent on some training objective, and—instead of an aligned objective arising from this process—a resource-acquiring/influence-seeking objective emerges. This is because doing well on the training objective is a good strategy for gaining resources/influence.
Random objectives that aren’t resource/influence-seeking will be selected against by the training process, because they don’t perform well on the training objective.
On this model, the AGI will have a resource-acquiring objective function, and we don’t need to appeal to the convergent instrumental value of resource acquisition.
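The selection story above can be illustrated with a toy sketch (my own illustration, not from the original post; all names and numbers here are made up). Candidate objectives vary in how strongly they seek resources/influence, the training process scores them on a training objective that resource acquisition happens to help with, and selection keeps the best scorers—so resource-seeking objectives win out without any appeal to instrumental convergence:

```python
import random

random.seed(0)

def training_score(resource_seeking):
    # Assumption of the toy model: acquiring resources/influence
    # helps performance on the training objective, plus some noise.
    return resource_seeking + random.gauss(0, 0.1)

# Population of candidate objectives, parameterized only by how
# strongly they seek resources (0 = not at all, 1 = maximally).
population = [random.random() for _ in range(100)]

for _ in range(50):
    # Keep the half that scores best on the training objective,
    # then refill with mutated copies of the survivors.
    population.sort(key=training_score, reverse=True)
    survivors = population[:50]
    population = survivors + [
        min(1.0, max(0.0, s + random.gauss(0, 0.05))) for s in survivors
    ]

avg = sum(population) / len(population)
# Selection drives the average resource-seeking strength toward 1:
# random objectives that don't seek resources are selected against.
```

This is of course only a cartoon of gradient descent as selection pressure, but it shows the shape of the claim: the resource-acquiring disposition ends up terminal in what gets selected, rather than derived instrumentally afterwards.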
I’m curious if this distinction makes sense and seems right to you?
Maybe. But that depends on what exactly the terminal resource-seeking objectives are; it’s not clear that in this story they would go far enough to directly imply dismantling whole planets. On the other hand, dismantling whole planets is instrumentally useful for running experiments into the details of fundamental physics, or for building planet-sized computers or weapons against possible aliens, all to ensure that the objective of gathering strawberries on a particular (small, well-defined) farm proceeds without fail.