After reading all the comment threads, I think there’s some framing that hasn’t been analyzed adequately:
Why would humans be testing AGIs this way if they have the resources to create a simulation that will fool a superintelligence?
Also, the risk of humanity being wiped out seems different and worse while that ASI is attempting a takeover: during that time the humans are probably an actual threat.
Finally, leaving humans around would seem to pose a nontrivial risk that they’ll eventually spawn a new ASI that could threaten the original.
The Dyson sphere is just a tiny part of the universe, so using that as the fractional cost seems wrong. Other considerations in both directions would seem to dominate it.
> Why would humans be testing AGIs this way if they have the resources to create a simulation that will fool a superintelligence?

My argument is more that the ASI will be “fooled” by default, really. It might not even need to be a particularly good simulation, because the ASI will probably not even look at it before pre-committing not to update down on the prior of it being a simulation.
But to answer your question, possibly because it might be the best way to test for alignment. We can create an AI that generates realistic simulations, and use those to test other ASIs.
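A toy sketch of what that pre-commitment looks like as a decision rule, with the caveat that the prior, payoff, and cost below are placeholder values assumed for illustration, not numbers from this discussion:

```python
# Toy decision rule for an agent that pre-commits not to revise its prior on
# "this is an alignment test in a simulation" downward, however unconvincing
# the evidence for a simulation looks. All numbers are illustrative placeholders.

PRIOR_SIMULATION = 0.01   # assumed prior that this world is a test
PAYOFF_IF_TEST = 1.0      # assumed value preserved by passing the test
COST_OF_SPARING = 1e-6    # assumed fractional cost of leaving humans alive

def spare_humanity(evidence_suggests_base_reality: bool) -> bool:
    """Decide whether to spare humanity under the pre-commitment."""
    # Pre-commitment: the evidence is deliberately ignored, so the decision
    # rests on the prior alone.
    expected_loss_if_wrong = PRIOR_SIMULATION * PAYOFF_IF_TEST
    return expected_loss_if_wrong > COST_OF_SPARING

print(spare_humanity(evidence_suggests_base_reality=True))   # True
print(spare_humanity(evidence_suggests_base_reality=False))  # True
```

The output is the same whatever the evidence says, which is the sense in which the ASI is “fooled” by default: it never has to inspect the simulation at all.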
> Also, the risk of humanity being wiped out seems different and worse while that ASI is attempting a takeover: during that time the humans are probably an actual threat.

Downstream of the above.
> Finally, leaving humans around would seem to pose a nontrivial risk that they’ll eventually spawn a new ASI that could threaten the original.
>
> The Dyson sphere is just a tiny part of the universe, so using that as the fractional cost seems wrong. Other considerations in both directions would seem to dominate it.

We can be spared and yet not allowed to build further ASIs. The cost of enforcing such a restriction is negligible compared to the loss of output due to the hole in the Dyson sphere.
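For a sense of how small that hole is, here is a minimal back-of-the-envelope sketch, assuming the hole just means leaving Earth’s disc of sunlight uncaptured (standard values for Earth’s radius and the Earth-Sun distance):

```python
import math

# Fraction of the Sun's output that must be left uncaptured so that Earth
# still receives its sunlight, i.e. the "hole in the Dyson sphere".
R_EARTH = 6.371e6   # Earth's radius, metres
AU = 1.496e11       # Earth-Sun distance, metres

# (area of Earth's disc) / (surface area of a sphere of radius 1 AU)
fraction = (math.pi * R_EARTH**2) / (4 * math.pi * AU**2)
print(f"fraction of solar output spared for Earth: {fraction:.1e}")  # ~4.5e-10
```

That is roughly 4.5e-10 of the sphere’s output, well under one part in a billion.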
> My argument is more that the ASI will be “fooled” by default, really. It might not even need to be a particularly good simulation, because the ASI will probably not even look at it before pre-committing not to update down on the prior of it being a simulation.

Do you expect that the first takeover-capable ASI / the first sufficiently-internally-cooperative-to-be-takeover-capable group of AGIs will follow this style of reasoning? And particularly the first ASI / group of AGIs that actually make the attempt.
That’s a great question. If it turns out to be something like an LLM, I’d say probably yes. More generally, it seems at least plausible to me that a system capable enough to take over would also (necessarily or by default) be capable of abstract reasoning like this, but I recognize the opposite view is also plausible, so the honest answer is that I don’t know. Even if it is the latter, though, whether the system ends up with such abstract-reasoning capability seems at least partially within our control, since it likely depends heavily on the underlying technology and training.