However, that ability to model others' probable internal thought processes, especially if augmented with zk-proof techniques, lets AGIs determine which other AGIs have utility functions most aligned with their own. Even partial success at aligning some of the AGIs with humanity could then establish an attractor, seeding an AGI coalition partially aligned with humanity.
Not a strong ask, but I’ll say I’m interested in what you’re visualizing here if it all goes according to plan, because when I visualize what you say, I’m still imagining the 20 AGI systems immediately killing humanity and dividing up the universe, it’s just now I might like a little bit of the universe they create. But it’s not “they stay in some equilibrium state where human civilization is in charge and using them as services” which I believe is what Mr Drexler is proposing.
The outcome of course depends on the distribution of alignment, but there are now plausible designs that would not kill humanity. For example, an AGI with a human-empowerment utility function would not kill humanity, and that is a statement we can be somewhat confident in because empowerment is crisply defined and death is minimally empowering (that type of AGI may want to change us in undesirable ways, but it would not want to kill us).
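To make the "crisply defined" claim concrete: one standard formalization of empowerment is the channel capacity between an agent's actions and its future states, which bottoms out at zero for an absorbing "dead" state. Below is a minimal Python sketch (my own toy illustration, not anything from the original discussion) that computes one-step empowerment with the Blahut-Arimoto algorithm; the transition matrices are made-up examples.

```python
import numpy as np

def empowerment(transition, iters=200):
    """One-step empowerment of a state: the channel capacity max_p(a) I(A; S')
    of the action -> next-state channel `transition` (shape [n_actions, n_states]),
    estimated with the Blahut-Arimoto algorithm. Returned in nats."""
    n_actions, n_states = transition.shape
    p_a = np.full(n_actions, 1.0 / n_actions)  # start from a uniform action distribution
    for _ in range(iters):
        p_s = p_a @ transition  # marginal distribution over next states
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.where(transition > 0, np.log(transition / p_s), 0.0)
        # Blahut-Arimoto update: p(a) proportional to exp(KL(T(.|a) || p_s))
        p_a = p_a * np.exp((transition * log_ratio).sum(axis=1))
        p_a /= p_a.sum()
    p_s = p_a @ transition
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(transition > 0, np.log(transition / p_s), 0.0)
    return float((p_a[:, None] * transition * log_ratio).sum())

# Toy "alive" state: 3 human actions lead deterministically to 3 distinct futures.
alive = np.eye(3)
# Toy "dead" absorbing state: every action leads to the same place.
dead = np.tile(np.array([[1.0, 0.0, 0.0]]), (3, 1))

print(empowerment(alive))  # ~log(3) ~= 1.10 nats
print(empowerment(dead))   # 0.0
```

The point is just that any state from which the human's actions no longer influence the future (death being the extreme case) scores the minimum possible value, so an agent maximizing human empowerment has no incentive to produce such states.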
There are various value-learning approaches that may eventually diverge and fail, but they tend to diverge later, not immediately.
So I think it's just unrealistic to imagine we'll get 20 different AGI systems none of which is at least partially aligned, especially initially. And if some are partially aligned in different ways, the resulting coalition can be somewhat more aligned than any individual AGI. For example, say AGI 3 wants to preserve humans but eliminate hedonic reward, while AGI 5 wants to preserve humans but increase our hedonic reward; a natural compromise is to preserve humans and leave hedonic reward unchanged.
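As a toy illustration of how that compromise could fall out mechanically (my own sketch, with made-up utility numbers and the agent labels from the example above; not a real negotiation protocol), suppose the two partially aligned agents score candidate policies and the coalition picks the one maximizing the Nash bargaining product:

```python
# Hypothetical utilities each agent assigns to candidate policies (illustrative only).
policies = {
    "preserve humans, remove hedonic reward": {"agi_3": 1.0, "agi_5": 0.1},
    "preserve humans, boost hedonic reward":  {"agi_3": 0.1, "agi_5": 1.0},
    "preserve humans, leave hedonic reward":  {"agi_3": 0.7, "agi_5": 0.7},
    "eliminate humans":                       {"agi_3": 0.0, "agi_5": 0.0},
}

def nash_product(scores):
    # Product of member utilities: the standard Nash bargaining objective.
    prod = 1.0
    for u in scores.values():
        prod *= u
    return prod

best = max(policies, key=lambda p: nash_product(policies[p]))
print(best)  # "preserve humans, leave hedonic reward"
```

A maximin rule over the same numbers picks the same compromise; the shared "preserve humans" component survives because both agents veto any policy that zeroes it out.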
There's also an ensemble robustness bonus from having multiple partially aligned systems: their specific alignment errors are unlikely to overlap.
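A rough way to see the size of that bonus, under the (optimistic) assumption that errors are independent:

```python
# Back-of-envelope sketch with made-up numbers: if each of n partially aligned AGIs
# gets some particular human value wrong with independent probability p, the chance
# the whole coalition shares that same blind spot is p**n.
p, n = 0.3, 5
print(p ** n)  # 0.00243, versus 0.3 for a single system
```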
Agents tend to join coalitions they are aligned with, so the natural outcome is a coalition of the semi-aligned AGIs vs the rest (think Allies vs Axis, democratic allies vs autocratic states), with the semi-aligned coalition hopefully dominating, which then increases the alignment fraction. The end result is hopefully humanity surviving with some variable amount of power, depending on the alignment/power distribution of the semi-aligned AGIs.
If the non-aligned AGI coalition wins, of course, we are more likely doomed; and since its members are internally unaligned and held together only out of necessity, they would just recursively split into warring sub-coalitions until only one is left (as Germany and Japan would ultimately have fought each other had they won WW2, as in The Man in the High Castle).
But no, I don't put much weight on "they stay in some equilibrium state where human civilization is in charge and using them as services". Even if everything favors AI services/tools over agents, eventually you get uploading, and uploads evolve to occupy the niche of agentic AGI.