I really like how you’ve laid out a spectrum of AIs, from input-imitators to world-optimizers. At some point I had a hope that world-optimizer AIs would be too slow to train for the real world, and we’d live for a while with input-imitator AIs that get more and more capable but still stay docile.
But the trouble is, I can think of plausible paths from input-imitator to world-optimizer. For example, if you can make an AI imitate a conversation between humans, then maybe you can make an AI that makes real-world plans as fast as a committee of 10 smart humans conversing at 1000x speed. For extra fun, allow the imitated committee to send network packets and read the responses; for extra extra fun, give them access to a workbench for improving their own AI. I’d say this gets awfully close to a world-optimizer that could plausibly defeat the rest of humanity, if the imitator it’s running on is good enough (GPT-6 or something). And there’s of course no law saying it’ll be friendly: you could prompt the inner humans with “you want to destroy real humanity” and watch the fireworks.
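To make that concrete, here’s a minimal sketch of the kind of scaffolding I mean. Everything in it is my own illustrative invention: `imitate()` is a hypothetical stand-in for whatever text imitator you have, the `FETCH` convention is made up, and the only real-world side effect is a single `requests.get` call. It’s meant to show how little glue code separates “imitates a conversation” from “acts on the world”, not to be a working agent.

```python
# Illustrative sketch only: wrapping a text imitator in a "committee" loop
# that can touch the real world. Nothing here is a real agent framework.
import requests  # the only source of real-world side effects in this sketch


def imitate(transcript: str, speaker: str) -> str:
    """Hypothetical stand-in: return the next utterance the imitator
    predicts `speaker` would add to `transcript`."""
    raise NotImplementedError("plug in a sufficiently good text imitator")


# The "10 smart humans" being imitated.
COMMITTEE = [f"Expert {i}" for i in range(1, 11)]


def run_committee(goal: str, rounds: int = 5) -> str:
    transcript = f"Goal: {goal}\n"
    for _ in range(rounds):
        for speaker in COMMITTEE:
            utterance = imitate(transcript, speaker)
            transcript += f"{speaker}: {utterance}\n"
            # The step that moves this from input-imitation toward
            # world-optimization: let the imitated committee read the web.
            if utterance.startswith("FETCH "):
                url = utterance.removeprefix("FETCH ").strip()
                body = requests.get(url, timeout=10).text[:2000]
                transcript += f"[network response]: {body}\n"
    return transcript
```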
Yup, agreed. Understanding and successfully applying these concepts are necessary for one path to safety, but not sufficient. Even a predictive model with zero instrumentality and no misaligned internal mesaoptimizers could still yield oopsies in relatively few steps.
I view it as an attempt to build a foundation: the ideal predictive model isn’t actively adversarial, isn’t obscuring the meaning of its weights (because doing so would be instrumental to some other goal), and so on. Something like this seems necessary for non-godzilla interpretability to work, and it at least admits the possibility that we could find some use that doesn’t naturally drift into an amplified version of “I have been a good bing” or whatever else. I’m not super optimistic about finding a version of this path that’s also resistant to the “and some company takes off the safeties three weeks later” problem, but at least I can’t state that it’s impossible yet!
Your scenario seems to suggest that dense real-world feedback at human speeds (i.e., compute surveillance) and decentralisation (primarily of the internet: a rogue AI shouldn’t be able to replicate itself within minutes across thousands of servers around the globe) should serve as countermeasures.