I really like how you’ve laid out a spectrum of AIs, from input-imitators to world-optimizers. At some point I had a hope that world-optimizer AIs would be too slow to train for the real world, and we’d live for a while with input-imitator AIs that get more and more capable but still stay docile.
But the trouble is, I can think of plausible paths from input-imitator to world-optimizer. For example, if you can make an AI imitate a conversation between humans, then maybe you can make an AI that makes real-world plans as fast as a committee of 10 smart humans conversing at 1000x speed. For extra fun, allow the imitated committee to send network packets and read the responses; for extra extra fun, give them access to a workbench for improving their own AI. I’d say this gets awfully close to a world-optimizer that could plausibly defeat the rest of humanity, if the imitator it’s running on is good enough (GPT-6 or something). And there’s of course no law saying it’ll be friendly: you could prompt the inner humans with “you want to destroy real humanity” and watch the fireworks.
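To make that concrete, here’s a minimal sketch of the kind of scaffolding I mean. Everything in it is my own illustrative invention: `imitate()` is a hypothetical stand-in for whatever text imitator you have, the `FETCH` convention is made up, and the only real-world side effect is a single `requests.get` call. It’s meant to show how little glue code separates “imitates a conversation” from “acts on the world”, not to be a working agent.

```python
# Illustrative sketch only: wrapping a text imitator in a "committee" loop
# that can touch the real world. Nothing here is a real agent framework.
import requests  # the only source of real-world side effects in this sketch


def imitate(transcript: str, speaker: str) -> str:
    """Hypothetical stand-in: return the next utterance the imitator
    predicts `speaker` would add to `transcript`."""
    raise NotImplementedError("plug in a sufficiently good text imitator")


# The "10 smart humans" being imitated.
COMMITTEE = [f"Expert {i}" for i in range(1, 11)]


def run_committee(goal: str, rounds: int = 5) -> str:
    transcript = f"Goal: {goal}\n"
    for _ in range(rounds):
        for speaker in COMMITTEE:
            utterance = imitate(transcript, speaker)
            transcript += f"{speaker}: {utterance}\n"
            # The step that moves this from input-imitation toward
            # world-optimization: let the imitated committee read the web.
            if utterance.startswith("FETCH "):
                url = utterance.removeprefix("FETCH ").strip()
                body = requests.get(url, timeout=10).text[:2000]
                transcript += f"[network response]: {body}\n"
    return transcript
```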
Yup, agreed. Understanding and successfully applying these concepts are necessary for one path to safety, but not sufficient. Even a predictive model with zero instrumentality and no misaligned internal mesaoptimizers could still yield oopsies in relatively few steps.
I view it as an attempt to build a foundation: the ideal predictive model isn’t actively adversarial, isn’t obscuring the meaning of its weights (because doing so would be instrumental to some other goal), and so on. Something like this seems necessary for non-godzilla interpretability to work, and it at least admits the possibility that we could find some use that doesn’t naturally drift into an amplified version of “I have been a good bing” or whatever else. I’m not super optimistic about finding a version of this path that’s also resistant to the “and some company takes off the safeties three weeks later” problem, but at least I can’t state that it’s impossible yet!
Your scenario seems to suggest that dense real-world feedback at human speeds (i.e., compute surveillance) and decentralisation (primarily of the internet: a rogue AI shouldn’t be able to replicate itself within minutes across thousands of servers around the globe) should serve as countermeasures.