If the first AGI is a robotics system trained with RL and given access to the physical world, we’re in significantly worse shape than if we just get a really, really good language model.
That doesn’t seem true at all? A generally intelligent language model sounds like a manipulation machine, which sounds plenty dangerous.
A generally intelligent language model is one whose outputs closely resemble the human-written text in its dataset. A dataset of internet posts and books doesn’t include many examples of successfully manipulating teams of AI researchers, so the model assigns that strategy a low likelihood, even if it might actually be capable of executing it. A language model just outputs a continuation of the query and then stops. This would still be unsafe at ultra-high capabilities because of the risk of mesa-optimizers, but we can control a weakly superhuman language model by placing it in a box and resetting its state for every new question we ask it.
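A minimal sketch of the “box + reset” protocol described above, assuming a hypothetical `generate` function standing in for a real model call: each question is answered from a fresh, empty context, so nothing the model produced earlier can carry over into later answers.

```python
def generate(context: list[str], query: str) -> str:
    # Placeholder for a real language-model inference call.
    return f"answer to {query!r} (context length: {len(context)})"

class StatelessLM:
    """Wraps a model so every question starts from a blank state."""

    def ask(self, query: str) -> str:
        context: list[str] = []           # fresh context on every call
        answer = generate(context, query)
        # `context` is discarded here; nothing persists between queries,
        # so the model cannot build up a plan across interactions.
        return answer

lm = StatelessLM()
a1 = lm.ask("prove lemma 1")
a2 = lm.ask("prove lemma 1")
assert a1 == a2  # identical queries get identical, history-free answers
```

The point of the design is that any manipulation strategy requiring memory of past exchanges is structurally ruled out, independent of how capable the underlying model is.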
Also, detecting manipulation is one of the things human brains are plausibly *exceptionally* good at. We didn’t evolve to solve math or physics problems, but we certainly did evolve to deceive and detect deception in other humans. I expect that an AI with uniformly increasing capabilities across some set of tasks would become able to solve deep math problems much earlier than it would be able to manipulate hostile humans guarded against it.
This all means that a weakly superhuman language model would be a great tool to have, while still not ending the world right away.
In contrast, an open-ended reward maximizer trained with RL and operating on the physical world is a nightmare: it would automatically modify itself to acquire all the capabilities the general language model would have, if it believed it needed them to maximize reward.
What exactly makes it “general”, then? What’s the difference between a general language model and a non-general one?
In some sense current language models are already general, given their wide breadth. The crucial part is being human-level or weakly superhuman: such a model should be able to generate a physics textbook, or generate correct science papers given only the abstract as a prompt. Novel scientific research is where I’d draw the line to define “impactful” language models.