I don’t think that design (1) is particularly safe.
If your claim that design (1) is harder to get working is true, then you get a small amount of safety from the fact that a design that isn’t doing anything is safe.
It depends on what the set of questions is, but if you want it to reliably answer questions like “how do I get from here to the bank?” then it needs a map, and some sort of pathfinding algorithm encoded in it somehow. If it can answer “what would a good advertising slogan be for product X?” then it needs a model that includes human psychology and business, and it needs to be able to pursue long-term goals like maximising profit. That is getting into dangerous territory.
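To make that concrete, here is a toy sketch of the kind of machinery even the directions case implies: a map (just a made-up weighted graph here) plus a pathfinding algorithm (plain Dijkstra, standing in for whatever a real system would actually encode).

```python
import heapq

# Hypothetical street map (entirely made up): node -> {neighbour: travel time in minutes}.
street_map = {
    "home":    {"high_st": 4, "park": 6},
    "high_st": {"home": 4, "bank": 3},
    "park":    {"home": 6, "bank": 7},
    "bank":    {"high_st": 3, "park": 7},
}

def shortest_route(graph, start, goal):
    """Dijkstra's algorithm: return (total_cost, path) or None if unreachable."""
    frontier = [(0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, step in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(frontier, (cost + step, neighbour, path + [neighbour]))
    return None

print(shortest_route(street_map, "home", "bank"))  # (7, ['home', 'high_st', 'bank'])
```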
A system trained purely to imitate humans might be limited to human levels of competence, and so not be too dangerous. But given that humans are more competent at some tasks than at others, and that competence varies between humans, the AI might contain a competence chooser, which estimates how good an answer a human would produce for a given task, and an optimiser module that can pursue a goal at a chosen level of competence. Of course, you aren’t training for anything above top human-level competence, so whether the optimiser carries on working when asked for superhuman competence depends on the inductive bias.
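Purely as an illustration of that hypothesised decomposition (every name and number below is invented; this is not a claim about how a trained imitation model is really organised):

```python
def estimated_human_competence(task: str) -> int:
    # "Competence chooser": pretend lookup of how many candidates a skilled
    # human effectively considers for this kind of task (made-up figures).
    return {"advertising_slogan": 50, "directions": 200}.get(task, 20)

def optimise(generate, score, competence: int):
    # Optimiser with an explicit competence dial: examine `competence`
    # candidates and keep the best. The safety question is whether this keeps
    # behaving sensibly when the dial is pushed far beyond anything seen in training.
    return max((generate(i) for i in range(competence)), key=score)

def imitate_human(task: str, generate, score):
    # Imitation-trained behaviour: pick a human-level competence for the task,
    # then hand exactly that budget to the optimiser.
    return optimise(generate, score, estimated_human_competence(task))

# Toy usage: "candidates" are just strings, scored by length.
print(imitate_human("advertising_slogan", lambda i: f"slogan {i}", len))
```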
And if humans are unusually bad at X, then superhuman performance on X could be trained indirectly, by training the general optimiser on A,B,C…, which humans are better at. If humans can apply 10 units of optimisation power to problems A,B,C… and we train the AI on human answers, we might train it to apply 10 units of optimisation power to arbitrary problems. If humans can only produce 2 units of optimisation on problem X, then the AI’s 10 units on X is superhuman performance at that problem.
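One way to put rough numbers on this, if you are willing to treat a “unit” of optimisation power as a bit of selection pressure (reliably picking the best of 2^n candidates), which is my own gloss rather than anything precise:

```python
# The figures are the made-up ones from the argument above, not measurements.
human_bits = {"A": 10, "B": 10, "C": 10, "X": 2}  # what humans manage per task
ai_bits = 10  # general optimiser trained to match humans on A, B, C

for task, human in human_bits.items():
    verdict = "superhuman" if ai_bits > human else "roughly human-level"
    print(f"task {task}: human best-of-{2**human}, AI best-of-{2**ai_bits} -> {verdict}")
```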
To me, this design space feels like a set of Heath Robinson contraptions that each contain several lumps of enriched uranium. If you just run one, you might be lucky, and the dangerous parts might avoid coming together in just the wrong way. You might be able to find a particular design in which you can prove that the lumps of uranium never get near each other. But all the pieces needed for something to go badly wrong are there.
I agree. I’m generally okay with the ordering (oracles do seem marginally safer than agents, for example, and more restrictions should generally be safer than fewer), but I also think the marginal amount of additional safety doesn’t matter much when you consider the total absolute risk. Just to make up some numbers, I think of it like choosing between options that are 99.6%, 99.7%, 99.8%, and 99.9% likely to result in disaster. Of course I’ll pick the one with a 0.4% chance of success, but I’d much rather do something radically different that is orders of magnitude safer.
Yeah, so I guess opinions on this will differ depending on how likely people think existential risk from AGI is. Personally, it’s clear to me that agentic misaligned superintelligences are bad news, but I’m much less persuaded by descriptions of how long-term maximising behaviour arises in something like an oracle. The prospect of an AGI that’s much more intelligent than humans and much less agentic seems quite plausible to me, even, perhaps, in an RL agent.