I have been interpreting control to mean “when you use the model for some task, use strategies that will cause the final output to be correct even if you suspect the model may ‘want’ to sabotage it.” Alignment, by contrast, would be making the model “want” to help.
In real life we can “control” prisoners, most of whom we know “want” to escape, but since prisoners don’t have the tools to defeat the various defenses, this usually doesn’t happen. Also, the worst case (the prisoner escapes and commits several murders before being stopped) is an acceptable loss to a civilization so long as pEscape is low.
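One way to make the “acceptable loss” claim precise (the symbols here are my own, purely illustrative): extracting work from controlled agents is worth it as long as

$$\mathbb{E}[\text{loss}] = p_{\text{escape}} \cdot C_{\text{escape}} \ll V_{\text{work}},$$

where $p_{\text{escape}}$ is the probability of a breakout per unit of work extracted, $C_{\text{escape}}$ is the worst-case damage from one breakout, and $V_{\text{work}}$ is the value of that work. The inequality can hold even when $C_{\text{escape}}$ is large, provided $p_{\text{escape}}$ is driven low enough.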
Suppose an escaped prisoner wants to work for you. Can you “control” them and get useful work out of them? Yes. It depends on the task and on what tools you have available, but it is generally possible.
For AI specifically this strategy has been dismissed, I think because many of the people doing the dismissing were imagining somewhat generic future AI and not considering the specific details. For example, one of the strongest methods for control is to carefully filter inputs, ideally removing all unnecessary bits (what Ryan talks about when he writes “paraphrase,” though I bet we can do much better than that) so the model doesn’t know that now is the time to betray. And erase the model’s memory after a task, and keep each task as short as possible: a few hours of wall-clock time total, with the model not thinking between inputs.
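To make that pipeline concrete, here is a minimal sketch, assuming a weak trusted model does the paraphrasing and the strong model is called statelessly. All names here (`trusted_paraphrase`, `untrusted_model`, `run_controlled_task`) are hypothetical placeholders, not any real API:

```python
import time

# Hypothetical sketch only. A real trusted_paraphrase would call a weaker
# model we trust; here it just normalizes whitespace as a stand-in for
# stripping every unnecessary bit from the input.
def trusted_paraphrase(text: str) -> str:
    """Rewrite the input so the untrusted model never sees raw bits
    (dates, usernames, phrasing quirks) that could signal 'betray now'."""
    return " ".join(text.split())  # placeholder for a real paraphraser

def untrusted_model(prompt: str) -> str:
    """One stateless call to the powerful model under control."""
    return f"[model output for: {prompt}]"  # placeholder

MAX_WALL_TIME_S = 3 * 60 * 60  # keep each task to a few hours, total

def run_controlled_task(task_inputs: list[str]) -> list[str]:
    start = time.monotonic()
    outputs = []
    for raw in task_inputs:
        if time.monotonic() - start > MAX_WALL_TIME_S:
            raise TimeoutError("task exceeded its wall-clock budget")
        # 1. Filter inputs: the model only ever sees a paraphrase.
        filtered = trusted_paraphrase(raw)
        # 2. Stateless call: no history carries over between inputs,
        #    so the model cannot think between them.
        outputs.append(untrusted_model(filtered))
    # 3. Memory erasure falls out of statelessness: nothing persists
    #    after this function returns.
    return outputs
```

The point of the paraphrase step is information-theoretic: every bit you strip from the input is a bit the model cannot use to recognize that this particular input is the real deployment rather than a test.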
This specific method is not something you can do with humans, and if you imagine AGIs as being like a human brain but smarter, then this method won’t even be considered.
Theoretically, if most tasks that require the model to have authority are done by “controlled” models, then it’s acceptable for open models to exist, so long as they are unable to send messages that will be perceived by, and used to negotiate with, other AI under human “control.”
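The “unable to send messages” condition could be enforced with the same trick on the egress side. A hypothetical sketch (again, no real API assumed): anything a controlled model writes goes only to allowlisted channels and is paraphrased on the way out, which destroys most of the covert-channel capacity a hidden negotiation would need.

```python
def trusted_paraphrase(text: str) -> str:
    # Same placeholder idea as the input-side sketch: a real version
    # would have a weak trusted model re-roll wording and formatting.
    return " ".join(text.split())

ALLOWED_CHANNELS = {"task_output"}  # nothing model-written goes anywhere else

def send(channel: str, message: str) -> None:
    """The only egress path a controlled model gets (hypothetical)."""
    if channel not in ALLOWED_CHANNELS:
        raise PermissionError(f"controlled model cannot write to {channel!r}")
    # Re-rolling word choice and layout on the way out leaves few spare
    # bits for steganographic coordination with outside (open) models.
    print(trusted_paraphrase(message))  # stand-in for actual delivery
```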
If all the weapons, infrastructure, etc. are in the hands of humans directly, or of models they “control,” open models will be limited to whatever resources whoever runs the AGI is stupid enough to give them. (Running a potentially sabotaged AI downloaded from a torrent sounds like a Darwin Award. That said, Darwin Awards do happen: with lots of different humans, some are dumb sometimes.)
Absolutely. The main thing is that one guy in his garage hooking an open-source AGI up to some hand-built robots that start assembling more of themselves is a totally different scale of threat than a multi-square-kilometer (or cubic-kilometer) optimized factory that is making weapons.
That’s a scale difference of billions of times. And it points to a way to control AI: so long as humans hold a very large resource advantage, ideally billions-fold, achieved with the help of “controlled” AI models, they can defend themselves against superintelligence.
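Back-of-envelope only, with every number invented for illustration: a handful of garage robots against billions of robot-equivalents is roughly nine orders of magnitude, which is the kind of margin being claimed here.

```python
# Illustrative arithmetic only; both quantities are made-up assumptions.
garage_robots = 10            # hand-built self-replicators in one garage
factory_robot_equiv = 10**10  # output of cubic-kilometer automated factories

advantage = factory_robot_equiv / garage_robots
print(f"resource advantage: {advantage:.0e}x")  # 1e+09, i.e. billions-fold
```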
But you won’t get a billion-fold advantage with any AI pauses; you need a crash program to accelerate AI development. You won’t get there without AI either: we don’t have enough living humans, and besides, these large factories won’t make any concessions to humans. No handrails, probably no breathable atmosphere inside, etc.