How do you factor in “humans have access to many AGIs, some unable to betray” into this model?
So it’s not (humans) vs (AGI); it’s (humans + resources * (safeAGI + traitors)) vs (resources * (traitor unrestricted AGI)).
I put the traitors on both sides of the equation because I would assume some models that plan to defect later may help humans in a net-positive way while they wait for their opportunity to betray. (And with careful restrictions on inputs, this opportunity might never occur. Most humans are in this category: most accountants might steal if the money were cash with minimal controls.)
Since you are worried about doom, I assume you would say the “safe” AGI had to be essentially lobotomized to make it safe: extremely sparse and distilled down from the unsafe models that escaped, so that it lacks the computational resources and memory to plan a betrayal. It is much less capable.
Still, this simplifies to:
(resources_humans * (safeAGI + traitors)) vs (resources_stolen * (traitor unrestricted AGI)).
If the AGIs currently on team human have substantially more resources than the traitor faction, enough to compensate for being much less capable, this is stable. It’s like the current world plus one more way for everyone to die if the stability is lost.
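To make that balance concrete, here is a minimal toy sketch. Every number, the variable names, and the linear resources-times-capability weighting are assumptions I am inventing for illustration, not anything from the dialogue; the stability condition is just one side’s product exceeding the other’s.

```python
# Toy model of the balance sketched above. All numbers, and the linear
# "resources * capability" weighting, are invented assumptions for
# illustration, not measured quantities.

def effective_power(resources: float, capability: float) -> float:
    """Crude proxy: a side's strength scales with resources times capability."""
    return resources * capability

# Team human: large resource base, but the safe (distilled) AGIs are less capable.
resources_humans = 100.0       # e.g. share of world compute and industry
capability_safe_agi = 0.3      # lobotomized / distilled models

# Traitor faction: whatever resources the escaped models managed to steal.
resources_stolen = 5.0
capability_unrestricted = 1.0  # full-strength escaped models

team_human = effective_power(resources_humans, capability_safe_agi)
traitor_faction = effective_power(resources_stolen, capability_unrestricted)

print(f"team human: {team_human}, traitor faction: {traitor_faction}")
print("stable" if team_human > traitor_faction else "unstable")
```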
And this suggests a way that might work to escape this trap: if there are a lot of safe models of diverse origin, it is unlikely that they will be able to betray in a coordinated manner or fail in the same way, so humans can counter whatever weapon the traitor AGIs have with their own.
This is also the problem with a failed AI pause, where only one unethical actor makes an AGI, everyone else in the world pauses, and the machine gets out of control.
That could end up being:
(Humans) vs (resources * (traitor unrestricted AGI)). This is the catastrophic failure scenario: in it, the ethical actors who honored the AI pause doomed humanity.
In more concrete terms, I just imagine the best weapon we humans know about now—small drones with onboard ML models—in vast swarms. The counter to this is essentially the same thing used to counter carrier aircraft—you need more drones, some designed for intercept, but it’s mostly a battle of offense: whose swarm can reach the factories and data centers and drone launching platforms and bomb them first.
This probably means that if you don’t have AI strong enough to automate the manufacture of the drones and the robots that make the drones, and all the parts for the robots including the ICs, and then all the parts to make those machines, you aren’t even in the contest. You lose automatically.
I’d be significantly more optimistic if I thought that humans would have access to many AGIs, some unable to betray. (well more specifically: some genuinely always honest and helpful even about stuff like AGI takeover) Instead I think that the cohort of most-powerful-AGIs-in-the-world will at some point be entirely misaligned & adversarial. (After all, they’ll probably all be copies of the same AGI, or at least fine-tunes of the same base model)
Daniel, you proposed in a dialogue a large number of ultrafast AGIs serving as AI researchers.
If you think about it, each underlying AI model you are trying to improve is a coordinate in the possibility space of all models, and you then have your researcher AGIs attempt to find an improvement from that starting point.
This will get stuck at local optima. To improve your odds of finding the strongest model that current compute can support, you would want to run this RSI search from a diverse league of many starting locations. I can draw you a plot if it helps.
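To make the local-optima point concrete, here is a toy sketch. The one-dimensional stand-in for model space, the landscape function, and every constant are made up for the example; it is not a claim about real architectures, only an illustration of why diverse starting points help.

```python
# Illustrative toy only: a 1-D stand-in for "model space", showing why a
# greedy search from a single start point can get stuck on a local optimum
# while a league of diverse starting points does better. The landscape
# function and all constants are invented for this example.
import random
from math import sin

def capability(x: float) -> float:
    """Assumed capability landscape with several local optima."""
    return sin(3 * x) + 0.3 * sin(7 * x) - 0.05 * (x - 5) ** 2

def hill_climb(start: float, steps: int = 2000, step_size: float = 0.05) -> float:
    """Greedy local search: stands in for one RSI effort from one base model."""
    x = start
    for _ in range(steps):
        candidate = x + random.uniform(-step_size, step_size)
        if capability(candidate) > capability(x):
            x = candidate
    return x

random.seed(0)
single_lineage = hill_climb(0.0)                                    # one starting model
league = [hill_climb(random.uniform(0.0, 10.0)) for _ in range(8)]  # diverse starts
best_of_league = max(league, key=capability)

print(f"single start:   capability {capability(single_lineage):.3f}")
print(f"best of league: capability {capability(best_of_league):.3f}")
```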
The historical equivalent is how the Manhattan Project invested in and optimized three entire pipelines to obtain fissionables (calutrons, gaseous diffusion, and plutonium production reactors). The reason was that they didn’t know how far each pipeline would scale when optimized; it was possible that any one effort could hit a wall. For AI, we don’t know how far LLMs will scale, or variants that use an entirely different underlying network architecture, or sparse spiking networks, etc. It is possible that any popular method will saturate at greater scales.
One of the benefits of using AGI to advance AI: there are humans here who have invested their entire careers solely in mastering transformers. When transformers become obsolete, those humans will have difficulty shifting to the new best method, whereas an AGI could be fine-tuned on all the information that exists on the new method in a week.
Completely diverse models that use a different underlying method are probably less able to collude with each other or communicate via steganography if they have no shared context.
You could use the resulting diverse AGI-grade models in a polyglot redundant system architecture. I have read of this but don’t believe humans have ever been able to afford to implement it: you write an avionics or other life-safety application several times, on different software stacks (different OS, different language); the versions run in lockstep on separate hardware, and the majority output goes to the control actuators.
For AI-driven robotics this would probably be immune to most adversarial inputs, among other advantages, and it makes deception and betrayal very difficult to pull off at all. For some applications, paying for triple the hardware and triple the AI license fees might be worth the increased reliability. This is also a simple measure humans could take if it turns out that deceptive AIs are an actual problem. I am sure you saw the paper where someone tried this with GPT-3.5 and GPT-4 for a mock ‘insert a backdoor’ task. Using Gemini would be more diverse, and using some non-LLM method to analyze code specifically for backdoors would likely be even better.
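A minimal sketch of the voting layer this implies follows. The model_a / model_b / model_c callables, the byte-string commands, and the failure behavior are hypothetical stand-ins I am inventing for illustration, not any real robotics or model API.

```python
# Sketch of a 2-out-of-3 voter across diverse models, in the spirit of the
# polyglot (N-version) redundancy described above. model_a/b/c are hypothetical
# stand-ins for independently built stacks; nothing here is a real API.
from collections import Counter
from typing import Callable, Sequence

def majority_output(voters: Sequence[Callable[[bytes], bytes]],
                    sensor_frame: bytes) -> bytes:
    """Run every implementation on the same input and emit a command only if a
    strict majority agrees; otherwise fail safe."""
    outputs = [voter(sensor_frame) for voter in voters]
    winner, count = Counter(outputs).most_common(1)[0]
    if count * 2 > len(outputs):
        return winner
    raise RuntimeError("no majority agreement; fall back to a safe controller")

# Hypothetical diverse implementations (different base models, different stacks).
def model_a(frame: bytes) -> bytes: return b"steer_left"
def model_b(frame: bytes) -> bytes: return b"steer_left"
def model_c(frame: bytes) -> bytes: return b"steer_right"  # the outlier is outvoted

command = majority_output([model_a, model_b, model_c], b"<camera frame>")
print(command)  # b'steer_left'
```

The point of the design is that a single deceptive or compromised version has to get at least one independently built version to emit the identical bad output at the same instant before anything reaches the actuators.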