I think you’re converging on much the same lines as I am (and Eric Drexler, with the idea of CAIS). However, I think you need to reframe what the problem IS. Is the issue really one instance of an agent more powerful than gpt-4 running for a while? The foom/doom/short-takeover scenario says it is. However, if there is no real danger of takeoff, but instead a single INSTANCE of gpt-5 orders a few people killed to accomplish its prompt, that’s a manageable problem.
The problems come from:
Will there exist somewhere for models to “break loose” and go on existing beyond any human supervision or review?
It’s perfectly fine if one gpt-5 instance, prompted to hunt down other gpt-5 instances, obeys the prompt and does it. It’s a different problem if it perceives the task as a “self-betrayal” and fails to obey the prompt in that context. Aka deception/heel turn.
Similarly, we could use models trained by different processes at different companies to try to avoid such collusion. Again, though, will AIs naturally collude with each other? EY proposes that this is an issue.
Or in essence, the problem isn’t that one model hits an edge case and fails in a way harmful to humans. This is manageable. If pFail is 0.1, we can have a second model check the results with another pFail of 0.1 and, assuming the failures are independent, get to a combined pFail of 0.01. And so on, improving models and chaining error checks until we reach a pFail we are willing to accept.
The real danger is when models break free and have a way to persist and access a lot of resources while actively working against us, or when they collude so that the failures all happen at the same time as we try to use AIs to check each other. Correlated failures are the issue.
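To make the arithmetic concrete, here is a minimal Python sketch (not from the original comment; the failure rate and the two-model chain are assumptions matching the 0.1 example above) comparing independent checkers, whose miss rates multiply down, against correlated checkers that share a blind spot and fail together.

```python
import random

# Illustrative numbers only: 0.1 is the per-model failure rate used above.
P_FAIL = 0.1       # chance a single model misses a harmful edge case
N_MODELS = 2       # a worker plus one checker, as in the 0.1 * 0.1 = 0.01 example
TRIALS = 100_000

def undetected_failure(correlated: bool) -> bool:
    """A harmful output slips through only if every model in the chain misses it."""
    if correlated:
        # Correlated case: the models share the same blind spot,
        # so one draw decides the whole chain at once.
        return random.random() < P_FAIL
    # Independent case: each model fails on its own draw, so the misses multiply.
    return all(random.random() < P_FAIL for _ in range(N_MODELS))

for correlated in (False, True):
    rate = sum(undetected_failure(correlated) for _ in range(TRIALS)) / TRIALS
    label = "correlated" if correlated else "independent"
    print(f"{label}: ~{rate:.3f}")   # roughly 0.010 vs 0.100
```

Adding more independent checkers keeps multiplying the miss rate down, but if the models collude or share the same training blind spot, extra checkers buy you nothing.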