Interesting. The specific idea you’re proposing here may or may not be workable, but it’s an intriguing example of a more general strategy that I’ve previously tried to articulate in another context. The idea is that it may be viable to use an AI to create a “platform” that accelerates human progress in an area of interest to existential safety, as opposed to using an AI to directly solve the problem or perform the action.
Essentially:
1. A “platform” for work in domain X is something that removes key constraints that would otherwise have consumed human time and effort when working in X. This allows humans to explore solutions in X they wouldn’t have previously — whether because they’d considered and rejected those solution paths, or because they’d subconsciously trained themselves not to look in places where the initial effort barrier was too high. Thus, developing an excellent platform for X allows humans to accelerate progress in domain X relative to other domains, ceteris paribus. (Every successful platform company does this. e.g., Shopify, Amazon, etc., make valuable businesses possible that wouldn’t otherwise exist.)
2. For certain carefully selected domains X, a platform for X may plausibly be relatively easier to secure & validate than an agent that’s targeted at some specific task x ∈ X would be. (Not easy; easier.) It’s less risky to validate the outputs of a platform and leave the really dangerous last-mile stuff to humans than it would be to give an end-to-end trained AI agent a pivotal command in the real world (e.g., “melt all GPUs”) that necessarily takes the whole system far outside its training distribution. Fundamentally, the bet is that if humans are the ones doing the out-of-distribution part of the work, then the output that comes out the other end is less likely to have been adversarially selected against us.
(Note that platforms are tools, and tools want to be agents, so a strategy like this is unlikely to arise along the “natural” path of capabilities progress other than transiently.)
There are some obvious problems with this strategy. One is that point 1 above is no help if you can’t tell which of the solutions the humans come up with are good, and which are bad. So the approach can only work on problems that humans would otherwise have been smart enough to solve eventually, given enough time to do so (as you already pointed out in your example). If AI alignment is such a problem, then it could be a viable candidate for such an approach. Ditto for a pivotal act.
Another obvious problem is that capabilities research might benefit from similar platforms just as much as alignment research can. So actually implementing this in the real world might just accelerate the timeline for everything, leaving us worse off. (Absent an intervention at some higher level of coordination.)
A third concern is that point 2 above could be flat-out wrong in practice. Asking an AI to build a platform means asking for generalization, even if it is just “generalization within X”, and that’s playing a lethally dangerous game. In fact, it might well be lethal for any useful X, though that isn’t currently obvious to me. e.g., AlphaFold2 is a primitive example of a platform that’s useful and non-dangerous, though it’s not useful enough for this.
On top of all that, there are all the steganographic considerations — AI embedding dangerous things in the tool itself, etc. — that you pointed out in your example.
But this strategy still seems like it could bring us closer to the Pareto frontier for critical domains (the alignment problem, a pivotal act) than directly training an AI to perform the dangerous action would.