We might summarise this counterargument to #30 as “verification is easier than generation”. The idea is that the AI comes up with a plan (+ explanation of how it works etc.) that the human systems could not have generated themselves, but that human systems can understand and check in retrospect.
The counterclaim to “verification is easier than generation” is that any pivotal act will involve plans whose effects human systems cannot predict just by looking at the plan. What about the explanation, though? I think the problem there may be more that we don’t know how to get the AI to produce a helpful and accurate explanation, as opposed to a bogus, misleading, but plausible-sounding one, not that no helpful explanation exists.
This seems to me like a case of the imaginary hypothetical “weak pivotal act” that nobody can ever produce. If you have a pivotal act you can do via following some procedure that only the AI was smart enough to generate, yet humans are smart enough to verify and smart enough to not be reliably fooled about, NAME THAT ACTUAL WEAK PIVOTAL ACT.
Okay, I will try to name a strong-but-checkable pivotal act.
(Having a strong-but-checkable pivotal act doesn’t necessarily translate into having a weak pivotal act. Checkability allows us to tell the difference between a good plan and a trapped plan with high probability, but the AI has no reason to give us a good plan. It will just produce output like “I have insufficient computing power to solve this problem” regardless of whether that’s actually true. If we’re unusually successful at convincing the AI our checking process is bad when it’s actually good, then that AI may give us a trapped plan, which we can then determine is trapped. Of course, one should not risk executing a trapped plan, even if one thinks one has identified and removed all the traps. So even if #30 is false, we are still default-doomed. (I’m not fully certain that we couldn’t create some kind of satisficing AI that gets reward 1 if it generates a safe plan, reward 0 if its output is neither helpful nor dangerous, and reward −1 if it generates a trapped plan that gets caught by our checking process. The AI may then decide that it has a higher chance of success if it just submits a safe plan. But I don’t know how one would train such a satisficer with current deep learning techniques.))
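A minimal sketch of the reward scheme just described, assuming a purely hypothetical classification of plan outcomes by our checking process; it says nothing about how one would actually train such a satisficer with current deep learning techniques:

```python
from enum import Enum


class PlanOutcome(Enum):
    SAFE = "safe"            # checking process finds no traps and the plan is helpful
    UNHELPFUL = "unhelpful"  # e.g. "I have insufficient computing power to solve this problem"
    TRAPPED = "trapped"      # checking process catches a trap


def satisficer_reward(outcome: PlanOutcome) -> int:
    """Reward scheme from the parenthetical above: +1 safe, 0 unhelpful, -1 caught trap.

    The case that actually worries us (a trapped plan that slips past the
    checkers) is deliberately absent here, which is exactly the part nobody
    knows how to handle in training.
    """
    return {
        PlanOutcome.SAFE: 1,
        PlanOutcome.UNHELPFUL: 0,
        PlanOutcome.TRAPPED: -1,
    }[outcome]
```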
The premise of this pivotal act is that even mere humans would be capable of designing very complex nanomachines, if only they could see the atoms in front of them, and observe the dynamics as they bounce and move around on various timescales. Thus, the one and only output of the AI will be the code for fast and accurate simulation of atomic-level physics. Being able to get quick feedback on what would happen if you designed such-and-such a thing not only helps with being able to check and iterate designs quickly, it means that you can actually do lots of quick experiments to help you intuitively grok the dynamics of how atoms move and bond.
This is kind of a long comment, and I predict the next few paragraphs will be review for many LW readers, so feel free to skip to the paragraph starting with “SO HOW ARE YOU ACTUALLY GOING TO CHECK THE MOLECULAR DYNAMICS SIMULATION CODE?”.
Picture a team of nano-engineers designing some kind of large and complicated nanomachine. Each engineer wears a VR headset so they can view the atomic structure they’re working on in 3D, and each has VR gloves with touch feedback so they can manipulate the atoms. The engineers all work on various components of the larger nanomachine that must be built. Often there are standardized interfaces for the transfer of information, energy, or charge. Other times, interfaces must be custom-designed for a particular purpose. Each component might connect to several of these interfaces, as well as being physically connected to the larger structure.
The hardest part of nanomachines is probably going to be the process of actually manufacturing them. The easiest route from current technology is to take advantage of our existing DNA synthesis tech to program ribosomes to produce the machines we want. The first stage of machines would be made from amino acids, but from there we could build machines that built other machines and bootstrap our way up to being able to build just about anything. This bootstrapping process would be more difficult than the mere design process for the final machine, and the first stage where we have to build things entirely out of amino acids sounds particularly brutal. But just as people could write the first compilers in machine code, it should be possible to figure out how to get things done even in the insanely constrained domain of amino acids. And this step only has to be done once before a whole world opens up.
The first obvious question is “what about quantum mechanics?”. The answer is that we don’t care too much about it. It makes computing the dynamics harder of course, but most systems interact frequently with their environment, and most nanomachines will probably interact with their environment frequently enough that they behave nearly classically. QM is important for determining what local chemical reactions take place, but there’s no long-range entanglement to worry about. That’s also helpful for allowing the human engineers to get a handle on the dynamics. The main effect of the frequent interaction with the environment is that the dynamics become somewhat stochastic. Often you will see engineers run the same simulation several times, so they can understand the distribution of possible outcomes. As the final design is approached, the engineers run it through thousands of simulations so that even rare thermal fluctuations are accounted for.
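To make “run it through thousands of simulations” concrete, here is a minimal Monte Carlo wrapper of the sort the engineers might use; `simulate` and `design_failed` are hypothetical stand-ins for the AI-supplied physics code and a human-written success criterion:

```python
import random
from typing import Callable


def estimate_failure_rate(
    simulate: Callable[[int], object],        # hypothetical AI-supplied simulator, seeded for reproducibility
    design_failed: Callable[[object], bool],  # human-written check on the final configuration
    n_runs: int = 5000,
) -> float:
    """Run the same design many times under different thermal-noise seeds and
    report how often it fails, so that rare fluctuations are accounted for."""
    failures = 0
    for seed in range(n_runs):
        final_state = simulate(seed)
        if design_failed(final_state):
            failures += 1
    return failures / n_runs


if __name__ == "__main__":
    # Toy stand-ins, just to show the shape of the loop:
    toy_simulate = lambda seed: random.Random(seed).random()
    toy_failed = lambda state: state > 0.999   # a "rare thermal fluctuation"
    print(estimate_failure_rate(toy_simulate, toy_failed))
```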
The zeroth obvious question is “what will you build with it?”. The answer is that a variety of things could be built, and there might be many side-projects produced to raise funds for the main project, but the main project is a scanner that can read a cryonics patient’s brain one layer at a time in sufficient detail to faithfully emulate them. This is a pretty standard pivotal act, and has been written about elsewhere, but the main idea is that we can freeze, scan, and maybe copy a bunch of alignment researchers and other smart folks, so that they can work on alignment much faster than the outside world is working on its unaligned AIs. Or, the researchers might be able to foom while maintaining their values by making careful changes to their own minds. Besides the scanner itself, this project would probably need to design extremely fast special-purpose computing hardware for running the ems, along with various kinds of tiny probes and other experimental equipment to collect data on how a scan translates to brain function. (An ordinary dumb neural network should be sufficient for this kind of inference problem.)
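As a gesture at why an ordinary dumb neural network might be enough for the scan-to-function inference step, here is a minimal supervised-regression sketch; the data are purely synthetic placeholders for scan features paired with probe measurements, and nothing about the real problem’s scale is implied:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Entirely synthetic stand-in data: rows are scanned tissue patches, columns are
# scan-derived features; targets are functional parameters measured by probes.
rng = np.random.default_rng(0)
scan_features = rng.normal(size=(10_000, 64))
true_map = rng.normal(size=(64, 8))
functional_params = scan_features @ true_map + 0.1 * rng.normal(size=(10_000, 8))

# An "ordinary dumb neural network": a small MLP doing supervised regression.
model = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=200, random_state=0)
model.fit(scan_features[:8_000], functional_params[:8_000])
print("held-out R^2:", model.score(scan_features[8_000:], functional_params[8_000:]))
```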
SO HOW ARE YOU ACTUALLY GOING TO CHECK THE MOLECULAR DYNAMICS SIMULATION CODE? There are two possible attack modes. One is to exploit any possible vulnerabilities in the hardware running the code to achieve unwanted side effects, and use those side effects to escape from containment. The second is to somehow make the output of the simulation subtly flawed, such that the humans, working off of flawed data, accidentally design a nanomachine that lets the AI out of the box instead of doing what we were designing that machine to do. Obviously the first attack mode is much more desirable for the AI than the second. Actually, the second attack mode may be so hard to pull off that we don’t need to check for it at all, but not checking for things when you have the capability to check for them is silly.
Checks for false physics that might trick the humans into building trapped nanomachines:
Compare with the real dynamics, which we know but which are slower for us to simulate. In real physics, we can break the world down into little cubes, and the dynamics are symmetric under translating those cubes in space and time. So if we request a simulation method that has those same symmetries, we can check the simulation on small length and time scales, which are feasible for us to simulate. This doesn’t guarantee that a larger simulation would be correct, since small errors can multiply out of control as the simulation size grows, but it does mean that the dynamics of the large simulation have to bear a pretty close resemblance to real physics.
We can also request that the code provably obeys conservation of momentum, energy, charge, etc.
Use differential fuzzing: use a fuzzer (implemented with techniques like GOFAI and dumb neural networks) to try to find inputs to the simulation for which the output of the AI’s code differs from the output of our slow-but-known-correct code. The stronger the optimizer we use here, the harder it is for the AI to avoid detection. (A toy sketch of this kind of check appears just after this list.)
Request that the AI give us error bounds on the simulation, along with machine-checkable proofs that they hold.
Side-channel attacks and their mitigation: the AI can steer us away from certain designs and towards others by making the simulation take different amounts of time depending on the configuration of atoms we submit. Therefore we request code that takes a constant amount of time to run, plus a proof that it does so. To make this easier, the code will be run on a GPU architecture with cache levels explicitly managed in the code, and where none of the cores have branch prediction.
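To make a couple of these checks concrete, here is a minimal differential-fuzzing sketch. Everything in it is a stand-in under stated assumptions: `fast_sim` is the AI-submitted simulator, `reference_sim` is our slow-but-trusted code, and `total_energy` is an optional trusted energy function, all treated as black boxes over an array of atomic positions, with boundary conditions ignored. It illustrates the checking idea, not the real harness:

```python
import numpy as np


def random_small_system(rng, n_atoms=16, box=5.0):
    """A tiny random configuration that the slow reference code can still handle."""
    return rng.uniform(0.0, box, size=(n_atoms, 3))


def differential_fuzz(fast_sim, reference_sim, total_energy=None,
                      n_trials=1000, tol=1e-6, seed=0):
    """Look for inputs where the AI-submitted simulator disagrees with the
    trusted-but-slow reference, breaks translation symmetry, or (if an energy
    function is supplied) fails a crude conservation check."""
    rng = np.random.default_rng(seed)
    suspicious = []
    for _ in range(n_trials):
        positions = random_small_system(rng)

        fast_out = fast_sim(positions)
        ref_out = reference_sim(positions)
        if not np.allclose(fast_out, ref_out, atol=tol):
            suspicious.append(("mismatch_with_reference", positions))
            continue

        # Translation symmetry: shifting every atom by the same vector should
        # shift the output by that same vector and change nothing else.
        shift = rng.uniform(-1.0, 1.0, size=3)
        if not np.allclose(fast_sim(positions + shift), fast_out + shift, atol=tol):
            suspicious.append(("translation_symmetry_violation", positions))

        # Crude conservation check: total energy before and after the simulated
        # step should agree, for an isolated toy system.
        if total_energy is not None:
            if abs(total_energy(fast_out) - total_energy(positions)) > tol:
                suspicious.append(("energy_conservation_violation", positions))

    return suspicious
```

In practice the uniform random sampling would be replaced by the strongest optimizer we can muster, as noted above, and the checks would be run at scales small enough for the reference code to keep up.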
The checks for the first attack method are pretty much just normal computer security, but at extremely high stakes. Just so we’re clear on what the picture is like here: generally the AI is not going to know in perfect detail what the hardware is like that its simulation code will be running on, though we at least have to give it the instruction set, or some high-level language that compiles to that instruction set. (Probably the high-level language is the better option. That way we can design the language so that properties like constant run-time and space-time symmetry are guaranteed to hold for all valid programs in that language.)

So the best technique for the AI will be to submit simulation code that has a hidden optimizer embedded in it. This optimizer then searches for possible escape methods while it runs. Even though different runs of the code don’t share memory, they have different inputs, which the optimizer can use as a random seed to determine which part of the escape-strategy search space it should be trying. If there are any holes at all in our security, the AI will be able to find and exploit them. All those engineers in VR headsets running simulations? In each case, the computer running the simulation has a hostile AI on it, trying everything it can think of to get out. In general, every aspect of the system must be proved correct, from the hardware to the high-level language we give to the AI. In addition, everything must be thoroughly unit tested, integration tested, fuzz tested, and handed to the best security researchers, who will try to find side-channel attacks and so on. Other people will write a proof that’s as close as humanly possible to an actual mathematical proof that no side-channel attacks are possible, and then everyone will try everything they can to poke holes in that proof.
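One way to make such properties hold for all valid programs is to make the language incapable of expressing programs that violate them. A toy sketch of that idea, purely illustrative and not a proposal for the actual language: an expression tree whose only loop construct has a fixed, literal trip count, so an operation-count bound can be computed (and in principle proved) before anything runs.

```python
from dataclasses import dataclass
from typing import Union

# A toy expression language with no unbounded control flow: the only loop
# construct repeats its body a fixed, literal number of times, so the total
# operation count is knowable before execution.

@dataclass
class Const:
    value: float

@dataclass
class Add:
    left: "Expr"
    right: "Expr"

@dataclass
class RepeatFixed:
    times: int          # must be a literal, not a computed value
    body: "Expr"

Expr = Union[Const, Add, RepeatFixed]

def op_count_bound(e: Expr) -> int:
    """Static upper bound on operations; well-defined for every valid program."""
    if isinstance(e, Const):
        return 1
    if isinstance(e, Add):
        return 1 + op_count_bound(e.left) + op_count_bound(e.right)
    if isinstance(e, RepeatFixed):
        return e.times * op_count_bound(e.body)
    raise TypeError(e)

# The bound is computable without running the program:
prog = RepeatFixed(times=1000, body=Add(Const(1.0), Const(2.0)))
print(op_count_bound(prog))  # 3000
```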
Okay, yeah, we get it, you’re going to try really, really hard. Don’t you know that the universe doesn’t owe you success just because of how hard you tried? No matter what security you put in place as a mere human, the AI is going to get around it just because it’s so much smarter. But even a superintelligent AI would not be able to break a one-time pad. Depending on the details of computational complexity, it probably wouldn’t be able to crack our best symmetric ciphers today either. For a superintelligent AI to succeed at a task, that task must actually be possible. If we can actually design a system with no holes, which takes a large but finite amount of work and intelligence, then a superintelligence will not be able to break it, no matter how much smarter than us it is.
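The one-time-pad point is information-theoretic rather than about computational effort: for any ciphertext, every same-length plaintext is explained by some key, so no amount of intelligence helps. A tiny illustration (the messages are arbitrary placeholders):

```python
import secrets

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

plaintext = b"LAUNCH THE SCANNER"
key = secrets.token_bytes(len(plaintext))
ciphertext = xor(plaintext, key)

# For ANY candidate plaintext of the same length, there exists a key that
# produces this exact ciphertext, so the ciphertext carries no information
# about which candidate is real, no matter how smart the attacker is.
candidate = b"ORDER MORE STAPLES"
assert len(candidate) == len(plaintext)
explaining_key = xor(ciphertext, candidate)
assert xor(candidate, explaining_key) == ciphertext
```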
Well, I had to think about this for longer than five seconds, so that’s already a huge victory.
If I try to compress your idea down to a few sentences:
The humans ask the AI to produce design tools, rather than designs, such that there’s a bunch of human cognition that goes into picking out the particular atomic arrangements or synthesis pathways; and we can piecewise verify that the tool is making accurate predictions; and the tool is powerful enough that we can build molecular nanotech and an uploader by using the tool for an amount of time too short for Facebook to catch up and destroy the world. The AI that does this is purportedly sufficiently good at meta-engineering to build the tool, but not good enough at larger strategy that it can hack its way through the humans using just the code of the tool. The way in which this attacks a central difficulty is by making it harder for the AI to just build unhelpful nanotech using the capabilities that humans use to produce helpful nanotech.
Sound about right?

Yes, sounds right to me. It’s also true that one of the big unproven assumptions here is that we could create an AI strong enough to build such a tool, but too weak to hack humans. I find it plausible, personally, but I don’t yet have an easy-to-communicate argument for it.

Why can’t a narrow AI (maybe like Drexler’s proposal) create the tool safely?
I don’t know of a reason we couldn’t do this with a narrow AI. I have no idea how, but it’s possible in principle so far as I know. If anyone can figure out how, they could plausibly execute the pivotal act described above, which would be a very good thing for humanity’s chances of survival.
EDIT: Needless to say, but I’ll say it anyway: doing this via narrow AI is vastly preferable to using a general AI. It’s both much less risky and spares you from expending an insane amount of effort on checking.
The humans ask the AI to produce design tools, rather than designs (...) we can piecewise verify that the tool is making accurate predictions (...) The way in which this attacks a central difficulty is by making it harder for the AI to just build unhelpful nanotech
I think this is a good way to put things, and it’s a concept that can be made more general and built upon.
Like, we can also have AIs produce:
Tools that make other tools
Tools that help to verify other tools
Tools that look for problems with other tools (in ways that don’t guarantee finding all problems, but can help find many)
Tools that help approximate brain emulations (or get us part of the way there), or predict what a human would say when responding to questions in some restricted domain
Etc, etc
Maybe you have already thought through such strategies very extensively, but AFAIK you don’t make that clear in any of your writings, and it takes a non-trivial amount of inferential distance to realize the full power of techniques like these.
I have written more about this concept in this post in this series. I’m not sure whether or not any of the concepts/ideas in the series are new, but it seems to me that several of them at the very least are under-discussed.
Useless; none of these abstractions help find an answer.

From what I know of security, any system requiring secrecy is already implicitly flawed.

(Naturally, if this doesn’t apply and you backchanneled your idea for some legitimate meta-reason, I withdraw my objection.)

I think secrecy is rarely a long-term solution because it’s fragile, but it can definitely have short-term uses? For example, I’m sure that some insights into AI have the capacity to advance both alignment and capabilities; if you have such an insight then you might want to share it secretly with alignment researchers while avoiding sharing it publicly because you’d rather Facebook AI not enhance its capabilities. And so the secrecy doesn’t have to be a permanent load-bearing part of a system; instead it’s just that every day the secrecy holds up is one more day you get to pull ahead of Facebook.
Interesting. The specific idea you’re proposing here may or may not be workable, but it’s an intriguing example of a more general strategy that I’ve previously tried to articulate in another context. The idea is that it may be viable to use an AI to create a “platform” that accelerates human progress in an area of interest to existential safety, as opposed to using an AI to directly solve the problem or perform the action.
Essentially:
1. A “platform” for work in domain X is something that removes key constraints that would otherwise have consumed human time and effort when working in X. This allows humans to explore solutions in X they wouldn’t have previously, whether because they’d considered and rejected those solution paths, or because they’d subconsciously trained themselves not to look in places where the initial effort barrier was too high. Thus, developing an excellent platform for X allows humans to accelerate progress in domain X relative to other domains, ceteris paribus. (Every successful platform company does this: Shopify, Amazon, etc. make valuable businesses possible that wouldn’t otherwise exist.)
2. For certain carefully selected domains X, a platform for X may plausibly be relatively easier to secure and validate than an agent targeted at some specific task x ∈ X would be. (Not easy; easier.) It’s less risky to validate the outputs of a platform and leave the really dangerous last-mile work to humans than it would be to give an end-to-end-trained AI agent a pivotal command in the real world (e.g., “melt all GPUs”) that necessarily takes the whole system far outside its training distribution. Fundamentally, the bet is that if humans are the ones doing the out-of-distribution part of the work, then the output that comes out the other end is less likely to have been adversarially selected against us.
(Note that platforms are tools, and tools want to be agents, so a strategy like this is unlikely to arise along the “natural” path of capabilities progress other than transiently.)
There are some obvious problems with this strategy. One is that point 1 above is no help if you can’t tell which of the solutions the humans come up with are good, and which are bad. So the approach can only work on problems that humans would otherwise have been smart enough to solve eventually, given enough time to do so (as you already pointed out in your example). If AI alignment is such a problem, then it could be a viable candidate for such an approach. Ditto for a pivotal act.
Another obvious problem is that capabilities research might benefit from the same kinds of platforms that alignment research would. So actually implementing this in the real world might just accelerate the timeline for everything, leaving us worse off. (Absent an intervention at some higher level of coordination.)
A third concern is that point 2 above could be flat-out wrong in practice. Asking an AI to build a platform means asking for generalization, even if it is just “generalization within X”, and that’s playing a lethally dangerous game. In fact, it might well be lethal for any useful X, though that isn’t currently obvious to me. For example, AlphaFold2 is a primitive example of a platform that’s useful and non-dangerous, though it isn’t useful enough for this.
On top of all that, there are all the steganographic considerations (the AI embedding dangerous things in the tool itself, etc.) that you pointed out in your example.
But this strategy still seems like it could bring us closer to the Pareto frontier for critical domains (the alignment problem, a pivotal act) than directly training an AI to do the dangerous action would.
If you have a pivotal act you can do via following some procedure that only the AI was smart enough to generate, yet humans are smart enough to verify and smart enough to not be reliably fooled about, NAME THAT ACTUAL WEAK PIVOTAL ACT.
I don’t claim to have a solution where every detail is filled in, or where I have watertight arguments showing that it’s guaranteed to work (if executed faithfully).
But I think I have something that could be built upon: the outlines of a potential solution.
And by “solution”, I mean a pivotal strategy (consisting of many acts that could be done over a short amount of time), where we can verify output extensively and hopefully (probably?) avoid being fooled/manipulated/tricked/”hacked”.
I’m writing a series about this here. Only 2 parts finished so far (current plan is to write 4).
I must say, you have a very pessimistic/optimistic view of whether an AI would be able to solve P=NP. I won’t say you’re completely wrong, as there’s always a chance that P does equal NP. But I would be very careful about predicting anything based on the possibility of P=NP.
I think P?=NP is a distraction. Like, it’s not very useful to ask the question of whether Lee Sedol played a ‘polynomial’ number of games of Go, and AlphaGo played a ‘nonpolynomial’ number of games of Go. AlphaGo played more games and had a more careful and precise memory, and developed better intuitions, and could scale to more hardware better.