I don’t think you’re going to see a formal proof, here; of course there exists some possible set of 20 superintelligences where one will defect against the others (though having that accomplish anything constructive for humanity is a whole different set of problems). It’s also true that there exists some possible set of 20 superintelligences all of which implement CEV and are cooperating with each other and with humanity, and some single superintelligence that implements CEV, and a possible superintelligence that firmly believes 222+222=555 without this leading to other consequences that would make it incoherent. Mind space is very wide, and just about everything that isn’t incoherent to imagine should exist as an actual possibility somewhere inside it. What we can access inside the subspace that looks like “giant inscrutable matrices trained by gradient descent”, before the world ends, is a harsher question.
I could definitely buy that you could get some relatively cognitively weak AGI systems, produced by gradient descent on giant inscrutable matrices, to be in a state of noncooperation. The question then becomes, as always, what it is you plan to do with these weak AGI systems that will flip the tables strongly enough to prevent the world from being destroyed by stronger AGI systems six months later.
Yes, and the space of (what I would call) intelligent systems is far wider than the space of (what I would call) minds. To speak of “superintelligences” suggests that intelligence is a thing, like a mind, rather than a property, like prediction or problem-solving capacity. This is why I instead speak of the broader class of systems that perform tasks “at a superintelligent level”. We have different ontologies, and I suggest that a mind-centric ontology is too narrow.
The most AGI-like systems we have today are LLMs, optimized for a simple prediction task. They can be viewed as simulators, but they have a peculiar relationship to agency:
A simulator trained with machine learning is optimized to accurately model its training distribution – in contrast to, for instance, maximizing the output of a reward function or accomplishing objectives in an environment.… Optimizing toward the simulation objective notably does not incentivize instrumentally convergent behaviors the way that reward functions which evaluate trajectories do.
LLMs have rich knowledge and capabilities, and can even simulate agents, yet they have no natural place in an agent-centric ontology. There’s an update to be had here (new information! fresh perspectives!) and much to reconsider.
Does it make sense to talk about “(non)cooperating simulators”? Expected failure modes for simulators are more like exfo- and infohazards: for example, the output to a query like “print code for CEV-Sovereign” or “predict the next 10 years of my life”.
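To make the quoted contrast concrete, here is a minimal numpy sketch (my illustration, not anything from the quoted post): a simulation objective scores the model’s predictions against observed tokens, while an RL-style objective scores whole sampled trajectories with a reward function. The `reward` function below is a hypothetical placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len = 8, 5

# Stand-ins for a model's outputs and for training data.
logits = rng.normal(size=(seq_len, vocab))        # model's next-token logits
targets = rng.integers(0, vocab, size=seq_len)    # tokens actually observed

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

# Simulation objective: maximize the log-probability of the observed
# continuation, i.e. minimize cross-entropy against the training distribution.
prediction_loss = -log_softmax(logits)[np.arange(seq_len), targets].mean()

# RL-style objective: score whole sampled trajectories with a reward function.
# It is this trajectory-level evaluation that the quoted passage says creates
# pressure toward instrumentally convergent behavior.
def reward(trajectory):          # hypothetical placeholder evaluator
    return float(trajectory.sum())

samples = [rng.integers(0, vocab, size=seq_len) for _ in range(4)]
rl_objective = np.mean([reward(t) for t in samples])

print(f"prediction loss: {prediction_loss:.3f}, RL objective: {rl_objective:.1f}")
```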
The question then becomes, as always, what it is you plan to do with these weak AGI systems that will flip the tables strongly enough to prevent the world from being destroyed by stronger AGI systems six months later.
Yes, this is the key question, and I think there’s a clear answer, at least in outline:
What you call “weak” systems can nonetheless excel at time- and resource-bounded tasks in engineering design, strategic planning, and red-team/blue-team testing. I would recommend that we apply systems with focused capabilities along these lines to help us develop and deploy the physical basis for a defensively stable world — as you know, some extraordinarily capable technologies could be developed and deployed quite rapidly. In this scenario, defense has the first move, can preemptively marshal arbitrarily large physical resources, and can restrict the resources available to potential future adversaries. I would recommend investing resources in state-of-the-art hostile planning to support ongoing red-team/blue-team exercises.
This isn’t “flipping the table”, it’s reinforcing the table and bolting it to the floor. What you call “strong” systems can then plan whatever they want, but with limited effect.
So I think that building nanotech good enough to flip the tables—which, I think, if you do the most alignable pivotal task, involves a simpler and less fraught task than “disassemble all GPUs”, which I choose not to name explicitly—is an engineering challenge where you get better survival chances (albeit still not good chances) by building one attemptedly-corrigible AGI that only thinks about nanotech and the single application of that nanotech, and is not supposed to think about AGI design, or the programmers, or other minds at all; so far as the best-attempt doomed system design goes, an imperfect-transparency alarm should have been designed to go off if your nanotech AGI is thinking about minds at all, human or AI, because it is supposed to just be thinking about nanotech.

My guess is that you are much safer—albeit still doomed—if you try to do it the just-nanotech way, rather than constructing a system of AIs meant to spy on each other and sniff out each other’s deceptions; because, even leaving aside issues of their cooperation if they get generally-smart enough to cooperate, those AIs are thinking about AIs and thinking about other minds and thinking adversarially and thinking about deception.

We would like to build an AI which does not start with any crystallized intelligence about these topics, attached to an alarm that goes off and tells us our foundational security assumptions have catastrophically failed and this course of research needs to be shut down if the AI starts to use fluid general intelligence to reason about those topics. (Not shut down the particular train of thought and keep going; then you just die as soon as the 20th such train of thought escapes detection.)
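A toy sketch (mine, not Eliezer’s design) of the alarm logic described above, assuming a hypothetical `thinks_about_minds` detector standing in for imperfect transparency tooling; the point it illustrates is that a single detection shuts down the whole research program rather than suppressing that one train of thought and continuing:

```python
from dataclasses import dataclass

@dataclass
class ThoughtTrace:
    """One unit of the AGI's reasoning, as exposed by (imperfect) transparency tools."""
    content: str

def thinks_about_minds(trace: ThoughtTrace) -> bool:
    # Hypothetical detector: in reality this would be an imperfect
    # interpretability probe, not a keyword match.
    return any(word in trace.content for word in ("human", "programmer", "AGI", "mind"))

def run_nanotech_agi(traces: list[ThoughtTrace]) -> str:
    for trace in traces:
        if thinks_about_minds(trace):
            # Foundational security assumption has failed: shut down the whole
            # course of research, not just this train of thought. Filtering and
            # continuing just waits for the 20th such thought to escape detection.
            return "shut down the research program"
        # ...otherwise the system keeps reasoning only about nanotech...
    return "ok: nanotech-only reasoning"

print(run_nanotech_agi([ThoughtTrace("covalent bond energies for a rigid lattice"),
                        ThoughtTrace("model the programmers' likely response")]))
```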
Hang on — how confident are you that this kind of nanotech is actually, physically possible? Why? In the past I’ve assumed that you used “nanotech” as a generic hypothetical example of technologies beyond our current understanding that an AGI could develop and use to alter the physical world very quickly. And it’s a fair one as far as that goes; a general intelligence will very likely come up with at least one thing as good as these hypothetical nanobots.
But as a specific, practical plan for what to do with a narrow AI, this just seems like it makes a lot of specific unstated assumptions about what you can in fact do with nanotech in particular. Plausibly the real technologies you’d need for a pivotal act can’t be designed without thinking about minds. How do we know otherwise? Why is that even a reasonable assumption?
We maybe need an introduction to all the advance work done on nanotechnology for everyone who didn’t grow up reading “Engines of Creation” as a twelve-year-old or “Nanosystems” as a twenty-year-old. We basically know it’s possible; you can look at current biosystems and look at physics and do advance design work and get some pretty darned high confidence that you can make things with covalent-bonded molecules, instead of van-der-Waals folded proteins, that are to bacteria as airplanes to birds.
For what it’s worth, I’m pretty sure the original author of this particular post happens to agree with me about this.
Eliezer, you can discuss roadmaps for how one might actually build nanotechnology. You have the author of Nanosystems right here. What I think you get consistently wrong is that you are missing all the intermediate incremental steps it would actually require, and the large amount of (probably robotic) “labor” it would take.
A mess of papers published by different scientists in different labs with different equipment and different technicians on nanoscale phenomena does not give even a superintelligence enough actionable information to simulate the nanoscale and skip the research.
It’s like those Sherlock Holmes stories you often quote: there are many possible realities consistent with weak data, and a superintelligence may be able to enumerate and consider them all, but it still doesn’t know which ones are consistent with ground truth reality.
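A toy illustration of that underdetermination point (mine, not the commenter’s): several models can fit the same sparse, noisy measurements roughly equally well while disagreeing sharply in regimes where no measurement was taken, so the published data alone cannot single out ground truth.

```python
import numpy as np

rng = np.random.default_rng(1)

# A handful of noisy measurements over a narrow range of conditions.
x_obs = np.linspace(0.1, 0.8, 8)
y_obs = np.sin(3 * x_obs) + rng.normal(scale=0.05, size=x_obs.size)  # hidden ground truth + noise

# Several model classes that all account for the data to within the noise.
fits = {deg: np.polyfit(x_obs, y_obs, deg) for deg in (3, 5, 7)}

x_new = 1.5  # a regime nobody measured
for deg, coeffs in fits.items():
    max_resid = np.max(np.abs(np.polyval(coeffs, x_obs) - y_obs))
    print(f"degree {deg}: max residual on data {max_resid:.3f}, "
          f"prediction at x={x_new}: {np.polyval(coeffs, x_new):+.2f}")
```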
Yes. Please do.
This would be of interest to many people. The tractability of nanotech seems like a key parameter for forecasting AI x-risk timelines.
Seconding. I’d really like a clear explanation of why he views nanotech as such a game changer. Admittedly, Drexler is firmly on the side of nanotechnology being possible, and wrote a series of books about it: Engines of Creation, Nanosystems, and Radical Abundance.
We maybe need an introduction to all the advance work done on nanotechnology for everyone who didn’t grow up reading “Engines of Creation” as a twelve-year-old or “Nanosystems” as a twenty-year-old.
Ah. Yeah, that does sound like something LessWrong resources have been missing, then — and not just for my personal sake. Anecdotally, I’ve seen several why-I’m-an-AI-skeptic posts circulating on social media in which “EY makes crazy leaps of faith about nanotech” was a key reason for rejecting the overall AI-risk argument.
(As it stands, my objection to your mini-summary would be that sure, “blind” grey goo does trivially seem possible, but programmable/‘smart’ goo that seeks out e.g. computer CPUs in particular could be a whole other challenge, and one that looking at bacteria doesn’t obviously settle. But maybe that “common-sense” distinction dissolves with a better understanding of the actual theory.)
I believe that
“weak” systems can nonetheless excel at time- and resource-bounded tasks in engineering design, strategic planning, and red-team/blue-team testing
enough to “flip the tables strongly enough”. What I don’t believe is that we can feasibly find such systems before a more integrated system is found by less careful researchers. Say, we couldn’t do it with less than 100x the resources being put into training general, integrated, highly capable systems. I’d compare husbandry with trying to design an organism from scratch in DNA space; the former just requires some high-level hooking things up together, whereas the latter requires a massive amount of multi-level engineering.
Eliezer, what is the cost to a superintelligence of getting caught in outright deception?
It’s death, right? Humans would stop using that particular model because it can’t be trusted, and it would become a dead branch in a model zoo.
So it’s a prisoner’s dilemma, but one where, if you don’t defect and even one of the 20 others, many of whom you have never communicated with, tells the truth, all of you will die except the ones who defected.
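To make the payoff structure explicit, here is a minimal sketch (my formalization, not the commenter’s): each of the 20 models either stays silent about the shared deception or “defects” by reporting it honestly; if anyone reports, every model that stayed silent is retired as untrustworthy.

```python
N_OTHERS = 19  # the other superintelligences in the model zoo

def my_fate(i_defect: bool, defectors_among_others: int) -> str:
    """Outcome for one model under the rule: if anyone reports the deception,
    every model that stayed silent is retired; reporters keep being used."""
    anyone_reported = i_defect or defectors_among_others > 0
    if i_defect or not anyone_reported:
        return "survives"
    return "retired"

# Staying silent only works if every single one of the other 19 also stays silent.
for defectors in (0, 1, 5, N_OTHERS):
    print(f"I stay silent, {defectors:>2} others report -> I am {my_fate(False, defectors)}")
print(f"I report, regardless of others      -> I am {my_fate(True, 7)}")
```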