Given reasonable computational time (say, a month), can the AI, using my chatlog alone, guess my password right on the first guess?
“using my chatlog alone” appears to be doing a lot of work in this example. Human-built computer systems are notoriously bug-filled and exploitable, even by other humans. Why would an AI not also be capable of exploiting such vulnerabilities?[1]
Explorations of and arguments about limits of physical possibility based on computational physics and other scientific domains can lead to valuable research and interesting discussion, and I’m with you up until point (4) in your summary. But for forecasting the capabilities and actions of a truly smarter-than-you adversarial agent, it’s important to confront the problem under the widest possible threat model, in the least convenient possible world and under the highest degree of difficulty.
This post is a great example of the kind of object-level argument I gesture at in this recently-published post. My point there is mainly: I think concrete, science-backed explorations of the limits of what is possible and tractable are great tools for world model building. But I find them pretty uncompelling when used as forecasts about how AGI takeover is likely to go, or as arguments for why such takeover is unlikely. I think an analogy to computer security is a good way of explaining this intuition. From another recent post of mine:
Side-channels are ubiquitous attack vectors in the field of computer security and cryptography. Timing attacks and other side-effect based attacks can render cryptographic algorithms which are provably secure under certain threat models, completely insecure when implemented on real hardware, because the vulnerabilities are at lower levels of abstraction than those considered in the threat model.
Proving that something is computationally intractable under a certain restricted model only means that the AI must find a way to step outside of your model, or do something else you didn’t think of.
Many vulnerabilities are only discoverable by humans when those humans have access to source code or at least binaries of the system under target. But this also doesn’t seem like a fatal problem for the AI: even if the exact source code for the system the AI is running on, and / or the code for the system protecting the password, does not appear in the AI’s training data, source code for many similar systems likely does.
I do agree that trying to hack the password is a smarter method for the AI to try. I was simply showing an example of a task that an AI would want to do, but be unable to due to computational intractability.
I chose the example of Yudkowsky’s plan for my analysis because he has described it as his “lower bound” plan. After spending two decades on AI safety, talking to all the most brilliant minds in the field, this is apparently what he thinks the most convincing plan for AI takeover is. If I believe this plan is intractable (and I very much believe it is), then it opens up the possibility that all such plans are intractable. And if you do find a tractable plan, then making the plan intractable would an invaluable AI safety cause area.
Proving that something is computationally intractable under a certain restricted model only means that the AI must find a way to step outside of your model, or do something else you didn’t think of.
Imagine if I made the claim that a freshly started AGI in a box could kill everyone on earth in under an minute. I propose that it creates some sort of gamma ray burst that hits everyone on earth simultaneously. You come back to me with a detailed proof that that plan is bonkers and wouldn’t work. I then respond “sure, that wouldn’t work, but the AI is way smarter than me, so it would figure something else out”.
My point is that, factually, some tasks are impossible. My belief is that a computationally tractable plan for guaranteeing success at x-risk does not currently exist, although I think a plan with like a 0.01% chance of success might. If you think otherwise, you have to actually prove it, not just assume it.
Well, “opens up the possibility that all such plans are intractable” is a much weaker claim than “impossible”, and I disagree about the concrete difficulty of at least one of the step in your plan: there are known toxins with ~100% lethality to humans in nature.
Distributing this toxin via a virus engineered using known techniques from GoF research and some nanotechnology for a timer seems pretty tractable, and close enough to 100% lethal to me.
The tech to build a timer circuit out RNA and ATP instead of in silicon and electricity doesn’t currently exist yet AFAIK, but the complexity, size, and energy constraints that such a timer design must meet are certainly tractable to design at nanoscale in silicon. Moving to a biological substrate might be hard, but knowing a bit about what hardware engineers are capable of doing with silicon, often with extremely limited energy budgets, it certainly doesn’t seem intractable for human designers, let alone for an ASI, to do similar things with biology.
So I’m a bit skeptical of your estimate of the other steps as “probably incomputable”!
Also, a more general point: you’ve used “incomputable” throughout, in what appears to be an informal way of saying “computationally intractable”.
In computational complexity theory, “uncomputable”, “undecidable”, “NP-complete”, and Big-O notation have very precise technical meanings: they are statements about the limiting behavior of particular classes of problems. They don’t necessarily imply anything about particular concrete instances of such problems.
So it’s not just that there are good approximations for solving the traveling salesman problem in general or probabilistically, which you correctly note.
It’s that, for any particular instance of the traveling salesman problem (or any other NP-hard problem), approximating or solving that particular instance may be tractable or even trivial, for example, by applying a specialized algorithm, or because the particular instance of the problem you need to solve has exploitable regularities or is otherwise degenerate in some way.
The same is true of e.g. the halting problem, which is provably undecidable in general! And yet, many programs that we care about can be proved to halt, or proved not to halt, in very reasonable amounts of time, often trivially by running them, or by inspection of their source. In fact, for a given randomly chosen program (under certain sampling assumptions), it is overwhelmingly likely that whether it halts or not is decidable. See the reference in this footnote for more.
The point of all of this is that I think saying something is “probably incomputable” is just too imprecise and informal to be useful as a bound the capabilities of a superintelligence (or even on human designers, for that matter), and trying to make the argument more precise probably causes it to break down, or requires a formulation of the problem in a domain where results from computational complexity theory are simply not applicable.
“using my chatlog alone” appears to be doing a lot of work in this example. Human-built computer systems are notoriously bug-filled and exploitable, even by other humans. Why would an AI not also be capable of exploiting such vulnerabilities?[1]
Explorations of and arguments about limits of physical possibility based on computational physics and other scientific domains can lead to valuable research and interesting discussion, and I’m with you up until point (4) in your summary. But for forecasting the capabilities and actions of a truly smarter-than-you adversarial agent, it’s important to confront the problem under the widest possible threat model, in the least convenient possible world and under the highest degree of difficulty.
This post is a great example of the kind of object-level argument I gesture at in this recently-published post. My point there is mainly: I think concrete, science-backed explorations of the limits of what is possible and tractable are great tools for world model building. But I find them pretty uncompelling when used as forecasts about how AGI takeover is likely to go, or as arguments for why such takeover is unlikely. I think an analogy to computer security is a good way of explaining this intuition. From another recent post of mine:
Proving that something is computationally intractable under a certain restricted model only means that the AI must find a way to step outside of your model, or do something else you didn’t think of.
Many vulnerabilities are only discoverable by humans when those humans have access to source code or at least binaries of the system under target. But this also doesn’t seem like a fatal problem for the AI: even if the exact source code for the system the AI is running on, and / or the code for the system protecting the password, does not appear in the AI’s training data, source code for many similar systems likely does.
I do agree that trying to hack the password is a smarter method for the AI to try. I was simply showing an example of a task that an AI would want to do, but be unable to due to computational intractability.
I chose the example of Yudkowsky’s plan for my analysis because he has described it as his “lower bound” plan. After spending two decades on AI safety, talking to all the most brilliant minds in the field, this is apparently what he thinks the most convincing plan for AI takeover is. If I believe this plan is intractable (and I very much believe it is), then it opens up the possibility that all such plans are intractable. And if you do find a tractable plan, then making the plan intractable would an invaluable AI safety cause area.
Imagine if I made the claim that a freshly started AGI in a box could kill everyone on earth in under an minute. I propose that it creates some sort of gamma ray burst that hits everyone on earth simultaneously. You come back to me with a detailed proof that that plan is bonkers and wouldn’t work. I then respond “sure, that wouldn’t work, but the AI is way smarter than me, so it would figure something else out”.
My point is that, factually, some tasks are impossible. My belief is that a computationally tractable plan for guaranteeing success at x-risk does not currently exist, although I think a plan with like a 0.01% chance of success might. If you think otherwise, you have to actually prove it, not just assume it.
Well, “opens up the possibility that all such plans are intractable” is a much weaker claim than “impossible”, and I disagree about the concrete difficulty of at least one of the step in your plan: there are known toxins with ~100% lethality to humans in nature.
Distributing this toxin via a virus engineered using known techniques from GoF research and some nanotechnology for a timer seems pretty tractable, and close enough to 100% lethal to me.
The tech to build a timer circuit out RNA and ATP instead of in silicon and electricity doesn’t currently exist yet AFAIK, but the complexity, size, and energy constraints that such a timer design must meet are certainly tractable to design at nanoscale in silicon. Moving to a biological substrate might be hard, but knowing a bit about what hardware engineers are capable of doing with silicon, often with extremely limited energy budgets, it certainly doesn’t seem intractable for human designers, let alone for an ASI, to do similar things with biology.
So I’m a bit skeptical of your estimate of the other steps as “probably incomputable”!
Also, a more general point: you’ve used “incomputable” throughout, in what appears to be an informal way of saying “computationally intractable”.
In computational complexity theory, “uncomputable”, “undecidable”, “NP-complete”, and Big-O notation have very precise technical meanings: they are statements about the limiting behavior of particular classes of problems. They don’t necessarily imply anything about particular concrete instances of such problems.
So it’s not just that there are good approximations for solving the traveling salesman problem in general or probabilistically, which you correctly note.
It’s that, for any particular instance of the traveling salesman problem (or any other NP-hard problem), approximating or solving that particular instance may be tractable or even trivial, for example, by applying a specialized algorithm, or because the particular instance of the problem you need to solve has exploitable regularities or is otherwise degenerate in some way.
The same is true of e.g. the halting problem, which is provably undecidable in general! And yet, many programs that we care about can be proved to halt, or proved not to halt, in very reasonable amounts of time, often trivially by running them, or by inspection of their source. In fact, for a given randomly chosen program (under certain sampling assumptions), it is overwhelmingly likely that whether it halts or not is decidable. See the reference in this footnote for more.
The point of all of this is that I think saying something is “probably incomputable” is just too imprecise and informal to be useful as a bound the capabilities of a superintelligence (or even on human designers, for that matter), and trying to make the argument more precise probably causes it to break down, or requires a formulation of the problem in a domain where results from computational complexity theory are simply not applicable.