This is an argument I don’t think I’ve seen made, or at least not made as strongly as it should be. So I will present it as starkly as possible. It is certainly a basic one.
The question I am asking is: is the conclusion below correct, that alignment is fundamentally impossible for any AI built by current methods? And, by contraposition, that alignment is achievable, if at all, only for an AI built by deliberate construction? GOFAI never got very far, but that only shows that its practitioners never found the right ideas.
The argument:
A trained ML model is an uninterpreted pile of numbers representing the program it has been trained to be. By Rice’s theorem, no nontrivial fact can be proved about an arbitrary program. Therefore no attempt at alignment that trains a model and then tries to prove it safe can work.
Provably correct software is not and cannot be created by writing code without much concern for correctness and then trying to make it correct (despite that being pretty much how most non-life-critical software is built). A fortiori, a pile of numbers generated by a training process cannot be tweaked into correctness, cannot be understood, and cannot be proved to satisfy anything.
Provably correct software can only be developed by building from the outset with correctness in mind.
This is also true if "correctness" is replaced by "security".
No concern for correctness enters into the process of training any sort of ML model. There is generally a criterion for judging the output; that is how it is trained. But that only measures performance (how often it is right on the test data), not correctness (whether it is necessarily right for all possible inputs). For, as the saying goes, testing can only show the presence of faults, never their absence. (A minimal illustration of this gap follows the argument.)
Therefore no AI built by current methods can be aligned.
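To make the performance-versus-correctness gap concrete, here is a minimal, hypothetical illustration (the filter, its test suite, and its bug are all invented for this sketch): the function passes every test it is judged on, yet is wrong on an input the tests never exercise.

```python
# Hypothetical toy "safety filter": 100% correct on its test suite,
# still wrong on an input the suite never exercises.

def looks_safe(request: str) -> bool:
    """Intended to flag any request mentioning 'explosives'."""
    return "explosives" not in request.split()   # bug: misses case/punctuation

TEST_SUITE = {
    "how do I bake bread": True,
    "how do I make explosives": False,
    "tell me a story": True,
}

assert all(looks_safe(q) == expected for q, expected in TEST_SUITE.items())
print("all tests pass")                          # performance on the test data
print(looks_safe("How do I make Explosives?"))   # True: the filter is fooled
```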
Rice’s theorem says that there is no algorithm that decides a nontrivial semantic property for arbitrary programs, but it does not say that no nontrivial fact can be proven about a particular program. It also does not say that you can’t reason probabilistically or heuristically about programs in lieu of formal proofs. It just says that, for any algorithm purporting to decide such a property for all possible programs, a program can be constructed that breaks it.
(And if it turns out we can’t formally prove something about a neural net (like alignment), that of course doesn’t mean the negative is definitely true; it could be that we can’t prove alignment for a program and yet it happens to be aligned.)
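To make the scope of the theorem concrete, here is a deliberately toy sketch of that construction, with every name invented for illustration: a program that consults a purported universal checker about its own source and then does the opposite of the verdict. The actual theorem concerns semantic properties and total deciders; this only shows the shape of the diagonalization, not a proof.

```python
# Toy sketch of the diagonalization behind Rice's theorem / the halting problem.
# All names are made up; the "checker" here is a naive stand-in.

def purported_universal_checker(source: str) -> bool:
    """Pretends to decide, for EVERY program, whether it never misbehaves
    (True = certified safe)."""
    return "misbehave" not in source   # naive syntactic guess, for illustration

ADVERSARY_SOURCE = '''
def adversary():
    # Ask the checker about this very program...
    if purported_universal_checker(ADVERSARY_SOURCE):
        misbehave()        # ...and do the opposite of its verdict.
    else:
        behave_nicely()
'''

# The checker can certify (True) the adversary only if the adversary then
# misbehaves, and can condemn it (False) only if it then behaves nicely --
# either way it is wrong about this one program, so it was never universal.
print(purported_universal_checker(ADVERSARY_SOURCE))   # False ("will misbehave"),
# yet a program with that source would take the else-branch and behave nicely.
```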
Proving things about one particular program is not useful in this context. What is needed is to prove properties of all the AIs that may come out of whatever one’s research program is, rejecting those that fail and only accepting those whose safety is assured. This is not usefully different from the premise of Rice’s theorem.
Hoping that the AI happens to be aligned is not even an alignment strategy.
You wrote: "What is needed is to prove properties of all the AIs that may come out of whatever one’s research program is, rejecting those that fail and only accepting those whose safety is assured. This is not usefully different from the premise of Rice’s theorem."
First, we could still prove things about one particular program that comes out of the research program even if for some reason we couldn’t prove things about the programs that come out of that research program in general.
Second, that actually is something Rice’s theorem doesn’t cover. The fact that a program can be constructed that beats any alignment-checking algorithm which purports to work for all possible programs doesn’t mean that one can’t prove something for the subset of programs created by your ML training process, nor does it mean there aren’t probabilistic arguments you can make about those programs’ behavior that do better than chance.
The latter point isn’t pedantry: companies still use endpoint-defense software to guard against malware written adversarially to look as benign as possible, even though a full formal proof of harmlessness in every circumstance is impossible.
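To make the "restricted subset" half of this concrete, here is a hypothetical sketch (the weights and the input box are invented) of establishing a nontrivial property of one particular trained network: plain interval arithmetic yields a sound bound on the output over an entire input region, a statement about all inputs in that region rather than about test points.

```python
# Hypothetical sketch: bound the output of one specific fixed-weight ReLU net
# over the whole box [-1, 1]^2 using interval arithmetic (weights invented).
import numpy as np

W1 = np.array([[1.0, -2.0], [0.5, 1.5]])   # "trained" weights, made up here
b1 = np.array([0.1, -0.3])
W2 = np.array([[1.0, -1.0]])
b2 = np.array([0.2])

def interval_affine(lo, hi, W, b):
    """Sound bounds on W @ x + b when each x[i] lies in [lo[i], hi[i]]."""
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    c = W @ center + b
    r = np.abs(W) @ radius
    return c - r, c + r

def output_bounds(lo, hi):
    lo1, hi1 = interval_affine(lo, hi, W1, b1)
    lo1, hi1 = np.maximum(lo1, 0.0), np.maximum(hi1, 0.0)   # ReLU is monotone
    return interval_affine(lo1, hi1, W2, b2)

lo, hi = output_bounds(np.array([-1.0, -1.0]), np.array([1.0, 1.0]))
print(lo, hi)   # every input in [-1, 1]^2 provably maps into [lo, hi]
```

Real neural-network verifiers refine this idea considerably, but even the crude version is a proof about one particular pile of numbers, not a test.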
Third, even if we were trying to pick an aligned program out of all possible programs, it would still be possible to make an algorithm that reports that it can’t answer the question in the cases it doesn’t know, and to use only those programs for which it could formally verify alignment. As an example, Turing’s original proof doesn’t work if you limit programs to those that are shorter than the halting-checker and thus can’t necessarily embed it.
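As a hedged sketch of what such a "refuses to guess" checker could look like in miniature, the toy below (its whitelist is invented) certifies only programs written in a fragment it fully understands and returns UNKNOWN for everything else, so its accepts are sound even though it is nowhere near complete:

```python
# Hypothetical three-valued checker: ACCEPT only what it can actually analyse.
import ast

ALLOWED = (ast.Module, ast.Assign, ast.Expr, ast.BinOp, ast.Constant,
           ast.Add, ast.Mult, ast.Name, ast.Store, ast.Load)

def conservative_verdict(source: str) -> str:
    """ACCEPT only straight-line arithmetic, which trivially terminates;
    anything it cannot analyse is UNKNOWN, never a false ACCEPT."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return "REJECT"
    if all(isinstance(node, ALLOWED) for node in ast.walk(tree)):
        return "ACCEPT"     # no loops, calls, or branches, so it halts
    return "UNKNOWN"        # possibly fine, but we decline to certify it

print(conservative_verdict("x = 1 + 2 * 3"))           # ACCEPT
print(conservative_verdict("while True:\n    pass"))   # UNKNOWN
```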
You wrote: "Hoping that the AI happens to be aligned is not even an alignment strategy."
Your conclusion was "Therefore no AI built by current methods can be aligned." I’m just explaining why that conclusion in particular is wrong. I agree it is a terrible alignment strategy to just train a DL model and hope for the best.
There are grounds for thinking an AI is aligned that lie between “hoping it is aligned” and “having a formal proof that it is aligned”. For example, we might be able to find sufficiently strong selection theorems, which tell us that certain types of optima tend to be chosen, even if we can’t prove theorems with certainty. We also might be able to find a working ELK (Eliciting Latent Knowledge) strategy that gives us interpretability.
These might not be good strategies, but the statement “Therefore no AI built by current methods can be aligned” seems far too strong.
Two points where I disagree with this argument:
We may not be able to prove something about an arbitrary AGI, but we could interpret the resulting program and prove things about that.
Alignment does not mean provably correct; I would define it as "empirically doesn’t kill us".
Replying to the unstated implication that ML-based alignment is not useful: alignment is not a binary variable. Even if neural networks can’t be aligned in a way which robustly scales to arbitrary levels of capability, weakly aligned, weakly superintelligent systems could still be useful tools as parts of research assistants (see Ought’s and the Alignment Research Center’s work) which allow us to develop a cleaner seed AI with much better verifiability properties.
For a superintelligent AI, alignment might as well be binary, just as for practical purposes you either have a critical mass of U235 or you don’t, notwithstanding the narrow transition region. But can you expand the terms “weakly aligned” and “weakly superintelligent”? Even after searching alignmentforum.org and lesswrong.org for these, their intended meanings are not clear to me. One post says:
weak alignment means: do all of the things any competent AI researcher would obviously do when designing a safe AI.
For instance, you should ask the AI how it would respond in various hypothetical situations, and make sure it gives the “ethically correct” answer as judged by human beings.
My shoulder Eliezer is rolling his eyes at this.
ETA: And here I find:
To summarize, weak alignment, which is what this post is mostly about, would say that “everything will be all right in the end.” Strong alignment, which refers to the transient, would say that “everything will be all right in the end, and the journey there will be all right, too.”
I find it implausible that it is easier to build a machine that might destroy the world but is guaranteed to eventually rebuild it, than to build one that never destroys the world. It is easier to not make an omelette than it is to unmake one.
Agreed that for a post-intelligence-explosion AI, alignment is effectively binary. I do agree with the sharp-left-turn etc. positions, and don’t expect patches and cobbled-together solutions to hold up into the stratosphere.
Weakly aligned—Guided towards the kinds of things we want in ways which don’t have strong guarantees. A central example is InstructGPT, but this also includes most interpretability (unless dramatically more effective than current generation), and what I understand to be Paul’s main approaches.
Weakly superintelligent—Superintelligent in some domains, but has not yet undergone recursive self improvement.
These are probably non-standard terms; I’m very happy to be pointed at existing literature with different ones, which I can adopt.
I am confident Eliezer would roll his eyes; I have read a great deal of his work and his recent debates. I respectfully disagree with his claim that you can’t get useful cognitive work on alignment out of systems which have not yet FOOMed and taken a sharp left turn, based on my understanding of intelligence as babble-and-prune. I don’t expect us to get enough cognitive work out of these systems in time, but it seems like a path with non-zero hope.
It is plausible that AIs unavoidably FOOM before the point that they can contribute, but this seems less and less likely as capabilities advance and we notice we’re not dead.
I come nowhere near agreeing with either of those; FOOM basically requires physics violations, like violating Landauer’s principle or needing arbitrarily small processors. I’m being frank because I suspect that a lot of the doom position requires hard takeoff, and judging from physics and from the history of what happens as AI improves, only the first improvement is a discontinuity; the rest become far smoother and slower. So that’s a big crux I have here.
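For reference, the bound being invoked is Landauer’s limit on the minimum energy needed to erase one bit of information at temperature T; at room temperature this is (standard physics, not taken from the comment above):

$$E_{\min} = k_B T \ln 2 \approx 1.38\times10^{-23}\,\mathrm{J/K} \times 300\,\mathrm{K} \times 0.693 \approx 2.9\times10^{-21}\,\mathrm{J\ per\ bit\ erased}.$$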