Rice’s theorem says that there’s no algorithm for deciding nontrivial facts about arbitrary programs, but it does not say that no nontrivial fact can be proven about a particular program. It also does not say that you can’t reason probabilistically/heuristically about arbitrary programs in lieu of formal proofs. It just says that for any algorithm purporting to decide a specific nontrivial fact about all possible programs, it’s possible to construct a program that breaks it.
(And if it turns out we can’t formally prove something about a neural net (like alignment), then of course that also doesn’t mean the negative claim is definitely true; it could be that we can’t prove alignment for a program that nonetheless happens to be aligned.)
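For concreteness, here is a minimal Python sketch of the diagonal construction the theorem rests on. Everything in it (`purported_checker`, `ADVERSARY_SOURCE`, `behave`, `misbehave`) is an illustrative placeholder, not anyone’s actual proposal: a program that asks the checker about its own source and then does the opposite defeats any checker claimed to work on all programs.

```python
def purported_checker(program_source: str) -> bool:
    """Stand-in for any algorithm claiming to decide a nontrivial
    property ("is aligned", "halts", ...) for ALL programs."""
    return "misbehave()" not in program_source  # toy heuristic

# A program that asks the checker about its own source and then does the
# opposite of whatever the checker predicted. Whichever answer
# purported_checker gives about this source, the program's actual behavior
# contradicts it -- which is all Rice's theorem guarantees.
ADVERSARY_SOURCE = """
if purported_checker(ADVERSARY_SOURCE):
    misbehave()   # checker said "fine", so act badly
else:
    behave()      # checker said "bad", so act fine
"""

print(purported_checker(ADVERSARY_SOURCE))  # the verdict the adversary is built to falsify
```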
Proving things about one particular program is not useful in this context. What is needed is to prove properties of all the AIs that may come out of whatever one’s research program is, rejecting those that fail and only accepting those whose safety is assured. This is not usefully different from the premise of Rice’s theorem.
Hoping that the AI happens to be aligned is not even an alignment strategy.
What is needed is to prove properties of all the AIs that may come out of whatever one’s research program is, rejecting those that fail and only accepting those whose safety is assured. This is not usefully different from the premise of Rice’s theorem.
First, even if for some reason we couldn’t prove things about the programs that come out of that research program in general, we could still prove things about one particular program that comes out of it.
Second, that actually is something Rice’s theorem doesn’t cover. The fact that a program can be constructed to beat any alignment-checking algorithm that purports to work for all possible programs doesn’t mean that one can’t prove something for the subset of programs created by your ML training process, nor does it mean there aren’t probabilistic arguments you can make about those programs’ behavior that do better than chance.
The latter part isn’t being pedantic; companies still use endpoint defense software to guard against malware written adversarially to seem as benign as possible, even though a formal guarantee that covers every circumstance is impossible.
Third, even if we were trying to pick an aligned program out of all possible programs, it’d still be possible to make an algorithm that reports that it can’t answer the question in the cases it can’t decide, and to use only those programs for which it could formally verify alignment. As an example, Turing’s original proof doesn’t work if you limit programs to those that are shorter than the halting-checker and thus can’t necessarily embed it.
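As a toy illustration of that “answer only when sure” idea, here is a sketch of a sound-but-incomplete checker with an explicit “unknown” verdict, plus a gatekeeper that refuses to deploy anything it couldn’t certify. All the names and the marker-string logic are hypothetical placeholders, not a real verification procedure.

```python
from enum import Enum

class Verdict(Enum):
    ALIGNED = "aligned"
    MISALIGNED = "misaligned"
    UNKNOWN = "unknown"   # the verifier is allowed to give up

def partial_verifier(program_source: str) -> Verdict:
    """Hypothetical sound-but-incomplete checker: it never wrongly
    certifies a program, but it may return UNKNOWN. Rice's theorem
    rules out a total decider, not this kind of partial one."""
    # Placeholder logic standing in for an actual proof search.
    if "provably_safe_marker" in program_source:
        return Verdict.ALIGNED
    if "provably_unsafe_marker" in program_source:
        return Verdict.MISALIGNED
    return Verdict.UNKNOWN

def accept_for_deployment(program_source: str) -> bool:
    # Reject MISALIGNED *and* UNKNOWN: only deploy what was actually verified.
    return partial_verifier(program_source) is Verdict.ALIGNED

print(accept_for_deployment("provably_safe_marker ..."))     # True
print(accept_for_deployment("inscrutable blob of weights"))  # False
```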
Hoping that the AI happens to be aligned is not even an alignment strategy.
Your conclusion was “Therefore no AI built by current methods can be aligned.” I’m just explaining why that conclusion in particular is wrong. I agree it is a terrible alignment strategy to just train a DL model and hope for the best.
There are ways of gaining confidence that an AI is aligned that fall between “hoping it is aligned” and “having a formal proof that it is aligned”. For example, we might be able to find sufficiently strong selection theorems, which tell us that certain types of optima tend to be chosen, even if they don’t give certainty. We also might be able to find a working ELK strategy that gives us interpretability.
These might not be good strategies, but the statement “Therefore no AI built by current methods can be aligned” seems far too strong.