What is needed is to prove properties of all the AIs that may come out of whatever one’s research program is, rejecting those that fail and only accepting those whose safety is assured. This is not usefully different from the premise of Rice’s theorem.
First, we could still prove things about one particular program that comes out of the research program even if for some reason we couldn’t prove things about the programs that come out of that research program in general.
Second, that actually is something Rice’s theorem doesn’t cover. The fact that a program can be constructed that beats any alignment-checking algorithm that purports to work for all possible programs doesn’t mean that one can’t prove something for the subset of programs created by your ML training process, nor does it mean there aren’t probabilistic arguments you can make about those programs’ behavior that do better than chance.
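To make the restricted-subset point concrete, here is a minimal, purely illustrative sketch (the class of programs and the property are invented for the example, with loop-free arithmetic expressions standing in for “programs your training process produces”): a question that Rice’s theorem makes undecidable over all programs can be answered outright over a narrow enough class, simply by evaluation.

```python
import ast

# Nodes allowed in the restricted class: pure arithmetic over one variable x.
ALLOWED = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
           ast.Add, ast.Sub, ast.Mult, ast.USub, ast.Name, ast.Load)

def in_restricted_class(src: str) -> bool:
    """True iff src parses as an arithmetic expression built only from ALLOWED nodes."""
    try:
        tree = ast.parse(src, mode="eval")
    except SyntaxError:
        return False
    return all(isinstance(node, ALLOWED) for node in ast.walk(tree))

def never_negative_on(src: str, inputs) -> bool:
    """For this restricted class, the question is settled by direct evaluation,
    since every program in the class halts on every input."""
    if not in_restricted_class(src):
        raise ValueError("outside the restricted class; no answer offered")
    code = compile(ast.parse(src, mode="eval"), "<expr>", "eval")
    return all(eval(code, {"__builtins__": {}}, {"x": x}) >= 0 for x in inputs)

print(never_negative_on("x * x + 1", range(-10, 11)))   # True
print(never_negative_on("x - 5", range(-10, 11)))       # False
```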
The latter part isn’t being pedantic; companies still use endpoint defense software to guard against malware written adversarially to be as nice-seeming as possible, even though a full formal proof is not possible in every circumstance.
Third, even if we were trying to pick an aligned program out of all possible programs, it’d still be possible to make an algorithm that reports that it can’t answer the question in the cases it can’t decide, and then use only those programs whose alignment it could formally verify. As an example, Turing’s original proof doesn’t go through if you limit programs to those shorter than the halting checker, since they can’t necessarily embed it.
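Rice’s theorem and the halting problem only forbid a checker that gives a definite yes/no answer for every program; a sound checker that is allowed to answer “unknown” is fine, and the policy above just treats “unknown” as a rejection. A minimal sketch of that shape, with halting standing in for the property we actually care about and only two toy syntactic rules (this is illustrative, not a real verifier):

```python
import ast
from enum import Enum

class Verdict(Enum):
    VERIFIED = "verified"   # proven to have the property
    REFUTED = "refuted"     # proven not to have it
    UNKNOWN = "unknown"     # the checker admits it cannot tell

def check_halts(src: str) -> Verdict:
    """Sound but deliberately incomplete: only two easy syntactic cases are decided."""
    tree = ast.parse(src)
    loops = [n for n in ast.walk(tree) if isinstance(n, (ast.While, ast.For))]
    if not loops:
        return Verdict.VERIFIED          # straight-line code always halts
    for loop in loops:
        if (isinstance(loop, ast.While)
                and isinstance(loop.test, ast.Constant) and loop.test.value
                and not any(isinstance(n, (ast.Break, ast.Return))
                            for n in ast.walk(loop))):
            return Verdict.REFUTED       # `while True:` with no break/return
    return Verdict.UNKNOWN               # anything else: refuse to guess

def accept_for_deployment(src: str) -> bool:
    """The policy from the text: UNKNOWN is treated exactly like REFUTED."""
    return check_halts(src) is Verdict.VERIFIED

print(check_halts("x = 1 + 2"))                   # Verdict.VERIFIED
print(check_halts("while True:\n    pass"))       # Verdict.REFUTED
print(check_halts("while x > 0:\n    x -= 1"))    # Verdict.UNKNOWN
```

The checker stays sound by being willing to refuse an answer; what it gives up is completeness, which is the same trade an endpoint scanner makes.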
Hoping that the AI happens to be aligned is not even an alignment strategy.
Your conclusion was “Therefore no AI built by current methods can be aligned.” I’m just explaining why that conclusion in particular is wrong. I agree it is a terrible alignment strategy to just train a DL model and hope for the best.