Rice’s Theorem says that AIs can’t determine much from studying AI source code
While reading up on Roko’s Basilisk, I encountered many arguments of the form:
If agent A has the source code to agent B, then A could analyze B's source code to determine whether B has property X.
The viability of such an analysis would make intuitive sense to most people. But from a computer scientist’s point of view: for any non-trivial (semantic) property X and arbitrary B, Rice’s Theorem says that “B has property X” is undecidable, which literally means that there does not exist an A that can perform such an analysis reliably against arbitrary B.
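To be concrete, the statement of Rice’s Theorem I have in mind (my paraphrase of the standard form): if $S$ is any set of partial computable functions that is neither empty nor all of them, then

$$\{\, e \in \mathbb{N} : \varphi_e \in S \,\}$$

is undecidable, where $\varphi_e$ is the partial function computed by program $e$. “B has property X” looks like exactly this kind of question, with B’s source code playing the role of $e$ (assuming X is a property of B’s behavior rather than of the literal text of its code).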
In the case of Roko’s Basilisk in particular: doesn’t this prevent one agent from blackmailing another, since an agent can’t actually determine what form of decision theory the other agent uses? Even if the agents are clones, the problem still exists: by Rice’s Theorem, an agent can’t even analyze its own source code and reliably determine what form of decision theory it uses.
This seems like such a simple logical issue that I feel like I must be missing something. Because arguments of this form (“Agent A analyzes source code to B and determines that B has X”) appear to be widespread when discussing complex decision theory hypotheticals and AI. What am I missing?
Rice’s Theorem only proves that there is no single algorithm that decides, for arbitrary programs, whether a given non-trivial property holds; it doesn’t stop you from proving that property for specific programs. (For some intuition: it’s clearly possible to write programs that provably halt, like a program consisting of “return 0” and nothing else, even though the halting problem is undecidable in general.)
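To make that concrete, here’s a toy sketch (my own illustration, nothing standard) of a “partial” halting checker in Python: it answers “halts” only for a trivially safe class of programs, namely straight-line code with no loops or calls, and refuses to answer otherwise. Rice’s Theorem rules out a checker that is correct on every program; it says nothing against one that is correct whenever it commits to an answer.

```python
import ast
import inspect
from typing import Optional

def provably_halts(func) -> Optional[bool]:
    """Return True if `func` obviously halts, None if this checker can't tell."""
    tree = ast.parse(inspect.getsource(func))
    for node in ast.walk(tree):
        # Loops (including comprehensions) and calls could, in general, run
        # forever, so the checker refuses to guess and answers "unknown".
        if isinstance(node, (ast.While, ast.For, ast.AsyncFor,
                             ast.comprehension, ast.Call)):
            return None
    return True  # straight-line code with no loops or calls always terminates

def returns_zero():
    return 0

def collatz(n):
    # Nobody knows whether this halts for every positive n (the Collatz problem).
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
    return n

print(provably_halts(returns_zero))  # True: a machine-checked "this halts"
print(provably_halts(collatz))       # None: no verdict either way
```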
Expanding on this, there are several programming languages (Idris, Coq, etc.) whose type systems, together with their totality checkers, ensure that every accepted program will halt when it’s run. One way to view type checking is as an automated search for a proof that your program is well-typed (with a type error as a counter-example). In a language like Idris or Coq, a program being accepted implies that it halts. So machine-generated proofs that programs halt aren’t just theoretically possible, they’re used extensively by some languages.
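For a feel of how a purely mechanical termination argument can work, here is a deliberately crude sketch in Python (my own toy, not how the real Coq/Idris machinery works): it accepts a one-argument recursive function only if the body opens with an `if n <= 0` base case and every recursive call is literally on `n - 1`. Real totality checkers operate on typed terms, accept far more programs, and close loopholes this toy ignores (such as reassigning the parameter), but the flavor is similar: a syntactic argument that the recursion must bottom out.

```python
import ast
import inspect

def toy_totality_check(func) -> bool:
    """Accept `func` only if its recursion is visibly decreasing (see below)."""
    fn = ast.parse(inspect.getsource(func)).body[0]
    if len(fn.args.args) != 1:
        return False
    param = fn.args.args[0].arg

    # Base case: the first statement must be `if <param> <= 0:` and its body
    # must not contain any calls (in particular, no recursive call).
    guard = fn.body[0]
    base_ok = (isinstance(guard, ast.If)
               and isinstance(guard.test, ast.Compare)
               and len(guard.test.ops) == 1
               and isinstance(guard.test.left, ast.Name)
               and guard.test.left.id == param
               and isinstance(guard.test.ops[0], ast.LtE)
               and isinstance(guard.test.comparators[0], ast.Constant)
               and guard.test.comparators[0].value == 0
               and not any(isinstance(n, ast.Call)
                           for stmt in guard.body for n in ast.walk(stmt)))
    if not base_ok:
        return False

    # Recursive step: every self-call must be on exactly `<param> - 1`, so for
    # ordinary integer inputs the argument strictly decreases until the base
    # case fires.  (A real checker would also rule out tricks like reassigning
    # the parameter; this toy doesn't bother.)
    for node in ast.walk(fn):
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id == fn.name):
            arg = node.args[0] if len(node.args) == 1 else None
            if not (isinstance(arg, ast.BinOp) and isinstance(arg.op, ast.Sub)
                    and isinstance(arg.left, ast.Name) and arg.left.id == param
                    and isinstance(arg.right, ast.Constant)
                    and arg.right.value == 1):
                return False
    return True

def factorial(n):
    if n <= 0:
        return 1
    return n * factorial(n - 1)

def collatz_steps(n):
    if n <= 0:
        return 0
    return 1 + collatz_steps(3 * n + 1 if n % 2 else n // 2)

print(toy_totality_check(factorial))      # True: the argument visibly decreases
print(toy_totality_check(collatz_steps))  # False: no syntactic decrease
```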
I co-wrote a detailed response here:
https://www.cser.ac.uk/news/response-superintelligence-contained/
Essentially, this type of reasoning proves too much, since it implies we cannot show any properties whatsoever of any program, which is clearly false.
You’re missing the fact that Rice’s theorem rests on Turing’s proof of the halting problem, and that proof only goes through by nerfing your solver so that it can never resolve the self-referential paradox it’s fed.