Although the argument you outline might be an argument against ever fully trusting tests (usually called “evals” on this site) that this or that AI is aligned, alignment researchers have other tools in their toolbox besides running tests or evals.
It would take a long time to explain these tools, particularly to someone unfamiliar with software development or a related field like digital-electronics design. People make entire careers out of studying how to build reliable software systems (and reliable digital designs).
The space shuttle was steered by changing the direction in which its rocket nozzles pointed relative to the vehicle as a whole. If at any point in flight a nozzle had pointed even a few degrees away from where it should, the shuttle would have been lost and everyone aboard would have died. The aiming of those nozzles was under software control. How did the programmers at NASA make that software reliable? Not merely through testing!
Those NASA programmers relied on their usual tools (basically an engineering-change-order culture) and did not need the more advanced tool called “formal verification”. Intel, in contrast, turned to formal verification to make sure its microprocessors would not contain another flaw necessitating an expensive recall, after spending 475 million dollars in 1994 on the famous product recall prompted by the so-called Pentium FDIV bug.
Note that FDIV refers to division of floating-point numbers, and that it is not possible in one human lifetime to test all possible dividend-and-divisor pairs to ensure that a divider circuit operates correctly. In other words, the “impossible even theoretically” argument you outline would have predicted that it is impossible to ensure the correct operation of even something as simple as a divider implemented in silicon; yet in the 30 years since the 1994 recall, Intel has avoided another major recall of any of its microprocessors caused by a mistake in its digital design.
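To make the scale concrete, here is a rough back-of-the-envelope sketch (my own illustration, not Intel's analysis; it assumes 64-bit operands and an optimistic rate of a billion tests per second):

```rust
// Rough estimate of how long exhaustive testing of a floating-point divider
// would take. The operand width and test rate are assumptions for this
// sketch, not a description of the actual Pentium hardware or test rigs.
fn main() {
    let patterns_per_operand = 2f64.powi(64); // distinct 64-bit bit patterns
    let pairs = patterns_per_operand * patterns_per_operand; // ~3.4e38 dividend/divisor pairs
    let tests_per_second = 1e9; // optimistic: one billion tests per second
    let seconds_per_year = 3.156e7;
    let years = pairs / tests_per_second / seconds_per_year;
    println!("exhaustive test would take about {years:.1e} years"); // roughly 1e22 years
}
```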
Memory-safety errors (e.g., use-after-free errors) are an important source of bugs and security vulnerabilities in software, and for decades testing has been an important way to find and eliminate them (Valgrind probably being the best-known framework for such testing). But the celebrated new programming language Rust completely eliminates the need to test for this class of programming errors: it replaces testing with a more reliable methodology built around a tool called a “borrow checker”. I am not asserting that a borrow checker will help people create an aligned super-intelligence; I am merely pointing at Rust and its borrow checker as an example of a methodology that is superior to testing for ensuring some desirable property (e.g., the absence of use-after-free errors that an attacker might be able to exploit) of a complex software artifact.
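Here is a minimal sketch (my own, purely for illustration) of the kind of mistake the borrow checker rules out before the program ever runs:

```rust
// In C or C++ the analogous code compiles and becomes a use-after-free:
// take a pointer into a buffer, then let the buffer be reallocated.
// In Rust the compiler rejects it, so no test suite is needed to catch it.
fn main() {
    let mut v = vec![1, 2, 3]; // `mut` is only needed for the commented-out push below
    let first = &v[0];         // shared borrow into the vector's buffer
    // v.push(4);              // uncommenting this yields error[E0502]: cannot borrow
                               // `v` as mutable because it is also borrowed as
                               // immutable -- a push may reallocate the buffer and
                               // leave `first` dangling
    println!("first element is {first}");
}
```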
In summary, aligning a superhuman AI is humanly possible given sufficiently careful and talented people. The argument for stopping frontier AI research (or pausing it for 100 years) depends on considerations other than the “impossible even theoretically” argument you outline.
Methodologies that are superior to testing take time to develop. For example, the need for a better methodology to prevent memory-safety errors was recognized in the 1970s. Rust and its borrow checker are the result of a line of investigation inspired by a seminal paper published in 1987, but Rust has become a realistic option for most programming projects only within roughly the last 10 years. And an alignment methodology that continues to be reliable even when the AI becomes super-humanly capable is a much taller order than one that prevents use-after-free errors and related memory-safety errors.