James Mickens is writing comedy. He worked in distributed systems. A “distributed system” is another way to say “a scenario in which you absolutely will have to use software to deal with your broken hardware”. I can 100% guarantee that this was written with his tongue in his cheek.
The modern world is built on software that works around HW failures.
You likely have ECC ram in your computer.
There are checksums along every type of data transfer (Ethernet frame check sequences, IP header checksums, UDP datagram checksums, ICMP checksums, eMMC checksums, cryptographic auth for tokens or certificates, etc).
An individual SSD or HDD has algorithms for detecting and working around failed blocks / sectors in HW.
There are fully redundant processors in safety-critical applications using techniques like active-standby, active-active, or some manner of voting for fault tolerance.
In anything that involves HW sensors, there are algorithms like the extended Kalman filter for combining sensor readings into a single consistent view of reality, and stapled to those are algorithms for determining when sensors are invalid because they’ve railed high, railed low, or otherwise failed in a manner that SW can detect.
Your phone’s WiFi works because the algorithm used for the radio is constantly working around dropouts and reconnecting to new sources as needed.
We can read this post because it’s sent using TCP and is automatically retransmitted as many times as needed until it’s been ACK’d successfully.
We can play multiplayer video games because they implement eventually consistent protocols on top of UDP.
Almost all computer applications implement some form of error handling + retry logic for pretty much anything involving I/O (file operations, network operations, user input) because sometimes things fail, and almost always, retrying the thing that failed will work.
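A minimal sketch of that retry pattern in Python (the backoff constants and the choice of `OSError` as the "transient failure" type are arbitrary illustrations):

```python
import random
import time

def retry(operation, attempts=5, base_delay=0.1):
    """Run a fallible I/O operation, retrying with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Back off exponentially, with jitter to avoid retry storms.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

The jitter matters in practice: if many clients retry on the same schedule, the retries themselves can overload the recovering service.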
Large data centers have hundreds of thousands of SSDs and they are constantly failing—why doesn’t Google fall over? Because SW + HW algorithms like RAID compensate for drives dying all of the time.
If we use one AI to oversee another AI, and something goes wrong, that’s not a recoverable error; we’re using AI assistance in the first place because we can’t notice the relevant problems without it. If two AIs debate each other in hopes of generating a good plan for a human, and something goes wrong, that’s not a recoverable error; it’s the AIs themselves which we depend on to notice problems. If we use one maybe-somewhat-aligned AI to build another, and something goes wrong, that’s not a recoverable error; if we had better ways to detect misalignment in the child we’d already have used them on the parent.
Replace “AI” with “computer” in this paragraph and it is obviously wrong because every example here is under-specified. There is a dearth of knowledge on this forum of anything resembling traditional systems engineering or software system safety and it shows in this thread and in the previous thread you made about air conditioners. I commented as such here.
“If we use one computer to oversee another computer, and something goes wrong, that’s not a recoverable error; we’re using computer assistance in the first place because we can’t notice the relevant problems without it.”
Here are some examples off the top of my head where we use one computer to oversee another computer:
It’s common to have one computer manage a pool of workers where each worker is another computer and workers may fail. The computer doing the management is able to detect a stalled or crashed worker, power cycle the hardware, and then resubmit the work. Depending on the criticality of this process, the “manager” might actually be multiple computers that work synchronously. The programming language Erlang is designed for this exact use-case—distributed, fault-tolerant SW applications in contexts where I/O is fallible and it’s unacceptable for the program to crash.
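The supervisor-side bookkeeping can be sketched in a few lines of Python (the heartbeat protocol and timeout here are illustrative assumptions, not Erlang's actual supervision mechanism):

```python
import time

class WorkerMonitor:
    """Supervisor-side stall detection: each worker periodically reports a
    heartbeat; workers silent for longer than `timeout` get flagged so the
    manager can power-cycle them and resubmit their work."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.heartbeats = {}  # worker_id -> last heartbeat time

    def heartbeat(self, worker_id):
        self.heartbeats[worker_id] = time.monotonic()

    def stalled_workers(self):
        now = time.monotonic()
        return [w for w, t in self.heartbeats.items()
                if now - t > self.timeout]
```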
We often use one computer program to calculate some type of path or optimal plan, where the program doing the calculating is very complicated to understand, and then use a 2nd computer program to validate the outputs from the 1st. Why two programs? Because the first is inscrutable and difficult to explain, while the 2nd reads like straightforward requirements in English. In other words, it is often far easier to check a solution than it is to create one. The mathematically inclined will recognize this as the intuition behind P vs. NP: if P != NP, verifying a solution can be fundamentally easier than finding it.
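As a toy illustration of checking being easier than creating: validating a claimed path through a graph is a few lines, while finding an optimal path requires a full planner (the adjacency-set graph representation is just an assumption for the sketch):

```python
def is_valid_path(graph, path, start, goal):
    """Check a claimed path: endpoints must match and every hop must be a
    real edge. This checker reads like the requirements; the planner that
    *produced* the path can be arbitrarily complicated."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    return all(b in graph.get(a, ()) for a, b in zip(path, path[1:]))
```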
It’s common in safety-critical applications to have a fail-safe or backup using a significantly less complicated architecture—e.g. you might use <complicated system> to do <some complex task>, but for a fail-safe like “power off everything”, you might have a tiny microprocessor sitting in-line with the main power rail serving as a glorified switch. So normally the <complicated system> is driving the show, but if that starts to go sideways, the tiny microprocessor can shut it down.
Almost all microprocessors have a “watchdog” built into them. A watchdog is a secondary processor that will reset the primary processor if the primary is non-responsive. Have you ever seen your Android phone mysteriously reboot when the UI locks up? That was a watchdog.
The “watchdog” concept is even used in pure SW contexts, e.g. when the Android OS kills an application on your phone because it has frozen, that’s one computer program (the OS) overseeing another (the application). Ditto for Windows & the Task Manager.
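A minimal pure-SW watchdog sketch in Python (the `pet`/`on_expire` names are invented for illustration; real watchdogs are usually dedicated HW timers, not threads):

```python
import threading
import time

class Watchdog:
    """The watched task must call pet() periodically; if it goes silent
    for longer than `timeout` seconds, the recovery action fires."""

    def __init__(self, timeout, on_expire):
        self.timeout = timeout
        self.on_expire = on_expire  # e.g. reset / restart the task
        self._last_pet = time.monotonic()
        self._stop = threading.Event()
        threading.Thread(target=self._watch, daemon=True).start()

    def pet(self):
        """Called by the watched task to prove it is still alive."""
        self._last_pet = time.monotonic()

    def _watch(self):
        # Poll at a fraction of the timeout; Event.wait doubles as sleep.
        while not self._stop.wait(self.timeout / 10):
            if time.monotonic() - self._last_pet > self.timeout:
                self.on_expire()  # task is hung: trigger recovery
                self._last_pet = time.monotonic()

    def stop(self):
        self._stop.set()
```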
We often use “voting” where we run the same SW on 3 or more systems and then only output to some downstream hazard if all 3 systems agree. If they don’t agree, we can fail-safe or try to recover, e.g. by power-cycling whichever system was out-of-family, or both—first try to recover, then fail-safe if that didn’t work. This is done by running code in “lockstep” on synchronized inputs, very similar to how old multiplayer RTS games used to do networking.
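The 2-of-3 voting logic itself is tiny; the engineering effort goes into synchronizing the inputs. A sketch (the `fail_safe` hook is a stand-in for whatever the system's safe state is):

```python
from collections import Counter

def vote(outputs, fail_safe):
    """Majority vote over redundant channel outputs. Returns the agreed
    value when at least 2 of 3 channels match; otherwise refuses to drive
    the downstream hazard and invokes the fail-safe instead."""
    value, count = Counter(outputs).most_common(1)[0]
    if count >= 2:
        return value      # at most one channel is out-of-family
    return fail_safe()    # total disagreement: go to the safe state
```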
You can buy self-checking processors that implement lockstep comparison between 2 internal cores so that whenever instructions are executed, you know that the execution occurred identically across both cores.
These aren’t cherry-picked. This is the bread & butter of systems safety. We build complex, safe systems by identifying failure modes and then using redundant systems to either tolerate faults or to fail-safe. By focusing on the actual system, and the actual failure modes, and by not getting stuck with our head in the clouds considering a set of “all possible hypothetical systems”, it is possible to design & implement robust, reliable solutions in reality.
To claim not just that this is impossible, but that it is foolhardy to even try, is the exact opposite of a safety-critical mindset.
I agree that the SW/HW analogy is not a good analogy for AGI safety (I think security is actually a better analogy), but I would like to present a defence of the idea that normal systems reliability engineering is not enough for alignment (this is not necessarily a defence of any of the analogies/claims in the OP).
Systems safety engineering leans heavily on the idea that failures happen randomly and (mostly) independently, so that enough failures happening together by coincidence to break the guarantees of the system is rare. That is:
RAID is based on the assumption that hard drive failures happen mostly independently, so that the probability of too many drives failing at once is sufficiently low. Even in practice this assumption becomes a problem, because a) drives purchased in the same batch will have correlated failures and b) rebuilding an array puts strain on the remaining drives; people have to plan around both by adding more margin of error.
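The single-drive-loss recovery that RAID 4/5 relies on is just XOR parity, which is exactly why it can only tolerate the number of simultaneous failures it was designed for; a sketch:

```python
def parity(blocks):
    """XOR parity over equal-length data blocks, as in RAID 4/5."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def rebuild(surviving_blocks, parity_block):
    """Reconstruct the one lost block: XOR of survivors plus parity.
    Lose two blocks at once (correlated failures) and this cannot help."""
    return parity(surviving_blocks + [parity_block])
```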
Checksums and ECC are robust against the occasional bitflip. This is because occasional bitflips are mostly random, and a bitflip that just happens to leave the checksum valid is very rare. Checksums are not robust against someone maliciously changing your data in transit; you need signatures for that. Even time-correlated runs of flips can create a problem for naive schemes and burn through the margin of error faster than you’d otherwise expect.
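The checksum-vs-adversary distinction can be demonstrated with Python's stdlib (CRC-32 for the checksum, an HMAC standing in for the keyed/signed case; the messages and key are made up):

```python
import hashlib
import hmac
import zlib

def crc_check(message, checksum):
    """Integrity check against *random* corruption."""
    return zlib.crc32(message) == checksum

def mac_check(message, tag, key):
    """Integrity + authenticity check against an *adversary*."""
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

msg = b"transfer $10 to alice"
crc = zlib.crc32(msg)

# A random bitflip in transit is caught (CRC-32 detects all 1-bit errors):
flipped = bytes([msg[0] ^ 0x01]) + msg[1:]
assert not crc_check(flipped, crc)

# But an active attacker rewrites the message AND recomputes the CRC,
# so the receiver's check passes:
tampered = b"transfer $10 to mallory"
assert crc_check(tampered, zlib.crc32(tampered))

# The attacker cannot recompute a valid HMAC without the shared key:
key = b"shared secret"
forged = hmac.new(b"attacker guess", tampered, hashlib.sha256).digest()
assert not mac_check(tampered, forged, key)
```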
Voting between multiple systems assumes that the systems are all honest and just occasionally suffer transient hardware failures. Clean-room reimplementations try to eliminate correlated bugs, but they still don’t protect against correlated bad behaviour across all of the systems due to issues with your spec.
My point here is that once your failures stop being random and independent, you leave the realm of safety engineering and enter the realm of security (and security against extremely powerful actors is really really hard). I argue that AGI alignment is much more like the latter, because we don’t expect AGIs to fail in random ways, but rather we expect them to intelligently steer the world into directions we don’t want. AGI induced failure looks like things that should have been impossible when multiplying out the probabilities somehow happening regardless.
In particular, relying on independent AGIs not being correlated with each other is an extremely dangerous assumption: AGIs can coordinate even without communication, alignment is a very narrow target that’s hard to hit, and a parliament of misaligned AGIs is definitely not going to end well for us.
If two AIs debate each other in hopes of generating a good plan for a human, and something goes wrong, that’s not a recoverable error;
I alluded to this above in many examples, but let’s just do a theoretical calculation as well so it’s not just anecdotes.
Suppose we have some AI “Foo” that has some probability of failure, or P(Failure of Foo).
Then the overall probability of the system containing Foo failing is P(Failure of System) == P(Failure of Foo).
On the other hand, suppose we have some AI “Bar” whose goal is to detect when AI “Foo” has failed, e.g. when Foo erroneously creates a plan that would harm humans or attempts to deceive them.
If Foo’s and Bar’s failures are independent, we can now calculate the new likelihood of P(Failure of System) == P(Failure of Foo) * P(Failure of Bar), where P(Failure of Bar) is the likelihood that either Bar failed to detect the issue with Foo, or that Bar successfully detected the issue with Foo but failed to notify us.
These probabilities can be related in some way, but they don’t have to be. It is possible to drastically reduce the probability of a system failing by adding components within that system, even if those new components have chances of failure themselves.
In particular, so long as the requirement allocated to Bar is narrow enough, we can make Bar more reliable than Foo, and then lower the overall chance of the system failing. One way this works is by limiting Bar’s functionalities so that if Bar failed, in isolation of Foo failing, the system is unaffected. In the context of fault tolerance, we’d refer to that as a one-fault tolerant system. We can tolerate Foo failing—Bar will catch it. And we can tolerate Bar failing—it doesn’t impact the system’s performance. We only have an issue if Foo failed and then Bar subsequently also failed.
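A back-of-the-envelope version of that calculation, with made-up numbers purely for illustration:

```python
# Illustrative probabilities only; real failure rates must be measured.
p_foo = 1e-3   # P(Failure of Foo): a harmful or deceptive plan slips out
p_bar = 1e-2   # P(Failure of Bar): the monitor misses it, or detects it
               # but fails to notify us

# Valid only if the two failures are independent:
p_system = p_foo * p_bar   # two orders of magnitude better than Foo alone

# If the failures are perfectly correlated (Bar fails exactly when Foo
# does), the monitor buys nothing and the system failure rate is p_foo.
p_system_correlated = p_foo
```

The correlated case is exactly the objection raised above about AGIs: the product formula is only as good as the independence assumption behind it.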