I agree that the SW/HW analogy is not a good analogy for AGI safety (I think security is actually a better analogy), but I would like to present a defence of the idea that normal systems reliability engineering is not enough for alignment (this is not necessarily a defence of any of the analogies/claims in the OP).
Systems safety engineering leans heavily on the idea that failures happen randomly and (mostly) independently, so that the chance of enough failures coinciding to break the system’s guarantees is acceptably small. That is:
RAID is based on the assumption that hard drive failures happen mostly independently, so that the probability of too many drives failing at once is sufficiently low. Even so, in practice this assumption becomes a problem, because a) drives purchased in the same batch have correlated failures and b) rebuilding an array puts strain on the remaining drives, and people have to plan around this by adding extra margin of error.
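To put rough numbers on how much a correlated batch eats into that margin, here’s a toy back-of-the-envelope sketch (the failure rates and the correlation model are made up purely for illustration, not real drive statistics):

```python
# Toy numbers, purely for illustration -- not real drive failure statistics.
from math import comb

def p_too_many_failures(n_drives, p_fail, tolerated):
    """P(more than `tolerated` drives fail), assuming independent failures."""
    return sum(
        comb(n_drives, k) * p_fail**k * (1 - p_fail)**(n_drives - k)
        for k in range(tolerated + 1, n_drives + 1)
    )

# A RAID6-like array: 8 drives, survives up to 2 failures in a rebuild window.
p_independent = p_too_many_failures(8, 0.01, 2)

# Crude correlation model: once one same-batch drive dies, assume the survivors'
# failure probability jumps from 1% to 5% (rebuild stress + shared defects).
p_any_first_failure = 1 - (1 - 0.01) ** 8
p_correlated = p_any_first_failure * p_too_many_failures(7, 0.05, 1)

print(f"independent drives: {p_independent:.1e}")   # ~5e-05
print(f"correlated batch:   {p_correlated:.1e}")    # ~3e-03, roughly 60x worse
```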
Checksums and ECC are robust against the occasional bitflip. This is because occasional bitflips are mostly random, and a flip that just happens to leave the checksum valid is very rare. Checksums are not robust against someone coming in and maliciously changing your data in transit; you need signatures for that. Even time-correlated runs of flips can create a problem for naive schemes and burn through the margin of error faster than you’d otherwise expect.
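A minimal sketch of the difference, using Python’s zlib and hmac standard-library modules (the key and messages are placeholders, and a keyed MAC stands in for a full signature scheme here): both checks catch a random bitflip, but only the keyed check survives an attacker who rewrites the data and recomputes the checksum.

```python
import zlib, hmac, hashlib

key = b"shared-secret"                 # known only to sender and receiver
original = b"transfer 10 to alice"

def crc_check(msg, checksum):
    return zlib.crc32(msg) == checksum

def mac_check(msg, tag):
    return hmac.compare_digest(hmac.new(key, msg, hashlib.sha256).digest(), tag)

# Sender transmits the message along with a CRC32 checksum and a keyed MAC.
crc = zlib.crc32(original)
tag = hmac.new(key, original, hashlib.sha256).digest()

# A random bitflip in transit: both checks catch it.
flipped = bytes([original[0] ^ 0x01]) + original[1:]
print(crc_check(flipped, crc), mac_check(flipped, tag))        # False False

# A malicious rewrite: the attacker recomputes the CRC (no secret needed),
# but cannot recompute the MAC without the key.
tampered = b"transfer 999 to eve"
print(crc_check(tampered, zlib.crc32(tampered)))               # True  -- CRC fooled
print(mac_check(tampered,
                hmac.new(b"guess", tampered, hashlib.sha256).digest()))  # False
```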
Voting between multiple systems assumes that the systems are all honest and just occasionally suffer transient hardware failures. Clean-room reimplementations try to eliminate the correlations due to shared bugs, but they still don’t protect against correlated bad behaviour across all of the systems due to issues with your spec.
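A toy sketch of that last failure mode (the spec, the rounding rule, and the implementations are all invented for illustration): majority voting masks the one replica with a random hardware fault, but when every independently written replica faithfully implements the same flawed spec, the vote is unanimous and wrong.

```python
# Suppose the spec should have said "round altitude to the nearest 100m" but
# instead says "round down". All three clean-room implementations follow it.
from collections import Counter

def impl_a(altitude_m):
    return (altitude_m // 100) * 100

def impl_b(altitude_m):
    return int(altitude_m / 100) * 100

def impl_c(altitude_m):
    return altitude_m - (altitude_m % 100)

def vote(*results):
    winner, count = Counter(results).most_common(1)[0]
    return winner if count >= 2 else None   # 2-of-3 majority

# A random hardware fault in one replica gets outvoted...
print(vote(impl_a(1250), 999999, impl_c(1250)))         # 1200: bad replica masked

# ...but the shared spec bug produces confident, unanimous, wrong output.
print(vote(impl_a(1250), impl_b(1250), impl_c(1250)))   # 1200 from all three
```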
My point here is that once your failures stop being random and independent, you leave the realm of safety engineering and enter the realm of security (and security against extremely powerful actors is really, really hard). I argue that AGI alignment is much more like the latter, because we don’t expect AGIs to fail in random ways; rather, we expect them to intelligently steer the world in directions we don’t want. AGI-induced failure looks like things that should have been impossible when multiplying out the probabilities somehow happening regardless.
In particular, relying on independent AGIs not being correlated with each other is an extremely dangerous assumption: AGIs can coordinate even without communication, alignment is a very narrow target that’s hard to hit, and a parliament of misaligned AGIs is definitely not going to end well for us.
I like your explanation of why normal reliability engineering is not enough, but I’ll flag that security against powerful actors is probably easier than LW in general portrays. I think computer security as a culture is prone to way overestimating the difficulty of security, because of incentive issues, because people don’t remember the times nothing went wrong, and more generally because side-channels are arguably much more limited than people think they are (precisely because they rely on very specific physical details, rather than attacking the algorithm itself).
A non-trivial portion of my optimism about surviving AGI comes from the view that security, while difficult, is not unreasonably difficult, and that partial successes matter from a security standpoint.
Link below:
https://www.lesswrong.com/posts/xsB3dDg5ubqnT7nsn/poc-or-or-gtfo-culture-as-partial-antidote-to-alignment