Many systems get “spot-checked” by artificially forcing them into a rare but important-to-correctly-handle stressed state under controlled conditions where more monitoring and recovery resources are available (or where the stakes are lower) than would be the case during a real instance of the stressed state.
These serve to practice procedures, yes, but they also serve to evaluate whether the procedures would be followed correctly in a crisis, and whether the procedures even work.
Drills
Fire/tornado/earthquake/nuclear-attack drills
Military drills (the kind where you tell everyone to get to battle stations, not the useless marching around in formation kind)
Large cloud computing companies I’ve worked at need to stay online in the face of loss of a single computer, or a single datacenter. They periodically check to see that these failures are survivable by directly powering off computers, disconnecting entire datacenters from the network, or simply running through a datacenter failover procedure beginning to end to check that it works.
Stress tests
Many systems get “spot-checked” by artificially forcing them into a rare but important-to-correctly-handle stressed state under controlled conditions where more monitoring and recovery resources are available (or where the stakes are lower) than would be the case during a real instance of the stressed state.
These serve to practice procedures, yes, but they also serve to evaluate whether the procedures would be followed correctly in a crisis, and whether the procedures even work.
Drills
Fire/tornado/earthquake/nuclear-attack drills
Military drills (the kind where you tell everyone to get to battle stations, not the useless marching around in formation kind)
Large cloud computing companies I’ve worked at need to stay online in the face of loss of a single computer, or a single datacenter. They periodically check to see that these failures are survivable by directly powering off computers, disconnecting entire datacenters from the network, or simply running through a datacenter failover procedure beginning to end to check that it works.
https://en.wikipedia.org/wiki/Stress_test_(financial)