Two kinds of cascading catastrophes one could imagine in software systems...
A codebase is such a spaghetti tower (and/or the coding practices are so bad) that fixing a bug introduces, on average, more than one new bug. Software engineers toil away fixing bugs, making the software steadily buggier over time.
Software services managed by different groups have dependencies: A calls B, B calls C, etc. Eventually, the dependency graph becomes connected enough and loopy enough that a sufficiently large chunk going down brings down most of the rest, and nothing can come back up until everything else comes back up (i.e. there’s circular dependence, a deadlock).
How could we measure how “close” we are to one of these scenarios going supercritical?
For the first, we’d need bug attribution, i.e. tracking which change introduced each bug. Assuming most bugs are found and attributed within some reasonable window, we could then estimate how many new bugs each bug fix introduces, on average; if that number is above one, the bug-fixing process is supercritical.
(I could also imagine a similar technique for e.g. medicine: check how many new problems result from each treatment of a problem.)
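Back to the software version: here’s a minimal sketch in Python of that estimate, under the assumption that the bug tracker records, for each bug, the change that introduced it and (if fixed) the change that fixed it. The record format and names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Bug:
    bug_id: str
    introduced_by: str           # id of the change that introduced this bug
    fixed_by: Optional[str]      # id of the change that fixed it, if any

def bugs_per_fix(bugs: list[Bug]) -> float:
    """Average number of new bugs attributed to each bug-fix change.

    A value at or above 1 means bug-fixing creates bugs faster than it
    removes them, i.e. the process is supercritical.
    """
    fixes = {b.fixed_by for b in bugs if b.fixed_by is not None}
    if not fixes:
        return 0.0
    # Count bugs whose introducing change was itself a bug fix.
    caused_by_fixes = sum(1 for b in bugs if b.introduced_by in fixes)
    return caused_by_fixes / len(fixes)

# Toy data: 3 fixes, 4 bugs traced back to them -> ratio ~1.33 (supercritical).
bugs = [
    Bug("B1", introduced_by="feature-1", fixed_by="fix-1"),
    Bug("B2", introduced_by="fix-1",     fixed_by="fix-2"),
    Bug("B3", introduced_by="fix-1",     fixed_by="fix-3"),
    Bug("B4", introduced_by="fix-2",     fixed_by=None),
    Bug("B5", introduced_by="fix-3",     fixed_by=None),
]
print(bugs_per_fix(bugs))  # 1.333...
```

Tracking how that ratio moves over time would give a rough early-warning signal for the first scenario.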
For the second, we’d need visibility into codebases maintained by different groups, which would be easy within a company but much harder across companies. In principle, within a company, some kind of static analysis tool could find all the API calls between services, map out the whole graph, and then calculate which “core” pieces could be involved in a catastrophic failure (e.g. services sitting inside large dependency cycles).
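Here’s a hedged sketch of that graph-analysis step, assuming networkx is available; the service names and call edges are hypothetical, and a real tool would extract the edges from the codebases rather than hardcoding them. The strongly connected components with more than one member are the groups of services that all transitively depend on each other, i.e. the candidate deadlock cores, and anything that can reach such a core is exposed to the cascade.

```python
import networkx as nx

# Hypothetical cross-service call graph: an edge A -> B means "A calls B".
# In practice these edges would come from static analysis of each codebase.
calls = [
    ("web", "auth"), ("web", "orders"),
    ("orders", "billing"), ("billing", "auth"),
    ("auth", "audit"), ("audit", "billing"),   # cycle: auth -> audit -> billing -> auth
    ("search", "web"),
]
G = nx.DiGraph(calls)

# Strongly connected components with more than one node are groups of services
# that all (transitively) depend on each other: the candidate deadlock cores.
cores = [scc for scc in nx.strongly_connected_components(G) if len(scc) > 1]

for core in cores:
    # Every service that can reach the core is also exposed: if the core goes
    # down and can't restart without itself, these callers stay down too.
    exposed = set()
    for svc in core:
        exposed |= nx.ancestors(G, svc)
    print(f"core: {sorted(core)}, additionally exposed: {sorted(exposed - core)}")
```

The size of the largest core, plus the fraction of services that can reach it, would then be a natural measure of how “close” the system is to the second scenario.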
(Note that this problem could be mostly avoided by intentionally taking down services occasionally, so engineers are forced to build around that possibility. I don’t think any analogue of this approach would work for the first failure type, though.)
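For what it’s worth, a toy sketch of that intentional-outage practice; the service list and the kill_service hook are hypothetical, and real tooling would call into the actual orchestration layer rather than just printing.

```python
import random
import time

SERVICES = ["auth", "billing", "orders", "search"]  # hypothetical service registry

def kill_service(name: str) -> None:
    # Placeholder: in practice this would call the orchestrator's API.
    print(f"taking down {name} for a scheduled outage drill")

def run_outage_drills(interval_seconds: float, rounds: int) -> None:
    """Periodically take down one randomly chosen service, so callers are
    forced to handle that dependency being unavailable."""
    for _ in range(rounds):
        kill_service(random.choice(SERVICES))
        time.sleep(interval_seconds)

run_outage_drills(interval_seconds=1.0, rounds=3)
```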