On a separate note, I think that incidents are the opposite of this: they require that people go through and find what is wrong immediately, because the response is urgent. If anything, a Root Cause Analysis after the fact would be more similar, or possibly the outage investigation. You might be interested in Julia Evans's debugging puzzles, which are small-scale but good introductions. I could imagine similar scenarios with real servers (and a senior dev moderating) being good training on debugging server issues and learning new tools.
I should have been clearer that the post-mortem is a critical part of incident handling. I see a lot of similarity in the tactical decision-making, both during the incident itself (the decisions made) and in the post-mortem (the analysis and rationale).
Strategic decision-making, meaning tradeoffs between solving a narrow problem simply and leaving room for a whole class of problems at the cost of more complexity (and structures to handle that complexity), is a related but different set of skills.