If you are talking about what “causes” what, maybe you should think about the problem causally first, not as a standard hypothesis testing problem (although hypothesis testing may reappear later). What does it mean for a line to be a cause of bad behavior? What is “a causal effect”? Often people formalize causal effect as a difference of means between the “control group” and a “test group.” In your case the control group is the original program, and the test group is the program where you intervened to comment out the offending line, say line 500.
You have program output O, which is either good (o1) or bad (o2). In the original program, where nothing was done to it, you sometimes get crashes: P(O = o2) > 0. You also have an altered program, where you intervened to comment out line 500, and where, say, P(O = o2 | do(line500 = noop)) = 0.
The statistic you want is, for example, E(O) - E(O | do(line500 = noop)), with O coded as 0 for o1 and 1 for o2 so that each expectation is just a crash probability. If this statistic is not zero, we say line 500 has a causal effect on the crash.
Since you can intervene directly in your system, you can simply run both versions enough times to estimate this statistic and figure out whether there is a causal effect or not. In systems that are not computer programs, people often cannot intervene directly, and so resort to “trickery” to get statistics like the above mean difference.
If this seems simple, that’s because it is. This setup mirrors how people actually debug: they intervene in the system and compare the altered version (“the test group”) with the original, sometimes doing multiple runs if the bug is a “Heisenbug.”
There is also the issue of whether you can really treat the outputs of repeated program runs as iid samples. Sometimes you can, often you cannot, as other posters pointed out.
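To make the comparison concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `run_program` is a made-up stand-in for actually launching your program, the 10% crash rate is invented, and the permutation test is just one reasonable way to ask whether the observed difference could be noise. It also assumes the runs are exchangeable, which is exactly the iid caveat above.

```python
import random


def run_program(line500_enabled: bool, rng: random.Random) -> int:
    """Hypothetical stand-in for one program run.

    Returns 1 for a crash (o2) and 0 for good output (o1).
    """
    # Assumption for illustration: line 500 triggers a crash in ~10% of runs.
    if line500_enabled and rng.random() < 0.10:
        return 1
    return 0


def crash_rate(outcomes):
    """Sample mean of 0/1 outcomes, i.e. the estimated crash probability."""
    return sum(outcomes) / len(outcomes)


def estimate_effect(n_runs: int, seed: int = 0):
    """Estimate E(O) - E(O | do(line500 = noop)) from n_runs of each version."""
    rng = random.Random(seed)
    control = [run_program(True, rng) for _ in range(n_runs)]   # original program
    treated = [run_program(False, rng) for _ in range(n_runs)]  # line 500 commented out
    return crash_rate(control) - crash_rate(treated), control, treated


def permutation_p_value(control, treated, n_perm: int = 2000, seed: int = 1) -> float:
    """Two-sided permutation test for a difference in crash rates.

    Shuffles the group labels and counts how often the shuffled difference
    is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(crash_rate(control) - crash_rate(treated))
    pooled = list(control) + list(treated)
    n = len(control)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(crash_rate(pooled[:n]) - crash_rate(pooled[n:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    effect, control, treated = estimate_effect(500)
    print("estimated effect:", effect)
    print("permutation p-value:", permutation_p_value(control, treated))
```

In a real setting you would replace `run_program` with something that launches the two builds and records whether each run crashed. This is where hypothesis testing reappears: the causal question fixes which quantity to estimate, and the test only tells you whether your estimate is distinguishable from zero.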