This strikes me as particularly galling because I have in fact repeated this claim to someone new to the field. I think I prefaced it with “studies have conclusively shown...”. Of course, it would have seemed unreasonable of me to suspect that something touted by so many as well-researched was not, in fact, so.
Mind, it seems to me that defects do follow both patterns: introducing defects earlier and/or fixing them later should come at a higher dollar cost; that just makes sense. However, it could be the same type of “makes sense” that led Aristotle to conclude that heavy objects fall faster than light ones. Getting actual data would be much better than reasoning alone, especially as it would tell us just how much costlier, if at all, these differences really are: an actual precise tool rather than a crude (and uncertain) rule of thumb.
I do have one nagging worry about this example: These days a lot of projects collect a lot of metrics. It seems dubious to me that no one has tried to replicate these results.
Mostly the ones that are easy to collect: a classic case of “looking under the lamppost where there is light rather than where you actually lost your keys”.
Now we’re starting to think. Could we (I don’t have a prefabricated answer to this one) think of a cheap and easy to run experiment that would help us see more clearly what’s going on?
Here’s an idea. There are a great many open-source software projects, and most of them live in some sort of version control system which keeps a number of important records: every change to the software comes with a timestamp, a note from the programmer describing the intention of the change, and the set of file changes that resulted.
A simple experiment might then be to collate data from either one large project or a number of smaller projects. The cost of fixing a bug can be estimated from the number of lines of code changed to fix it; the time since the bug was introduced can be found by looking back through previous versions and comparing timestamps. A scatter plot of time vs. lines-of-code-changed can then be produced and inspected for trends.
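A minimal sketch, in Python, of what the plotting end of that experiment could look like, assuming the per-fix records (estimated age of the bug in days, lines changed by the fix) have already been extracted from version control somehow. The use of matplotlib and scipy is an arbitrary choice, and the function name is made up for illustration.

```python
from typing import List, Tuple

import matplotlib.pyplot as plt
from scipy.stats import spearmanr


def plot_fix_size_vs_bug_age(records: List[Tuple[float, int]]) -> None:
    """records: one (days_since_bug_introduced, lines_changed_by_fix) pair per bug fix."""
    ages = [age for age, _ in records]
    sizes = [loc for _, loc in records]

    # Rank correlation: do older bugs tend to need bigger fixes?
    rho, p_value = spearmanr(ages, sizes)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f}) over {len(records)} fixes")

    plt.scatter(ages, sizes, alpha=0.4)
    plt.xlabel("days between estimated introduction and fix")
    plt.ylabel("lines of code changed by the fix")
    plt.title("Fix size vs. bug age")
    plt.show()
```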
Of course, this would require a fair investment of time to do it properly.
And time is money, so that doesn’t really fit the “cheap and easy” constraint I specified.
Hmmm. I’d parsed ‘cheap and easy’ as ‘can be done by a university student, on a university student’s budget, in furtherance of a degree’ - which possibly undervalues time somewhat.
At the cost of some accuracy, however, a less time-consuming method would be to automate the query under the assumption that the bug being repaired was introduced at the earliest time that any of the lines modified by the fix was last changed (that is, if three lines of code were changed to fix the bug, two of which had last been changed on 24 June and one on 22 June, then the bug would be assumed to have been introduced on 22 June). Without human inspection of each result, some extra noise will be introduced into the final graph. (Human (or suitable AGI, if you have one available) inspection of a small subset of the results could give an estimate of the noise introduced.)
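For what it's worth, here is a rough sketch of that heuristic using plain `git` subprocess calls: parse the fix commit's diff to find which old lines it touched, then blame each of those lines at the fix's parent and keep the earliest author date. The helper names are invented for illustration, and it ignores real-world complications such as file renames.

```python
import subprocess
from datetime import datetime, timezone


def git(repo: str, *args: str) -> str:
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout


def changed_old_lines(repo: str, fix_commit: str):
    """Yield (path, old_line_number) for every line the fix modified or deleted."""
    diff = git(repo, "show", "--unified=0", "--pretty=format:", fix_commit)
    path = None
    for line in diff.splitlines():
        if line.startswith("--- a/"):
            path = line[len("--- a/"):]
        elif line.startswith("@@") and path:
            # Hunk header: @@ -start[,count] +start[,count] @@
            old_range = line.split()[1].lstrip("-")
            start, _, count = old_range.partition(",")
            for n in range(int(start), int(start) + int(count or "1")):
                yield path, n


def estimated_introduction(repo: str, fix_commit: str) -> datetime:
    """Earliest last-modification date among the lines touched by the fix."""
    dates = []
    for path, line_no in changed_old_lines(repo, fix_commit):
        blame = git(repo, "blame", "--porcelain", "-L", f"{line_no},{line_no}",
                    f"{fix_commit}^", "--", path)
        for field in blame.splitlines():
            if field.startswith("author-time "):
                dates.append(datetime.fromtimestamp(int(field.split()[1]),
                                                    tz=timezone.utc))
    if not dates:
        raise ValueError("this fix only added lines; see the discussion of that case below")
    return min(dates)
```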
By “cheap and easy” what I mean is “do the very hard work of reasoning out how the world would behave if the hypothesis were true, versus if it were false, and locate the smallest observation that discriminates between these two logically possible worlds”.
That’s hard and time-consuming work (therefore expensive), but the experiment itself is cheap and easy.
My intuition (and I could well be Wrong on this) tells me that experiments of the sort you are proposing are sort of the opposite: cheap in the front and expensive in the back. What I’m after is a mullet of an experiment, business in front and party in back.
An exploratory experiment might consist of taking notes the next time you yourself fix a bug, and note the answers to a bunch of hard questions: how did I measure the “cost” of this fix? How did I ascertain that this was in fact a “bug” (vs. some other kind of change)? How did I figure out when the bug was introduced? What else was going on at the same time that might make the measurements invalid?
Asking these questions, ISTM, is the real work of experimental design to be done here.
Well, here’s a recent bug. First, some background:
Bug: Given certain input, a general utility function returns erroneous output (NaN).
Detail: It turns out that, due to rounding, the function was taking the arccos of a number fractionally greater than one.
Fix: Check for the boundary condition; take arccos(1) instead of arccos(1.00000001). Less than a dozen lines of code, and not very complicated lines.
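Not the actual code from the project, but a minimal illustration of that kind of boundary check: clamp the value into [-1, 1] before taking the arccos, so rounding error cannot push it outside the valid domain.

```python
import math


def safe_acos(x: float) -> float:
    # Rounding can produce e.g. 1.00000001 where the mathematically exact
    # value would be 1.0; clamp into the valid domain before calling acos.
    return math.acos(max(-1.0, min(1.0, x)))


assert safe_acos(1.00000001) == 0.0
# Without the clamp, math.acos raises ValueError here; in languages where
# acos silently returns NaN instead, that NaN is what ends up on screen.
```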
Then, to answer your questions in order:
How did I measure the “cost” of this fix?
Once the problem code was identified, the fix was done in a few minutes. Identifying the problem code took a little longer, as the problem was a rare and sporadic one: it happened first during a particularly irritating test case (and then, entirely by coincidence, a second time on a similar test case, which caused some searching in the wrong bit of code at first).
How did I ascertain that this was in fact a “bug” (vs. some other kind of change)?
A numeric value, displayed to the user, was showing “NaN”.
How did I figure out when the bug was introduced?
The bug was introduced by failing to consider a rare but theoretically possible case at the time (near the beginning of a long project) that a certain utility function was written. I could get a time estimate by checking version control to see when the function in question was first written; but it was some time ago.
What else was going on at the same time that might make the measurements invalid?
A more recent change made the bug slightly more likely to crop up (by increasing the potential for rounding errors). The bug may otherwise have gone unnoticed for some time.
Of course, that example may well be an outlier.
Hmmm. Thinking about this further, I can imagine whole rafts of changes to the specifications which can be made just before final testing at very little cost (e.g. “Can you swap the positions of those two buttons?”) Depending on the software development methodology, I can even imagine pretty severe errors creeping into the code early on that are trivial to fix later, once properly identified.
The only circumstances I can think of that might change how long a bug takes to fix as a function of how long the development has run are:
After long enough, it becomes more difficult to properly identify the bug because there are more places to look to try to find it (for many bugs, this becomes trivial with proper debugging software; but there are still edge cases where even the best debugging software is little help)
If there is some chance that someone, somewhere else in the code, wrote code that relies on the bug—forcing an extended debugging effort, possibly a complete rewrite of the damaged code
If some major, structural change to the code is required (most bugs that I deal with are not of this type)
If the code is poorly written, hard to follow, and/or poorly understood
Good stuff! One crucial nitpick:
That doesn’t tell me why it’s a bug. How is ‘bug-ness’ measured? What’s the “objective” procedure to determine whether a change is a bug fix, vs something else (dev gold-plating, change request, optimization, etc)?
NaN is an error code. The display was supposed to show the answer to an arithmetical computation; NaN (“Not a Number”) means that, at some point in the calculation, an invalid operation was performed (division by zero, arccos of a number greater than 1, or similar).
It is a bug because it does not answer the question that the arithmetical computation was supposed to solve. It merely indicates that, at some point in the code, the computer was told to perform an operation that does not have a defined answer.
That strikes me as a highly specific description of the “bug predicate”—I can see how it applies in this instance, but if you have 1000 bugs to classify, of which this is one, you’ll have to write 999 more predicates at this level. It seems to me, too, that we’ve only moved the question one step back—to why you deem an operation or a displayed result “invalid”. (The calculator applet on my computer lets me compute 1⁄0 giving back the result “ERROR”, but since that’s been the behavior over several OS versions, I suspect it’s not considered a “bug”.)
Is there a more abstract way of framing the predicate “this behavior is a bug”? (What is “bug” even a property of?)
Ah, I see—you’re looking for a general rule, not a specific reason.
In that case, the general rule under which this bug falls is the following:
For any valid input, the software should not produce an error message. For any invalid input, the software should unambiguously display a clear error message.
‘Valid input’ is defined as any input for which there is a sensible, correct output value.
So, for example, in a calculator application, 1⁄0 is not valid input because division by zero is undefined. Thus, “ERROR” (or some variant thereof) is a reasonable output. 1⁄0.2, on the other hand, is a valid operation, with a correct output value of 5. Returning “ERROR” in that case would be a bug.
Or, to put it another way; error messages should always have a clear external cause (up to and including hardware failure). It should be obvious that the external cause is a case of using the software incorrectly. An error should never start within the software, but should always be detected by the software and (where possible) unambiguously communicated to the user.
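A toy illustration of that rule, using the calculator example from above (hypothetical code, just to make the rule concrete): valid input never produces an error, and invalid input produces a deliberate, clearly worded error rather than letting something like NaN leak through.

```python
def divide(a: float, b: float) -> float:
    if b == 0:
        # Invalid input: refuse it with a clear, deliberate error message.
        raise ValueError("division by zero has no defined result")
    return a / b


assert divide(1, 0.2) == 5.0   # valid input: a correct value, never an error
try:
    divide(1, 0)               # invalid input: a clear error, by design
except ValueError as err:
    print(f"ERROR: {err}")
```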
Granting that this definition of what constitutes a “bug” is diagnostic in the case we’ve been looking at (I’m not quite convinced, but let’s move on), will it suffice for the 999 other cases? Roughly how many general rules are we going to need to sort 1000 typical bugs?
Can we even tell, in the case we’ve been discussing, that the above definition applies, just by looking at the source code or revision history of the source code? Or do we need to have a conversation with the developers and possibly other stakeholders for every bug?
(I did warn up front that I consider the task of even asking the question properly to be very hard, so I’ll make no apologies for the decidedly Socratic turn of this thread.)
No. I have not yet addressed the issues of:
Incorrect output
Program crashes
Irrelevant output
Output that takes too long
Bad user interface
I can think, off the top of my head, of six rules that seem to cover most cases (each additional rule addressing one category in the list above). If I thought about it for a few minutes longer, I could probably come up with exceptions (and then rules to cover those exceptions); however, I think it very probable that over 990 of those thousand bugs would fall under no more than a dozen similarly broad rules. I also expect the occasional bug that is very hard to classify, and such a bug is quite likely to turn up in a random sample of 1000.
Hmmm. That depends on who is looking. I can tell, because I know the program and the test case that triggered the bug. Any developer presented with the snippet of code should recognise its purpose, and that the check needs to be there, though it would not be obvious what valid input, if any, triggers the bug. Someone who is not a developer may need to get a developer to look at the code, and then talk to that developer. In this specific case, talking with a stakeholder should not be necessary; an independent developer would be sufficient (there are bugs where talking to a stakeholder would be required to properly identify them as bugs). I don’t think that identifying this change as a bug fix could easily be automated.
If I were to try to automate the task of identifying bug fixes, I’d search the version history for the word “fix”. It’s not foolproof, but the presence of “fix” in a commit message is strong evidence that something was, indeed, fixed. (It fails when the full comment includes a phrase like “...still need to fix...”.) Annoyingly, it would also fail to pick up this particular bug: the version history mentions “adding boundary checks” without once using the word “fix”.
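A sketch of that search, leaning on git's own message filtering rather than scanning logs by hand; as noted above, it will miss fixes that never say "fix" and will catch some false positives, so it is only a first-pass filter.

```python
import subprocess


def candidate_fix_commits(repo: str) -> list:
    """Commits whose message mentions 'fix' (case-insensitive), one summary line each."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--all", "-i", "--grep=fix",
         "--date=short", "--pretty=format:%h %ad %s"],
        capture_output=True, text=True, check=True).stdout
    return out.splitlines()


# Example usage (assumes the current directory is a git clone):
for entry in candidate_fix_commits(".")[:10]:
    print(entry)
```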
That’s a useful approximation for finding fixes, and enough simpler than a half-dozen rules that I would personally accept the added uncertainty (e.g. repeated fixes for the same issue being counted more than once). As you point out, you would have to adopt it as a systematic convention before the project starts, which makes it perhaps less applicable to existing open-source projects. (Many developers diligently mark commits according to their nature, but I don’t know what proportion of all open-source devs do; I suspect not enough.)
It’s too bad we can’t do the same to find when bugs were introduced—developers don’t generally label bug-introducing commits as such.
If they did, it would make the bugs easier to find.
If I had to automate that, I’d consider the lines of code changed by the update. For each line changed, I’d find the last time that that line had been changed; I’d take the earliest of these dates.
However, many bugs are fixed not by lines changed, but by lines added. I’m not sure how to date those; the date of the creation of the function containing the new line? The date of the last change to that function? I can imagine situations where either of those could be valid. Again, I would take the earliest applicable date.
I should probably also ignore lines that are only comments.
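A small follow-on sketch for that refinement: a filter that drops blank and comment-only lines before they are dated, with purely added lines left for the fallback discussed above. The comment syntax covered here is a guess at C-family and Python styles (a real study would need per-language rules), and, as the next comment points out, the filter can misfire when commenting a line out is itself the bug or the fix.

```python
import re

# Matches lines that contain nothing but a comment marker and its text
# (C-family or Python style).
COMMENT_ONLY = re.compile(r"^\s*(#|//|/\*|\*/|\*)")


def is_blank_or_comment(line: str) -> bool:
    return not line.strip() or bool(COMMENT_ONLY.match(line))


def lines_worth_dating(old_lines):
    """Keep only (path, line_number, text) records whose text is real code."""
    return [(path, n) for path, n, text in old_lines
            if not is_blank_or_comment(text)]
```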
At least one well-known bug I know about consisted of commenting out a single line of code.
This one is interesting—it remained undetected for two years, was very cheap to fix (just add the commented out line back in), but had large and hard to estimate indirect costs.
Among people who buy into the “rising cost of defects” theory, there’s a common mistake: conflating “cost to fix” and “cost of the bug”. This is especially apparent in arguments that bugs in the field are “obviously” very costly to fix, because the software has been distributed in many places, etc. That strikes me as a category error.
Many bugs are also fixed by adding or changing (or in fact deleting) code elsewhere than the place where the bug was introduced—the well-known game of workarounds.
I take your point. I should only ignore lines that are comments both before and after the change; commenting or uncommenting code can clearly be a bugfix. (Or can introduce a bug, of course).
Hmmm. “Cost to fix”, to my mind, should include the cost to find the bug and the cost to repair the bug. “Cost of the bug” should include all the knock-on effects of the bug having been active in the field for some time (which could be lost productivity, financial losses, information leakage, and just about anything, depending on the bug).
I would assert that this does not fix the bug at all; it simply makes the bug less relevant (hopefully, irrelevant to the end user). If I write a function that’s supposed to return a+b, and it instead returns a+b+1, then this can easily be worked around by subtracting one from the return value every time it is used; but the downside is that the function is still returning the wrong value (a trap for any future maintainers) and, moreover, it makes the actual bug even more expensive to fix (since once it is fixed, all the extraneous minus-ones must be tracked down and removed).
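A toy version of that trap (hypothetical code, just to make the point concrete):

```python
def add(a, b):
    return a + b + 1           # the actual bug: off by one


# Callers patch around it instead of fixing add() itself:
subtotal = add(2, 3) - 1       # workaround no. 1
total = add(subtotal, 10) - 1  # workaround no. 2, and so on...
assert subtotal == 5 and total == 15

# Once add() is finally corrected, every stray "- 1" has to be hunted down,
# which is exactly why the workaround makes the real fix more expensive.
```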
A costly but simple way would be to gather groups of software engineers and have them work on projects where you intentionally introduce defects at various stages, and measure the cost of fixing them. To be statistically meaningful, this probably means thousands of engineer-hours devoted to just that purpose.
A cheap (but not simple) way would be to go around as many companies as possible and take the relevant measurements on actual products. This entails a lot of variables, however; engineering groups tend to work in many different ways, which might leave the data less than conclusive. In addition, the politics of working with existing companies may also tilt the results of such research.
I can think of simple experiments that are not cheap; and of cheap experiments that are not simple. I’m having difficulty satisfying the conjunction and I suspect one doesn’t exist that would give a meaningful answer for high-cost bugs.
It’s not that costly if you do it with university students: get two groups of four students. One group is told “test early and often”; the other group is told “test after the code is integrated”. For every bug they fix, measure the effort needed to fix it (by having them “sign a clock”, i.e. log their time, for every task they do). Then analyse when each bug was introduced (this seems easy once the bug is fixed, and easier still if they use something like Trac and SVN). All it takes is a month-long project that a group of four software engineering students can do, and it seems like any university with a software engineering department could run it within a single course. Seems to me it’s under $50K to fund?
Yes, it would be nice to have such a study.
But it can’t really be done the way you envision it. Variance in developer quality is high. Getting a meaningful result would require a lot more than 8 developers. And very few research groups can afford to run an experiment of that size—particularly since the usual experience in science is that you have to try the study a few times before you have the procedure right.
That would be cheap and simple, but wouldn’t give a meaningful answer for high-cost bugs, which don’t manifest in such small projects. Furthermore, with only eight people total, individual ability differences would overwhelmingly dominate all the other factors.
By definition, no cheap experiment can give meaningful data about high-cost bugs.
That sounds intuitively appealing, but I’m not quite convinced that it actually follows.
You can try to find people who produce such an experiment as a side-effect, but in that case you don’t get to specify parameters (that may lead to a failure to control some variable—or not).
The overall cost of the experiment for all involved parties will still not be low, though (although the marginal cost of the experiment, relative to just doing business as usual, can probably be reduced).
A “high-cost bug” seems to imply tens of hours spent on fixing it overall. Otherwise, it is not clear how to measure the cost: from my experience, quite similar bugs can take anywhere from 5 minutes to a couple of hours to locate and fix, with no clear advance sign of which it will be; how the hunt goes depends on the shape you are in that day, after all. On the other hand, the fix should be a relatively small part of the entire project; otherwise it is not so much a bug as the entire project goal (which skews the data about both locating the bug and the cost of integrating the fix).
If 10-20 hours (and how could you predict in advance how high-cost a bug will be?) are to be a small part of a project, you are talking about at least hundreds of man-hours (not a good measure of project complexity, but it is an estimate of cost). Then you need to repeat the experiment, and you need to try alternative strategies to get more data on early detection, on late detection, and so on.
It may be that you have access to some resource you can spend on this but not on anything better (I dunno, a hundred students with a few hours per week for a year, dedicated to some programming practice where you have relative freedom?), or that you can influence the set of measurements taken on some real projects. But the experiment will only be cheap by making someone else cover the main cost (probably for a good, unrelated reason).
Also notice that if you cannot influence how things are done, only how they are measured, you need to specify what is measured much better than the cited papers do. What counts as the moment a bug was introduced? What counts as the cost of fixing it? Note that fixing a high-cost bug may include making improvements that had been put off before, and that putting-off could have been a reasoned decision or simply irrational. It would be nice if someone proposed a methodology for measuring enough control variables in such a project; not because it would let us run this experiment, but because it would be a very useful piece of research on software project costs in general.
A high-cost bug can also be one that reduces the benefit of having the program by a large amount.
For instance, suppose the “program” is a profitable web service that makes $200/hour of revenue when it is up, and costs $100/hour to operate (in hosting fees, ISP fees, sysadmin time, etc.), thus turning a tidy profit of $100/hour. When the service is down, it still costs $100/hour but makes no revenue.
Bug A is a crashing bug that causes data corruption that takes time to recover; it strikes once, and causes the service to be down for 24 hours, which time is spent fixing it. This has the revenue impact of $200 · 24 = $4800.
Bug B is a small algorithmic inefficiency; fixing it takes an eight-hour code audit, and brings the operational cost of the service down from $100/hour to $99/hour. That is a bottom-line impact of $1 · 24 · 365 = $8,760/year.
Bug C is a user interface design flaw that makes the service unusable to the 5% of the population who are colorblind. It takes five minutes of CSS editing to fix. Colorblind people spend as much money as everyone else, if they can; so fixing it increases the service’s revenue by about 5.3%, from $200/hour to roughly $210.50/hour. That is a revenue impact of about $10.50 · 24 · 365 ≈ $92,000/year.
Which bug is the highest-cost? Seems clear to me.
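A quick back-of-the-envelope check of those three figures, computed straight from the stated (hypothetical) assumptions:

```python
REVENUE_PER_HOUR = 200.0          # while the service is up
HOURS_PER_YEAR = 24 * 365

# Bug A: 24 hours of downtime, with revenue lost for the duration.
bug_a = REVENUE_PER_HOUR * 24                                       # $4,800 one-off

# Bug B: operating cost drops by $1/hour once fixed.
bug_b = 1.0 * HOURS_PER_YEAR                                        # $8,760 per year

# Bug C: 5% of the population is locked out; they would spend like everyone
# else, so current revenue is only 95% of the potential revenue.
potential_per_hour = REVENUE_PER_HOUR / 0.95
bug_c = (potential_per_hour - REVENUE_PER_HOUR) * HOURS_PER_YEAR    # roughly $92,000 per year

print(f"Bug A: ${bug_a:,.0f} (one-off)")
print(f"Bug B: ${bug_b:,.0f} per year")
print(f"Bug C: ${bug_c:,.0f} per year")
```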
The definition of cost you use (damage-if-unfixed-by-release) is distinct from all the previous definitions of cost (cost-to-fix-when-found). Neither is easy to measure. The articles actually cited discuss the latter.
I asked for the original description of the plotted values to be included in the article, but it is not there yet.
Of course, the existence of a high-cost bug in your sense of the term implies that the project is not just a cheap experiment.
Furthermore, under your example the claim that the article contests as a plausible story with no facts behind it becomes a matter of simple arithmetic (the longer a bug lives, the larger the time multiplier on its cost). On the other hand, given that many bugs become irrelevant because of some upgrade or rewrite before they are found, it is even harder to estimate the number of bugs, let alone the cost of each one. Also, how an inefficiency affects operating costs can be difficult enough to estimate that nobody knows whether it is better to fix a cost-increaser or to add a new feature to increase revenue.
Is that a request addressed to me? :)
If so, all I can say is that what is being measured is very rarely operationalized in the cited articles: for instance, the Grady 1999 “paper” isn’t really a paper in the usual sense, it’s a PowerPoint with absolutely no accompanying text. The Grady 1989 article I quote even states that these costs weren’t accurately measured.
The older literature, such as Boehm’s 1976 article “Software Engineering”, does talk about cost to fix, not total cost of the consequences. He doesn’t say what he means by “fixing”. Other papers mention “development cost required to detect and resolve a software bug” or “cost of reworking errors in programs”—those point more strongly to excluding the economic consequences other than programmer labor.
Of course. My point is that you focused a bit too much on the misciting instead of going for the quick kill and saying that they measure something underspecified.
Also, if you think that their main transgression is citing things wrong, exact labels from the graphs you show seem to be a natural thing to include. I don’t expect you to tell us what they measured—I expect you to quote them precisely on that.
The main issue is that people just aren’t paying attention. My focus on citation stems from observing that a pair of parentheses, a name and a year seem to function, for a large number of people in my field, as a powerful narcotic suspending their critical reason.
If this is a tu quoque argument, it is spectacularly mis-aimed.
The distinction I made is about the level of suspension. It looks like people suspend their reasoning about statements having a well-defined meaning, not just reasoning about the mere truth of facts presented. I find the former way worse than the latter.
It is not about you; sorry for stating it slightly wrong. I thought about the unfortunate implications but found no good way to avoid them. I needed to contrast “copy” and “explain”.
I had no intention of saying you were being hypocritical, but the discussion had started to depend on a highly relevant (from my point of view) and objectively short piece of data that you had but did not include. I actually was wrong about one of my assumptions about the original labels...
No offence taken.
As to your other question: I suspect that the first author to mis-cite Grady was Karl Wiegers in his requirements book (from 2003 or 2004); he’s also the author of the Serena paper listed above. A very nice person, by the way—he kindly sent me an electronic copy of the Grady presentation. At least he’s read it. I’m pretty damn sure that the secondary citations afterwards are from people who haven’t.
Well, if he has read the Grady paper and still cited it wrong, he most likely got his nice graph from somewhere else… I wonder who first published this graph, and why.
About references—well, what discipline is not diseased like that? We are talking about something that people (rightly or wrongly) equate with common sense in the field. People want to cite some widely accepted statement, which agrees with their perceived experience. And the deadline is nigh. If they find an article with such a result, they are happy. If they find a couple of articles referencing this result, they steal the citation. After all, who cares what to cite, everybody knows this, right?
I am not sure that even in maths the situation is significantly better. There are fresher results where you understand how to find a paper to reference, there are older results that can be found in university textbooks, and there is a middle ground where you either find something that looks like a good enough reference or have to include a sketch of the proof. (I have done the latter for some relatively simple result in a maths article.)
Or to put that another way, there can’t be any low-hanging fruit, otherwise someone would have plucked it already.
We know that late detection is sometimes much more expensive, simply because, depending on the domain, some bugs can do harm (letting bad data into the database, making your customers’ credit card numbers accessible to the Russian Mafia, delivering a satellite to the bottom of the Atlantic instead of into orbit) that is much more expensive than the cost of fixing the code itself. So it’s clear that on average, cost does increase with time of detection. But are those high-profile disasters part of a smooth graph, or is it a step function where the cost of fixing the code typically doesn’t increase very much, but once bugs slip past final QA all the way into production, there is suddenly the opportunity for expensive harm to be done?
In my experience, the truth is closer to the latter than the former, so that instead of constantly pushing for everything to be done as early as possible, we would be better off focusing our efforts on e.g. better automatic verification to make sure potentially costly bugs are caught no later than final QA.
But obviously there is no easy way to measure this, particularly since the profile varies greatly across domains.
The real problem with these graphs is not that they were cited wrong. After all, it does look like the two are taken from different data sets, however those were collected, and that they support the same conclusion.
The true problem is that it is hard to say what they measure at all.
If this true problem didn’t exist, and these graphs measured something that can actually be measured, I’d bet that the fact that they have not been refuted would mean they are both showing the true sign of the correlation. The reason is quite simple: every possible metric gets collected for some stupid presentation from time to time. If the correlation were falsifiable and wrong, we would likely have seen falsifications show up as anecdotes on the TheDailyWTF forum.
I don’t understand why you think the graphs are not measuring a quantifiable metric, nor why it would not be falsifiable. Especially if the ratios are as dramatic as often depicted, I can think of a lot of things that would falsify it.
I also don’t find it difficult to say what they measure: The cost of fixing a bug depending on which stage it was introduced in (one graph) or which stage it was fixed in (other graph). Both things seem pretty straightforward to me, even if “stages” of development can sometimes be a little fuzzy.
I agree with your point that falsifications should have been forthcoming by now, but then again, I don’t know that anyone is actually collecting this sort of metrics—so anecdotal evidence might be all people have to go on, and we know how unreliable that is.
There are things that could falsify it dramatically, most probably; apparently those things are not, in fact, true. I specifically said “falsifiable and wrong”: in the parts where this correlation is falsifiable, it is not wrong for the majority of projects.
About the dramatic ratio: you cannot falsify a single data point. It simply happened like this—or so the story goes. There are so many things that will be different in another experiment that can change (although not reverse) the ratio without disproving the general strong correlation...
Actually, we do not even know what the axis labels are. I guess they are fungible enough.
Saying that the cost of fixing is something straightforward seems too optimistic. Estimating the true cost of an entire project is not always simple when you have more than one project at once and some people are involved with both. What do you call the cost of fixing a bug?
Any metric that contains “cost” in its name gets requested by some manager somewhere in the world from time to time. How it is calculated is another question, and that is the question that actually matters.