Hmmm. I’d parsed ‘cheap and easy’ as ‘can be done by a university student, on a university student’s budget, in furtherance of a degree’ - which possibly undervalues time somewhat.
At the cost of some accuracy, however, a less time-consuming method might be the following: automate the query, under the assumption that the bug being repaired was introduced at the earliest time at which any of the lines of code modified by the fix was last changed (that is, if three lines of code were changed to fix the bug, two of which had last been changed on 24 June and one on 22 June, then the bug would be assumed to have been introduced on 22 June). Without human inspection of each result, some extra noise will be introduced into the final graph. (A human, or a suitable AGI if you have one available, inspecting a small subset of the results could give an estimate of the noise introduced.)
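A minimal sketch of that automated query, assuming a git repository; the use of git blame against the fix's parent commit, and all the names below, are illustrative assumptions rather than anything specified in this discussion:

```python
import subprocess
from datetime import date

def git(repo, *args):
    """Run a git command in `repo` and return its standard output."""
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout

def estimated_introduction_date(repo, fix_commit):
    """Date a bug to the earliest last-modification time of any pre-existing
    line touched by the commit that fixed it (the assumption stated above)."""
    timestamps = []
    # A zero-context diff: each hunk header names the parent-version lines
    # that the fix changed or deleted.
    diff = git(repo, "diff", "--unified=0", f"{fix_commit}^", fix_commit)
    current_file = None
    for line in diff.splitlines():
        if line.startswith("--- a/"):
            current_file = line[len("--- a/"):]
        elif line.startswith("@@") and current_file:
            old_span = line.split()[1].lstrip("-")      # e.g. "42,3"
            start, _, count = old_span.partition(",")
            count = int(count or "1")
            if count == 0:
                continue  # the hunk only adds lines; nothing to blame yet
            # When was each of those parent-version lines last changed?
            blame = git(repo, "blame", "--line-porcelain",
                        "-L", f"{start},+{count}",
                        f"{fix_commit}^", "--", current_file)
            timestamps += [int(l.split()[1]) for l in blame.splitlines()
                           if l.startswith("committer-time ")]
    # Earliest date wins: 22 June rather than 24 June in the example above.
    return date.fromtimestamp(min(timestamps)) if timestamps else None
```

Run over every commit identified as a fix, something like this would give the data points for the graph; hand-checking a small sample, as suggested, would estimate how noisy the underlying assumption is.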
By “cheap and easy” what I mean is “do the very hard work of reasoning out how the world would behave if the hypothesis were true, versus if it were false, and locate the smallest observation that discriminates between these two logically possible worlds”.
That’s hard and time-consuming work (therefore expensive), but the experiment itself is cheap and easy.
My intuition (and I could well be Wrong on this) tells me that experiments of the sort you are proposing are sort of the opposite: cheap in the front and expensive in the back. What I’m after is a mullet of an experiment, business in front and party in back.
An exploratory experiment might consist of taking notes the next time you yourself fix a bug, and note the answers to a bunch of hard questions: how did I measure the “cost” of this fix? How did I ascertain that this was in fact a “bug” (vs. some other kind of change)? How did I figure out when the bug was introduced? What else was going on at the same time that might make the measurements invalid?
Asking these questions, ISTM, is the real work of experimental design to be done here.
An exploratory experiment might consist of taking notes the next time you yourself fix a bug, and note the answers to a bunch of hard questions: how did I measure the “cost” of this fix? How did I ascertain that this was in fact a “bug” (vs. some other kind of change)? How did I figure out when the bug was introduced? What else was going on at the same time that might make the measurements invalid?
Well, for a recent bug; first, some background:
Bug: Given certain input, a general utility function returns erroneous output (NaN)
Detail: It turns out that, due to rounding, the function was taking the arccos of a number fractionally higher than one
Fix: Check for the boundary condition; take arccos(1) instead of arccos(1.00000001). Less than a dozen lines of code, and not very complicated lines.
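For concreteness, a minimal sketch of that kind of boundary check; the surrounding function is invented for illustration, since the real code is not shown here:

```python
import math

def angle_between(u, v):
    """Angle between two 2-D vectors: an invented stand-in for the kind of
    utility function described above."""
    dot = u[0] * v[0] + u[1] * v[1]
    cos_theta = dot / (math.hypot(*u) * math.hypot(*v))
    # Rounding can leave cos_theta fractionally outside [-1, 1], e.g.
    # 1.00000001.  C's acos() returns NaN for such a value (Python's
    # math.acos raises ValueError instead); clamping to the valid domain
    # is the boundary check the fix describes.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.acos(cos_theta)
```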
Then, to answer your questions in order:
how did I measure the “cost” of this fix?
Once the problem code was identified, the fix was done in a few minutes. Identifying the problem code took a little longer, as the problem was a rare and sporadic one—it happened first during a particularly irritating test case (and then, entirely by coincidence, a second time on a similar test case, which at first caused some searching in the wrong bit of code).
How did I ascertain that this was in fact a “bug” (vs. some other kind of change)?
A numeric value, displayed to the user, was showing “NaN”.
How did I figure out when the bug was introduced?
The bug was introduced by failing to consider a rare but theoretically possible test case at the time (near the beginning of a long project) that a certain utility function was produced. I could get a time estimate by checking version control to see when the function in question had been first written; but it was some time ago.
What else was going on at the same time that might make the measurements invalid?
A more recent change made the bug slightly more likely to crop up (by increasing the potential for rounding errors). The bug may otherwise have gone unnoticed for some time.
Of course, that example may well be an outlier.
Hmmm. Thinking about this further, I can imagine whole rafts of changes to the specifications which can be made just before final testing at very little cost (e.g. “Can you swap the positions of those two buttons?”) Depending on the software development methodology, I can even imagine pretty severe errors creeping into the code early on that are trivial to fix later, once properly identified.
The only circumstances I can think of that might change how long a bug takes to fix as a function of how long the development has run are:
After long enough, it becomes more difficult to properly identify the bug because there are more places to look to try to find it (for many bugs, this becomes trivial with proper debugging software; but there are still edge cases where even the best debugging software is little help)
If there is some chance that someone, somewhere else in the code, wrote code that relies on the bug—forcing an extended debugging effort, possibly a complete rewrite of the damaged code
If some major, structural change to the code is required (most bugs that I deal with are not of this type)
If the code is poorly written, hard to follow, and/or poorly understood
Good stuff! One crucial nitpick:
A numeric value, displayed to the user, was showing “NaN”.
That doesn’t tell me why it’s a bug. How is ‘bug-ness’ measured? What’s the “objective” procedure to determine whether a change is a bug fix, vs something else (dev gold-plating, change request, optimization, etc)?
NaN is an error code. The display was supposed to show the answer to an arithmetical computation; NaN (“Not a Number”) means that, at some point in the calculation, an invalid operation was performed (zero divided by zero, the arccos of a number greater than 1, or similar).
It is a bug because it does not answer the question that the arithmetical computation was supposed to solve. It merely indicates that, at some point in the code, the computer was told to perform an operation that does not have a defined answer.
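A quick illustration of how that plays out, assuming NumPy-style arithmetic (Python's own math.acos would raise an exception rather than return NaN):

```python
import numpy as np

cos_theta = 1.00000001          # fractionally above 1, thanks to rounding
angle = np.arccos(cos_theta)    # invalid operation: yields nan (with a warning)
result = np.degrees(angle) * 2  # nan silently propagates through the rest
print(result)                   # "nan" is what finally reaches the display
```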
That strikes me as a highly specific description of the “bug predicate”—I can see how it applies in this instance, but if you have 1000 bugs to classify, of which this is one, you’ll have to write 999 more predicates at this level. It seems to me, too, that we’ve only moved the question one step back—to why you deem an operation or a displayed result “invalid”. (The calculator applet on my computer lets me compute 1⁄0 giving back the result “ERROR”, but since that’s been the behavior over several OS versions, I suspect it’s not considered a “bug”.)
Is there a more abstract way of framing the predicate “this behavior is a bug”? (What is “bug” even a property of?)
Ah, I see—you’re looking for a general rule, not a specific reason.
In that case, the general rule under which this bug falls is the following:
For any valid input, the software should not produce an error message. For any invalid input, the software should unambiguously display a clear error message.
‘Valid input’ is defined as any input for which there is a sensible, correct output value.
So, for example, in a calculator application, 1⁄0 is not valid input because division by zero is undefined. Thus, “ERROR” (or some variant thereof) is a reasonable output. 1⁄0.2, on the other hand, is a valid operation, with a correct output value of 5. Returning “ERROR” in that case would be a bug.
Or, to put it another way: error messages should always have a clear external cause (up to and including hardware failure), and it should be obvious that the external cause is a case of using the software incorrectly. An error should never originate within the software; it should always be detected by the software and (where possible) unambiguously communicated to the user.
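A toy illustration of the rule, using the calculator example (the divide function is hypothetical):

```python
def divide(a: float, b: float) -> float:
    """Return a / b, treating b == 0 as invalid input, per the rule above."""
    if b == 0:
        # Invalid input: refuse it with a clear, unambiguous message.
        raise ValueError("cannot divide by zero: invalid input")
    return a / b

print(divide(1, 0.2))    # valid input: prints 5.0
try:
    divide(1, 0)         # invalid input: a clear error message, not a bug
except ValueError as e:
    print("ERROR:", e)
# By the rule above, it would only be a bug if divide(1, 0.2) produced an
# error, or if divide(1, 0) failed silently or returned nonsense.
```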
Granting that this definition of what constitutes a “bug” is diagnostic in the case we’ve been looking at (I’m not quite convinced, but let’s move on), will it suffice for the 999 other cases? Roughly how many general rules are we going to need to sort 1000 typical bugs?
Can we even tell, in the case we’ve been discussing, that the above definition applies, just by looking at the source code or revision history of the source code? Or do we need to have a conversation with the developers and possibly other stakeholders for every bug?
(I did warn up front that I consider the task of even asking the question properly to be very hard, so I’ll make no apologies for the decidedly Socratic turn of this thread.)
will it suffice for the 999 other cases?
No. I have not yet addressed the issues of:
Incorrect output
Program crashes
Irrelevant output
Output that takes too long
Bad user interface
Roughly how many general rules are we going to need to sort 1000 typical bugs?
I can think, off the top of my head, of six rules that seem to cover most cases (each additional rule addressing one category in the list above). If I think about it for a few minutes longer, I may be able to think of exceptions (and then rules to cover those exceptions); however, I think it very probable that over 990 of those thousand bugs would fall under no more than a dozen similarly broad rules. I also expect the occasional bug that is very hard to classify to turn up in a random sample of 1000 bugs.
Can we even tell, in the case we’ve been discussing, that the above definition applies, just by looking at the source code or revision history of the source code?
Hmmm. That depends. I can, because I know the program, and the test case that triggered the bug. Any developer presented with the snippet of code should recognise its purpose, and that it should be present, though it would not be obvious what valid input, if any, triggers the bug. Someone who is not a developer may need to get a developer to look at the code, then talk to the developer. In this specific case, talking with a stakeholder should not be necessary; an independent developer would be sufficient (there are bugs where talking to a stakeholder would be required to properly identify them as bugs). I don’t think that identifying this fix as a bug can be easily automated.
If I were to try to automate the task of identifying bugs with a computer, I’d search through the version history for the word “fix”. It’s not foolproof, but the presence of “fix” in the version history is strong evidence that something was, indeed, fixed. (This fails when the full comment includes the phrase “...still need to fix...”.) Annoyingly, it would fail to pick up this particular bug (the version history mentions “adding boundary checks” without once using the word “fix”).
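As a sketch, assuming a git repository, that search might be little more than the following (the helper is illustrative):

```python
import subprocess

def fix_commits(repo):
    """Commits whose message mentions "fix" (case-insensitive): a rough
    proxy for bug fixes, with the failure modes noted above."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--all", "-i", "--grep=fix",
         "--date=short", "--pretty=format:%h%x09%ad%x09%s"],
        capture_output=True, text=True, check=True)
    return [line.split("\t") for line in out.stdout.splitlines()]

# Each entry is (abbreviated hash, author date, subject line).  Commits like
# "adding boundary checks", or messages saying "...still need to fix...",
# are exactly the cases this heuristic gets wrong.
```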
I’d search through the version history for the word “fix”.
That’s a useful approximation for finding fixes, and enough simpler than a half-dozen rules that I would personally accept the added uncertainty (e.g. repeated fixes for the same issue being counted more than once). As you point out, you have to make it a systematic rule prior to the project, which makes it perhaps less applicable to existing open-source projects. (Many developers diligently mark commits according to their nature, but I don’t know what proportion of all open-source devs do; I suspect not enough.)
It’s too bad we can’t do the same to find when bugs were introduced—developers don’t generally label as such commits that contain bugs.
It’s too bad we can’t do the same to find when bugs were introduced—developers don’t generally label as such commits that contain bugs.
If they did, it would make the bugs easier to find.
If I had to automate that, I’d consider the lines of code changed by the update. For each line changed, I’d find the last time that that line had been changed; I’d take the earliest of these dates.
However, many bugs are fixed not by lines changed, but by lines added. I’m not sure how to date those; the date of the creation of the function containing the new line? The date of the last change to that function? I can imagine situations where either of those could be valid. Again, I would take the earliest applicable date.
I should probably also ignore lines that are only comments.
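For the lines-added case mentioned above, git can at least approximate the history of the enclosing function: git log -L :<funcname>:<file> traces every commit that touched that function. A sketch, assuming git can recognise function boundaries for the language in question (the helper and its names are illustrative):

```python
import subprocess

def function_change_dates(repo, func_name, path):
    """Dates of every commit that touched `func_name` in `path`, newest
    first: the last entry approximates when the function was created, the
    first its most recent change (the two candidate dates mentioned above
    for a fix made purely of added lines).  Taking the earliest applicable
    date means taking the last entry."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--date=short",
         "-L", f":{func_name}:{path}"],
        capture_output=True, text=True, check=True)
    # Only the commit headers start with "Date:" at column zero; the diffs
    # that -L also prints are prefixed with "+", "-" or a space.
    return [line.split(":", 1)[1].strip()
            for line in out.stdout.splitlines()
            if line.startswith("Date:")]
```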
At least one well-known bug I know about consisted of commenting out a single line of code.
This one is interesting—it remained undetected for two years, was very cheap to fix (just add the commented out line back in), but had large and hard to estimate indirect costs.
Among people who buy into the “rising cost of defects” theory, there’s a common mistake: conflating “cost to fix” and “cost of the bug”. This is especially apparent in arguments that bugs in the field are “obviously” very costly to fix, because the software has been distributed in many places, etc. That strikes me as a category error.
many bugs are fixed not by lines changed, but by lines added
Many bugs are also fixed by adding or changing (or in fact deleting) code elsewhere than the place where the bug was introduced—the well-known game of workarounds.
I take your point. I should only ignore lines that are comments both before and after the change; commenting or uncommenting code can clearly be a bugfix. (Or can introduce a bug, of course).
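That refinement is easy to encode. A sketch of such a filter, assuming C-style // and Python-style # line comments (block comments, and other comment syntaxes, would need more care):

```python
import re

# A line that is blank or contains nothing but a line comment.
COMMENT_ONLY = re.compile(r"^\s*(#|//).*$|^\s*$")

def ignorable(old_line, new_line):
    """True only if the line is a comment (or blank) both before and after
    the change; commenting code out, or back in, can itself be the bug or
    the fix, so those lines must still count."""
    return bool(COMMENT_ONLY.match(old_line) and COMMENT_ONLY.match(new_line))

assert ignorable("# old wording", "# new wording")    # a pure comment edit
assert not ignorable("do_work()", "# do_work()")      # code commented out
assert not ignorable("// do_work();", "do_work();")   # code commented back in
```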
Among people who buy into the “rising cost of defects” theory, there’s a common mistake: conflating “cost to fix” and “cost of the bug”. This is especially apparent in arguments that bugs in the field are “obviously” very costly to fix, because the software has been distributed in many places, etc. That strikes me as a category error.
Hmmm. “Cost to fix”, to my mind, should include the cost to find the bug and the cost to repair the bug. “Cost of the bug” should include all the knock-on effects of the bug having been active in the field for some time (which could be lost productivity, financial losses, information leakage, and just about anything, depending on the bug).
Many bugs are also fixed by adding or changing (or in fact deleting) code elsewhere than the place where the bug was introduced—the well-known game of workarounds.
I would assert that this does not fix the bug at all; it simply makes the bug less relevant (hopefully, irrelevant to the end user). If I write a function that’s supposed to return a+b, and it instead returns a+b+1, then this can easily be worked around by subtracting one from the return value every time it is used; but the downside is that the function is still returning the wrong value (a trap for any future maintainers) and, moreover, it makes the actual bug even more expensive to fix (since once it is fixed, all the extraneous minus-ones must be tracked down and removed).
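In code, the trap looks something like this (the function and values are invented):

```python
def add(a, b):
    """Supposed to return a + b, but actually returns a + b + 1: the bug."""
    return a + b + 1

# Each caller works around the bug instead of fixing it:
subtotal = add(2, 3) - 1        # 5, but only because of the trailing "- 1"
total = add(subtotal, 4) - 1    # 9, the same workaround repeated
print(subtotal, total)          # 5 9
# The function still returns the wrong value (a trap for future maintainers),
# and once add() is finally fixed, every extraneous "- 1" must be tracked
# down and removed, or the workarounds start over-correcting.
```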