This is such an awesome programming story that I would like to see it expanded—more exact details of how the bug worked, with more specific code. In particular, I’d like to see this section expanded:
It turned out that the class of identifiers that failed to match the regular expression were a subset of a set of identifiers for which the lookup didn’t have to be done, because some previously-cached data would give the same results.
We use a program called BLAST to compare every new gene to every known gene. To reduce the number of comparisons, we use a database called Panda that groups together known genes that produce the same protein sequence. A Panda header is a line containing an identifier and a string called an ‘annotation’ for each gene, separated by ^|^, like this:
The entries in a Panda header are sorted by the reliability of the database that each entry came from. SwissProt (SP) identifiers come first, because SwissProt is curated, meaning a human has looked at and approved every entry in the database. RefSeq (RF) and Genbank (GB) come near the end, because they typically have annotations that were assigned by some other computer program running BLAST and transferring it from a SwissProt match. The less-reliable computer-produced databases are, naturally, the largest; so most matches are to RF or GB identifiers.
So, we only BLAST each gene against the first gene listed in each line of the Panda database. We save the 250 best matches and their Panda headers in a file called the btab.
When trying to assign a function to the new gene, the program, among other things, picks the best matches that pass various tests, and looks to see if any of those genes have known functions. It does this by looking up their identifiers in various databases.
When I inherited the program, this is what it did: Take the identifier of the matched gene, load its Panda header from the btab and store it in a variable. Then look up the Panda header for that identifier using a slow program called yank-panda, and write that over the version it just copied from the btab. Take only the first entry in the returned header, and look up that identifier in the databases.
This was silly, both because we already had the Panda header, and because the first entry in the saved header was always the same as the identifier used to call yank_panda to find the header. It was a very slow way of using an identifier to look itself up. But I realized that many of the databases we checked for function assignments listed genes under the alternate identifiers. So I had the program check all of the genes in the returned Panda header; and this resulted in getting more annotations.
Then I modified the program to just read the copy of the Panda header from the btab; and not call yank-panda at all; making the program run much, much faster. But this caused the output to change so as to miss some annotations. I discovered that, in some cases, the copy of the Panda header in the btab was truncated; and it was difficult to tell whether a header was truncated or not, since the annotation has no standard format.
So I restored the call to yank-panda, and stored the result in the variable that previously had the (possibly-truncated) header copied from the btab, to make sure to get the full header.
The call to yank-panda failed silently for 3 types of identifiers: RefSeq, GenBank, and Omnium. That’s because these identifiers have 4 parts instead of 3, like so:
RF|NP_853671.1|31791178|NC_002945
These identifiers fail to match this regular expression, which is deep inside a bioinformatics library that yank_panda calls that parses identifiers:
my $search_match = qr/^([A-Za-z]{2,})|([\w.]+)?|?([\w.]+)?$/;
However, because Panda headers are sorted by database, and RefSeq (RF) and GenBank (GB) identifiers always come near the end, it’s never the case both that a GB or RF identifier is the first in the header, and that the header is long enough to be truncated. So in these cases the regular expression fails to match, yank-panda returns nothing, and the program fails to notice and goes ahead and uses the header from the btab, which is the same as would have been returned by yank-panda, because headers starting with RF or GB are never truncated.
So, you could argue that this beneficial bug was possible only because the code was poorly-written in the first place. It could also be the case that the program’s original author knew that RF and GB identifiers would fail when calling yank-panda, and that’s why he copied the original btab header. That would make this story not very interesting. I don’t think he did that, because in his version of the program there was never any reason to call yank-panda; you used only the first entry in the header, which was always present in the btab whether the header was truncated or not. But it still could be that, before that, there was a time when there was a need to call yank-panda; and some programmer years cleverly realized that the databases ordered last in Panda were the same ones that had a non-standard identifier format, and uncleverly designed that into the code without writing any comments or allowing for the possibility that identifier formats or Panda formats or databases used would change.
This is such an awesome programming story that I would like to see it expanded—more exact details of how the bug worked, with more specific code. In particular, I’d like to see this section expanded:
If you really want the thrilling details...
We use a program called BLAST to compare every new gene to every known gene. To reduce the number of comparisons, we use a database called Panda that groups together known genes that produce the same protein sequence. A Panda header is a line containing an identifier and a string called an ‘annotation’ for each gene, separated by ^|^, like this:
SP|P49991|DNAA_MYCBO Chromosomal replication initiator protein dnaA taxon:1765 {Mycobacterium bovis;} (exp=0; wgp=-1; cg=-1; closed=-1; pub=0; rf_status= ;)^|^RF|NP_853671.1|31791178|NC_002945 chromosomal replication initiation protein taxon:233413 {Mycobacterium bovis AF2122/97;} (exp=1; wgp=1; cg=1; closed=1; pub=1; rf_status=provisional;)^|^GB|CAD92863.1|31616763|BX248334 CHROMOSOMAL REPLICATION INITIATOR PROTEIN DNAA taxon:233413 {Mycobacterium bovis AF2122/97;} (exp=1; wgp=1; cg=1; closed=1; pub=1; rf_status=;)^|^OMNI|NTL01MB0001||31616763| CHROMOSOMAL REPLICATION INITIATOR PROTEIN DNAA taxon:233413 {Mycobacterium bovis AF2122/97;} (exp=-1; wgp=-1; cg=-1; closed=-1; pub=-1; rf_status= ;)
The entries in a Panda header are sorted by the reliability of the database that each entry came from. SwissProt (SP) identifiers come first, because SwissProt is curated, meaning a human has looked at and approved every entry in the database. RefSeq (RF) and Genbank (GB) come near the end, because they typically have annotations that were assigned by some other computer program running BLAST and transferring it from a SwissProt match. The less-reliable computer-produced databases are, naturally, the largest; so most matches are to RF or GB identifiers.
So, we only BLAST each gene against the first gene listed in each line of the Panda database. We save the 250 best matches and their Panda headers in a file called the btab.
When trying to assign a function to the new gene, the program, among other things, picks the best matches that pass various tests, and looks to see if any of those genes have known functions. It does this by looking up their identifiers in various databases.
When I inherited the program, this is what it did: Take the identifier of the matched gene, load its Panda header from the btab and store it in a variable. Then look up the Panda header for that identifier using a slow program called yank-panda, and write that over the version it just copied from the btab. Take only the first entry in the returned header, and look up that identifier in the databases.
This was silly, both because we already had the Panda header, and because the first entry in the saved header was always the same as the identifier used to call yank_panda to find the header. It was a very slow way of using an identifier to look itself up. But I realized that many of the databases we checked for function assignments listed genes under the alternate identifiers. So I had the program check all of the genes in the returned Panda header; and this resulted in getting more annotations.
Then I modified the program to just read the copy of the Panda header from the btab; and not call yank-panda at all; making the program run much, much faster. But this caused the output to change so as to miss some annotations. I discovered that, in some cases, the copy of the Panda header in the btab was truncated; and it was difficult to tell whether a header was truncated or not, since the annotation has no standard format.
So I restored the call to yank-panda, and stored the result in the variable that previously had the (possibly-truncated) header copied from the btab, to make sure to get the full header.
The call to yank-panda failed silently for 3 types of identifiers: RefSeq, GenBank, and Omnium. That’s because these identifiers have 4 parts instead of 3, like so:
RF|NP_853671.1|31791178|NC_002945
These identifiers fail to match this regular expression, which is deep inside a bioinformatics library that yank_panda calls that parses identifiers:
my $search_match = qr/^([A-Za-z]{2,})|([\w.]+)?|?([\w.]+)?$/;
However, because Panda headers are sorted by database, and RefSeq (RF) and GenBank (GB) identifiers always come near the end, it’s never the case both that a GB or RF identifier is the first in the header, and that the header is long enough to be truncated. So in these cases the regular expression fails to match, yank-panda returns nothing, and the program fails to notice and goes ahead and uses the header from the btab, which is the same as would have been returned by yank-panda, because headers starting with RF or GB are never truncated.
So, you could argue that this beneficial bug was possible only because the code was poorly-written in the first place. It could also be the case that the program’s original author knew that RF and GB identifiers would fail when calling yank-panda, and that’s why he copied the original btab header. That would make this story not very interesting. I don’t think he did that, because in his version of the program there was never any reason to call yank-panda; you used only the first entry in the header, which was always present in the btab whether the header was truncated or not. But it still could be that, before that, there was a time when there was a need to call yank-panda; and some programmer years cleverly realized that the databases ordered last in Panda were the same ones that had a non-standard identifier format, and uncleverly designed that into the code without writing any comments or allowing for the possibility that identifier formats or Panda formats or databases used would change.