Bioinformatics is a neat example of how this bias can arise. Forgive me: I’m going to go into excessively nerdy detail about a specific example, because bioinformatics is cool.
Suppose that a biologist has amino acid sequences for 100 species’ versions of the same protein. The species are not closely related, and the sequences vary a lot from species to species. The biologist wants to find the parts of the protein that have stayed similar over a long evolutionary time. The usual way to do this is to try to line up matching parts of the sequences, inserting gaps and accepting mismatches where necessary.
Aligning multiple sequences is a very hard problem, computationally, so we have to use approximate methods. The most common approach breaks it down into a problem we can solve much more easily: aligning two sequences. These “progressive” algorithms compare all pairs of sequences to measure how similar they are, then clump similar sequences together in a tree that looks a lot like a diagram of evolutionary ancestry. Following that tree, they align the two most similar sequences, compute an average sequence for the clump, use that to add another sequence to it, and another, and another. At the end, your sequences should be aligned acceptably. You hope.
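If you want to see roughly what that looks like in code, here’s a toy version in Python. It’s a from-scratch sketch with made-up scoring (match +1, mismatch −1, gap −2) and a greedy “fold in the next most similar sequence” strategy instead of a proper guide tree, so it illustrates the idea rather than how any real tool works.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 1, -1, -2  # toy scoring; real tools use BLOSUM etc.

def nw_align(a, b):
    """Global (Needleman-Wunsch) alignment of two gapless sequences."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (MATCH if a[i-1] == b[j-1] else MISMATCH)
            score[i][j] = max(diag, score[i-1][j] + GAP, score[i][j-1] + GAP)
    # Walk back from the bottom-right corner to recover the alignment.
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i-1][j-1] + (
                MATCH if a[i-1] == b[j-1] else MISMATCH):
            out_a.append(a[i-1]); out_b.append(b[j-1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i-1][j] + GAP:
            out_a.append(a[i-1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j-1]); j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b))

def identity(a, b):
    """Fraction of identical columns in a pairwise alignment of a and b."""
    aa, bb = nw_align(a, b)
    return sum(x == y for x, y in zip(aa, bb)) / len(aa)

def consensus(rows):
    """The 'average sequence' for a clump: commonest residue per column."""
    out = []
    for col in zip(*rows):
        residues = [ch for ch in col if ch != '-']
        out.append(max(set(residues), key=residues.count) if residues else 'X')
    return ''.join(out)

def add_to_alignment(rows, seq):
    """Align seq against the clump's consensus, then push any new gap
    columns into every existing row so all rows stay the same width."""
    cons_aln, seq_aln = nw_align(consensus(rows), seq)
    new_rows, col = [''] * len(rows), 0
    for ch in cons_aln:
        if ch == '-':  # column the new sequence inserted: gap everyone else
            new_rows = [r + '-' for r in new_rows]
        else:          # pre-existing column: copy it through
            new_rows = [new_rows[i] + rows[i][col] for i in range(len(rows))]
            col += 1
    return new_rows + [seq_aln]

def progressive_align(seqs):
    """Start with the most similar pair, then greedily fold in whichever
    remaining sequence looks closest to the current clump's consensus."""
    remaining = list(seqs)
    a, b = max(combinations(remaining, 2), key=lambda p: identity(*p))
    remaining.remove(a); remaining.remove(b)
    rows = list(nw_align(a, b))
    while remaining:
        nxt = max(remaining, key=lambda s: identity(consensus(rows), s))
        remaining.remove(nxt)
        rows = add_to_alignment(rows, nxt)
    return rows

if __name__ == '__main__':
    for row in progressive_align(["MKVLITGAA", "MKVLTGA", "MKILITGAA", "MRVLITG"]):
        print(row)
```

A real progressive aligner builds an explicit guide tree and aligns whole profiles at each internal node, but the shape of the computation is the same.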
Of course, this assumes that some of the sequences are more closely related than others, and that you can form a nice tree shape. And it’s very approximate, and there’s lots of opportunity for error to creep in. So for some data, this works great, and for some data, it gives nonsense. Another method looks for small, very common subsequences, and iteratively refines the alignment based on these. Again, this works great for some data, and not so well for others. And then of course there are dozens of other methods, based on things like genetic algorithms, or simulated annealing, or hidden Markov models, and all of these have times when they work well and times when they don’t.
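The iterative flavor is easy to sketch too. Reusing the helpers from the toy aligner above, here’s one common refinement move, roughly the leave-one-out style some real tools use (not the subsequence-seeded kind just described): pull each sequence out, re-align it against the rest, and keep the change only if a crude sum-of-pairs score improves.

```python
def sp_score(rows, match=1, mismatch=-1, gap=-2):
    """Crude sum-of-pairs score: every pair of rows, every column."""
    total = 0
    for a, b in combinations(rows, 2):
        for x, y in zip(a, b):
            if x == '-' and y == '-':
                continue
            total += gap if '-' in (x, y) else (match if x == y else mismatch)
    return total

def refine(rows, passes=3):
    """Leave-one-out refinement: pull each sequence out, re-align it against
    the rest, and keep the change only if the overall score improves."""
    for _ in range(passes):
        for i in range(len(rows)):
            rest = [r for j, r in enumerate(rows) if j != i]
            # Drop columns that became all-gap once row i was removed.
            keep = [k for k in range(len(rest[0]))
                    if any(r[k] != '-' for r in rest)]
            rest = [''.join(r[k] for k in keep) for r in rest]
            candidate = add_to_alignment(rest, rows[i].replace('-', ''))
            candidate.insert(i, candidate.pop())  # restore row order
            if sp_score(candidate) > sp_score(rows):
                rows = candidate
    return rows
```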
So what does a biologist do? Try several methods, of course! The Right Way to do this is to run the same input through a bunch of algorithms, check whether several of them agree, and then apply any biological knowledge you may have to help decide what’s working. The way I’m sure some people actually do it is to reach for whatever multiple sequence alignment program they like most, run it, and trust its output if it’s not too surprising. If it surprises them, they might try another algorithm. If that produces something more like what they were expecting to see, they may then stop, because they’re busy, damn it, and they have work to do. Most biologists really don’t want to be computer scientists, and it’s tempting to treat the computer as a magic answer machine. (The obvious solution here is to rent some cloud computing time, run a bunch of different algorithms on your data every time you do multiple sequence alignment, and have your software automatically compare the results, along the lines of the sketch below. More dakka helps.)
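Here’s roughly what that comparison harness might look like. The tool commands are the basic invocations for MAFFT and Clustal Omega as I remember them (check your installed versions), the file names are made up, and the agreement score is just the fraction of aligned residue pairs the two alignments share.

```python
import subprocess
from itertools import combinations

def read_fasta(path):
    """Minimal FASTA parser: {name: sequence}."""
    seqs, name = {}, None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                name = line[1:].split()[0]
                seqs[name] = ''
            elif name:
                seqs[name] += line
    return seqs

def aligned_pairs(aln):
    """Every pair of residues an alignment puts in the same column, keyed by
    (sequence name, residue index within that ungapped sequence)."""
    pairs, counters = set(), {n: 0 for n in aln}
    for col in range(len(next(iter(aln.values())))):
        present = []
        for n, seq in aln.items():
            if seq[col] != '-':
                present.append((n, counters[n]))
                counters[n] += 1
        pairs.update(combinations(sorted(present), 2))
    return pairs

def agreement(aln_a, aln_b):
    """Jaccard overlap of the two alignments' aligned residue pairs."""
    pa, pb = aligned_pairs(aln_a), aligned_pairs(aln_b)
    return len(pa & pb) / len(pa | pb) if pa | pb else 1.0

# Command lines are the basic invocations I remember for these tools;
# check your versions. 'seqs.fasta' is a made-up input name.
tools = {
    'mafft':    'mafft seqs.fasta > mafft.fasta',
    'clustalo': 'clustalo -i seqs.fasta -o clustalo.fasta --force',
}
for cmd in tools.values():
    subprocess.run(cmd, shell=True, check=True)
alignments = {name: read_fasta(f'{name}.fasta') for name in tools}
for (n1, a1), (n2, a2) in combinations(alignments.items(), 2):
    print(f'{n1} vs {n2}: {agreement(a1, a2):.0%} agreement')
```

If the tools broadly agree, you can have some confidence in the conserved regions they all find; if they don’t, that disagreement is itself the interesting result.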
I don’t know how common this sort of error is in practice, but there’s certainly the potential for it.