I received exactly the same number of SNPs from BGI, so it looks like our data were processed under the same pipeline. I’ve found three people who have publicly posted their BGI data: two at the Personal Genome Project (hu2FEC01 and hu41F03B, each with 5,095,048 SNPs), and one on a personal website (with 18,217,058 SNPs).
Then there are a few thousand SNPs that one or the other analysis (in 26 cases, both) lists in its output but doesn’t report a genotype for. What causes this?
The double dashes are no-calls. 23andMe reports on a fixed list of SNPs, and when they can’t confidently determine the genotype of an SNP, they mark it with a double dash instead of omitting it.
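If you want to tally these yourself, here is a rough Python sketch. It assumes the 23andMe raw data layout (tab-separated rsid, chromosome, position, genotype, with `#` comment lines) and assumes the other data set has already been loaded into a dict that also uses `--` for no-calls, which may not be how BGI marks them. The function names are my own, and the comparison ignores strand differences, so treat the mismatch count as an upper bound.

```python
from collections import Counter

NO_CALL = "--"  # 23andMe's marker; assumed to be used for the other source too


def parse_23andme(lines):
    """Map rsid -> genotype from 23andMe-style raw data lines."""
    genotypes = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        rsid, _chrom, _pos, genotype = line.split("\t")[:4]
        genotypes[rsid] = genotype
    return genotypes


def compare_calls(a, b):
    """Tally SNPs present in both sources as match, mismatch, or no-call."""
    tally = Counter()
    for rsid in a.keys() & b.keys():
        ga, gb = a[rsid], b[rsid]
        if ga == NO_CALL and gb == NO_CALL:
            tally["no_call_both"] += 1
        elif ga == NO_CALL or gb == NO_CALL:
            tally["no_call_one"] += 1
        elif sorted(ga) == sorted(gb):  # "AG" and "GA" are the same genotype
            tally["match"] += 1
        else:
            tally["mismatch"] += 1
    return tally
```

Only SNPs listed in both files are counted; anything reported by just one source is skipped.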
Is this amount of mismatch typical for such analyses?
This seems normal considering the 23andMe error rates that others have been reporting (example). I don’t know about BGI’s error rates.
I think it might be possible to accurately guess the actual genotypes for some of the mismatches by imputing them with something like IMPUTE2 (for each mismatched SNP, leave it out and impute it from the nearby SNPs). This will take many hours of work, though, and you might as well phase and impute across the whole genome if you have the time, interest, and processing power to do so (I’ve been meaning to try this out to learn more about how these things work).
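To make the leave-one-out idea a bit more concrete, here is a small Python sketch of just the bookkeeping step: for each mismatched SNP, collect the nearby SNPs that would be used to impute it. It does not run the imputation itself. The function name, the 250 kb window, and the choice to drop every mismatched SNP from the flanking sets (since none of them can be trusted) are my own assumptions, not something the imputation software requires.

```python
from bisect import bisect_left, bisect_right


def leave_one_out_panels(snps, mismatched, window=250_000):
    """Yield (target_rsid, flanking_rsids) for each mismatched SNP.

    snps: list of (rsid, position) for ONE chromosome, sorted by position.
    mismatched: set of rsids whose calls disagree between the two data sets.
    window: hypothetical flank size in base pairs on each side of the target.
    """
    positions = [pos for _, pos in snps]
    for rsid, pos in snps:
        if rsid not in mismatched:
            continue
        lo = bisect_left(positions, pos - window)
        hi = bisect_right(positions, pos + window)
        # Keep only concordant neighbours; the target itself is excluded
        # because it is in the mismatched set.
        flank = [r for r, _ in snps[lo:hi] if r not in mismatched]
        yield rsid, flank
```

Each flanking set would then need to be written out in whatever input format the imputation tool expects, which I have not tried to show here.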
Interesting. Thanks for posting this!