I received a similar email and was able to download my genome file a few days ago. The file is 23andMe format output by Plink. It was text even though it had a .gz suffix. I had trouble uploading the file to Promethease, but was able to get it working by changing the header to one copied from an actual 23andMe file and removing the missing (--) SNPs. Unfortunately, despite being ~125MB (~5x the size of an example 23andMe file I have) my file is missing many of the 23andMe SNPs (7948 genotypes annotated in Promethease vs. 20k+ for the 23andMe example). I have an email in to BGI requesting additional information. For example, Promethease directly supports the dbSNPAnnotated.bz2 Complete Genomics file and I was hoping to get a copy of that file for my data.
Have you had any success analyzing your results? Would anyone be interested in starting a discussion group for analyzing our BGI results?
Are you sure you’ve downloaded your entire genome file? My uncompressed file is about 500 MB, and I got about 26000 annotations on Promethease. It seems like your file might have gotten truncated during the download.
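One quick way to compare two files like these, beyond their size on disk, is to count the genotype rows and the no-calls in each. Here is a minimal Python sketch, assuming the 23andMe-style layout (tab-separated rsid, chromosome, position, genotype, with # header comments and -- for a missing call). The name count_genotypes is a hypothetical helper, not part of Plink or Promethease.

```python
from typing import Iterable, Tuple


def count_genotypes(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (data_rows, no_calls) for 23andMe-style text lines.

    Assumes tab-separated columns with the genotype in the last column
    and '--' marking a missing call.
    """
    data_rows = 0
    no_calls = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        data_rows += 1
        if line.split("\t")[-1] == "--":
            no_calls += 1
    return data_rows, no_calls


# Example use on an uncompressed file (the path is an assumption):
# with open("genome.txt") as handle:
#     print(count_genotypes(handle))
```

If the row counts of two participants' files differ by a large factor, that points to a real difference in the data rather than in how it was compressed or uploaded.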
Short step-by-step guide for those who want to get their genome annotated by Promethease:
Use the ‘Download All Files’ link on the SpiderOak page to download your genome file.*
Unzip then gunzip to get the raw text file genome.txt.
Open the file in a text editor. Remove all the commented lines at the beginning of the file except the last one (i.e., keep the line starting with # rsid; Promethease chokes if you don’t) and save. This is required to get Promethease to recognize the file.
(optional) Compress the edited file with zip, gzip, or bzip2 to save upload time and bandwidth.
Upload to Promethease and follow the directions there.
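For anyone who would rather not open a file this large in a text editor, the unzip, edit, and recompress steps above can be sketched in Python. This is only a sketch under assumptions: the input is the genome.txt.gz taken out of the downloaded zip, the header is the run of # lines at the top, and the function names and output path are made up for illustration.

```python
import gzip
from typing import Iterable, Iterator


def strip_extra_header(lines: Iterable[str]) -> Iterator[str]:
    """Drop the leading '#' lines except the last one (the '# rsid' line)."""
    last_comment = None
    in_header = True
    for line in lines:
        if in_header and line.startswith("#"):
            last_comment = line  # remember only the most recent header line
            continue
        if in_header:
            in_header = False
            if last_comment is not None:
                yield last_comment  # emit the kept header line once
        yield line
    if in_header and last_comment is not None:
        yield last_comment  # input contained nothing but header lines


def clean_genome(src: str, dst: str) -> None:
    """Read a gzip file, strip the extra header lines, write a new gzip file.

    Reading the input through to the end makes Python's gzip module check
    the CRC-32 and length stored in the gzip trailer, so a corrupted or
    truncated input raises an error instead of passing silently.
    """
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        fout.writelines(strip_extra_header(fin))


# Example use (both paths are assumptions):
# clean_genome("genome.txt.gz", "genome.clean.txt.gz")
```

The output is already gzip-compressed, so it covers the optional compression step as well.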
* I advise against downloading the genome.txt.gz file directly because for some reason SpiderOak has Content-encoding: gzip in their HTTP response header, which means that browsers will transparently uncompress that file. This makes me uneasy because there is no checksum provided for the (somewhat large) plain text file, so we have little protection against corruption and truncation. In contrast, by using ‘Download All Files’ to download everything in a zip, the data’s integrity will be automatically verified against CRC-32 checksums when we unzip and gunzip locally.
Thanks for the explanation and tips! I used your procedure and ended up with the same 131MB file. Interestingly I did not need to remove the “--” entries. I have been exchanging email with BGI and they indicated files could have significantly different number of entries (but I am surprised at >3x!). Is there any chance your sequencing had greater than 4x coverage? My VCF file is queued up and should be available in a few months which should help clarify what I am seeing.
I don’t know. How do I find out?
I think the VCF would tell you if you had it. Another possibility would be using a lower quality threshold for calling SNPs, but that seems unlikely.
I signed up with 23andMe, a few days before getting that letter from BGI. I’m currently waiting for both results. Can anyone point me to a good resource for studying what the data mean and what I can do with them?
I think Promethease (http://promethease.com) is a good and inexpensive ($5) start. If you have both sets of results I would recommend using 23andMe given my experience with uploading BGI data. Web searching “promethease review” will give some details and alternatives. Hopefully those of us in the BGI study can work out a good way of analyzing that data.
I’m another participant. I’m still waiting for my results, but would be interested in any discussion group for analysis.