Are you sure you’ve downloaded your entire genome file? My uncompressed file is about 500 MB, and I got about 26000 annotations on Promethease. It seems like your file might have gotten truncated during the download.
A short step-by-step guide for those who want to get their genome annotated by Promethease:
1. Use the ‘Download All Files’ link on the SpiderOak page to download your genome file.*
2. Unzip the archive, then gunzip the result to get the raw text file genome.txt.
3. Open the file in a text editor. Remove all the comment lines at the beginning of the file except the last one (i.e., keep the line starting with # rsid) and save. Promethease will not recognize the file otherwise.
4. (Optional) Compress the edited file with zip, gzip, or bzip2 to save upload time and bandwidth.
5. Upload to Promethease and follow the directions there.
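If opening a 500 MB file in a text editor is painful, the header-trimming and compression steps can be scripted instead. The following is only a rough sketch, not anything Promethease or BGI provides: the function names and the output name genome.trimmed.txt.gz are made up for illustration, and it assumes the comment lines start with # and that the last of them is the # rsid column header.

```python
import gzip


def strip_leading_comments(lines):
    """Yield lines, dropping every leading '#' line except the last one.

    The last leading comment is assumed to be the '# rsid ...' column header.
    Lines after the header block are passed through unchanged.
    """
    last_comment = None
    in_header = True
    for line in lines:
        if in_header and line.startswith("#"):
            # Remember only the most recent comment line seen so far.
            last_comment = line
            continue
        if in_header:
            # First non-comment line: emit the kept header line, if any.
            in_header = False
            if last_comment is not None:
                yield last_comment
        yield line
    # Handle a file that consists of comment lines only.
    if in_header and last_comment is not None:
        yield last_comment


def clean_genome(src_path, dst_path):
    """Read a plain-text genome file and write a trimmed, gzipped copy."""
    with open(src_path, "rt", encoding="utf-8") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        dst.writelines(strip_leading_comments(src))
```

Calling clean_genome("genome.txt", "genome.trimmed.txt.gz") would then produce a file ready to upload. It streams line by line, so the whole genome never has to fit in memory.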
* I advise against downloading the genome.txt.gz file directly because, for some reason, SpiderOak sends Content-Encoding: gzip in its HTTP response headers, which means that browsers will transparently decompress the file. This makes me uneasy because no checksum is provided for the (somewhat large) plain text file, so we have little protection against corruption and truncation. In contrast, when we use ‘Download All Files’ to download everything in a zip, the data’s integrity is automatically verified against CRC-32 checksums when we unzip and gunzip locally.
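For anyone who would rather check the compressed file explicitly than trust that gunzip would have complained, here is a minimal sketch using Python's standard gzip module. It reads the whole file and lets the module compare the data against the CRC-32 and length stored in the gzip trailer. The function name gzip_is_intact is invented for this example.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses cleanly.

    Reading to the end makes the gzip module check the stored CRC-32 and
    length. A truncated file raises EOFError, a checksum mismatch raises
    an OSError subclass, and a damaged deflate stream raises zlib.error.
    """
    try:
        with gzip.open(path, "rb") as fh:
            # Discard the data; we only care whether reading succeeds.
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True
```

Running gzip_is_intact("genome.txt.gz") on the file extracted from the zip should return True. The outer zip archive can be checked in a similar way with zipfile.ZipFile(...).testzip(), which returns None when every member passes its CRC check.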
Thanks for the explanation and tips! I used your procedure and ended up with the same 131 MB file. Interestingly, I did not need to remove the “—” entries. I have been exchanging email with BGI, and they indicated that files could have significantly different numbers of entries (but I am surprised at >3x!). Is there any chance your sequencing had greater than 4x coverage? My VCF file is queued up and should be available in a few months, which should help clarify what I am seeing.
I don’t know. How do I find out?
I think the VCF would tell you, if you had it. Another possibility is that a lower quality threshold was used for calling your SNPs, but that seems unlikely.