Without wanting to claim complete coverage of the subject, let me talk about a few relevant issues:
Let’s look at what the word ‘gene’ is supposed to mean in the first place.
A while back the belief was that DNA mainly exists to be turned into proteins. A gene was supposed to be a sequence that’s translated into a protein.
Today we know that a lot of DNA exists to be transcribed into RNA without producing proteins. Depending on the circumstances you might count RNA-producing DNA as genes or not.
When you take a string of DNA that can produce a protein, it’s possible that different splicing (cutting out the introns and joining the remaining exons in different combinations) produces a different protein. Humans seem to have somewhere between 20,000 and 25,000 protein-coding genes but more than 100,000 proteins. That drastic difference in numbers was a surprise to everyone when we did the Human Genome Project.
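To illustrate how one gene can yield several transcripts, here is a minimal sketch that enumerates the variants you get by skipping optional exons. The exon sequences and the function name `splice_variants` are made up for illustration, and exon skipping is only one of several splicing modes found in real cells.

```python
from itertools import combinations

def splice_variants(exons, optional):
    """Return index tuples for every variant made by skipping any subset of optional exons."""
    variants = []
    for r in range(len(optional) + 1):
        for skipped in combinations(sorted(optional), r):
            kept = tuple(i for i in range(len(exons)) if i not in skipped)
            variants.append(kept)
    return variants

# Hypothetical exons of a single gene (not real sequence data).
exons = ["ATGGCC", "GATTAC", "CCGTTT", "GGATAA"]

# Assume the two middle exons can each be skipped independently.
variants = splice_variants(exons, optional={1, 2})
mrnas = {"".join(exons[i] for i in kept) for kept in variants}

print(len(mrnas), "distinct transcripts from one gene")
```

With only two optional exons this already gives four transcripts from one gene, which hints at how the protein count can outrun the gene count.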
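There seem to be multiple copies of some genes. It’s not clear whether you should count them multiple times, and you can’t count repetitions in DNA well because we sequence DNA via shotgun sequencing: the DNA gets broken into short fragments that are reassembled by overlap, so repeated stretches tend to collapse into one.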
If you compare the gene for human insulin with the one for chimpanzee insulin, you have several options. You can count both species as having the insulin gene. You could use the match score between human insulin and chimpanzee insulin. Or you could say that it’s a different gene because it’s not exactly the same.
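One very simple form of such a match score is the percentage of identical positions. The sketch below assumes two sequences of equal length that are already aligned. The sequences are made up and are not the real insulin genes, and real comparisons use alignment tools that also handle insertions and deletions.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

# Hypothetical aligned fragments from two species (illustrative only).
seq_a = "ATGGCCCTGTGGATGCGC"
seq_b = "ATGGCCCTGTGGATGCGT"

score = percent_identity(seq_a, seq_b)
print(f"{score:.1f}% identical")
```

Whether a score just short of 100% counts as "the same gene" is exactly the definitional question raised above.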
In the last case you have to think about what “the same” means. Is it enough that the same protein gets produced, or do you also want the exact same DNA? There are 64 different three-base combinations (codons) and only 20 (+1) different amino acids, so most amino acids are encoded by more than one codon. Swapping one such codon for another leaves the protein unchanged but can change the amount of protein that gets produced. When producing human insulin in the lab, one for example switches those codons to maximize protein production.
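The redundancy of the code can be checked directly. The sketch below builds the standard genetic code table and translates two different DNA strings into the same peptide. The two sequences are made up for illustration, and `translate` is a hypothetical helper that ignores everything except the forward reading frame.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate DNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Two made-up sequences that differ in three codons but encode the same peptide.
dna_1 = "ATGGCCCTGTGGATGCGC"
dna_2 = "ATGGCTTTATGGATGAGA"

print(len(CODON_TABLE), "codons,", len(set(CODON_TABLE.values()) - {"*"}), "amino acids")
print(translate(dna_1), translate(dna_2))
```

So "same protein" and "same DNA" really are two different criteria, and a comparison has to pick one.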
Lastly, it’s not quite clear which DNA sequences actually get translated into proteins. One test is to let yeast or another organism produce the protein based on the gene, but that’s expensive. It’s also possible that yeast simply lacks something needed to read that particular gene. In the absence of that proof we have imperfect computer models that suggest which DNA sequences look like genes and which don’t.
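To give a feel for what the crudest such computer model looks like, here is a toy open-reading-frame scanner: it reports stretches that start with ATG and run to a stop codon in the same frame. This is only a sketch under strong assumptions (forward strand only, no introns, a made-up input sequence). Real gene predictors use far richer statistical models, which is part of why their output remains uncertain.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end) pairs of forward-strand ORFs, end exclusive, stop codon included."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)

# Hypothetical sequence, not taken from any real genome.
example = "CCATGGCCCTGTAACC"
orfs = find_orfs(example)
for start, end in orfs:
    print(start, end, example[start:end])
```

A scanner like this flags many stretches that are never expressed and misses genes interrupted by introns, so its hits are candidates, not proof.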
The official protein database UniProt therefore has TrEMBL (uncurated data, with errors) and Swiss-Prot (curated data that’s supposed to be more trustworthy).
That uncertainty is high enough that the official number of human protein-coding genes still gets quoted as 20,000-25,000.
In addition to looking at the sequenced DNA, you can also look at single-nucleotide polymorphisms (SNPs) with genotyping chips. Those chips test for a selection of specific mutations and could also be used as the basis for a number describing how much two organisms differ in their genes. At the moment the latest 23andMe chip looks at 577,382 autosomal DNA (atDNA) SNPs.