These seem pretty easy to answer even for a non-expert.
It is variously said that we share 99% of our genes with a chimpanzee, 95% of our genes with a random human, and 50% of our genes with a sibling. Explain how these can all be true statements.
Without wanting to claim complete coverage of the subject, let me talk about a few relevant issues:
Let’s look at what the word ‘gene’ is supposed to mean in the first place.
A while back there was the belief that DNA mainly exists to encode proteins. A gene was supposed to be a sequence that’s transcribed and then translated into a protein.
Today we know that a lot of DNA is transcribed into RNA without producing proteins. Depending on the circumstances, you might or might not count RNA-producing DNA as genes.
When you take a string of DNA that can produce a protein, it’s possible that splicing out the introns in different ways (alternative splicing) produces different proteins. Humans seem to have somewhere between 20,000 and 25,000 protein-coding genes but more than 100,000 proteins. That drastic difference in numbers was a surprise to everyone when we did the Human Genome Project.
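As a toy illustration of how one gene can yield several proteins, the sketch below enumerates the transcripts you get from a made-up gene by keeping or skipping optional exons. The exon names and the choice of which exons are skippable are invented for the example; real splicing is regulated, and not every combination actually occurs.

```python
from itertools import product

# Hypothetical gene: exon names and which ones are skippable are invented.
EXONS = ["E1", "E2", "E3", "E4"]
OPTIONAL = {"E2", "E3"}  # assumed exons that may be skipped


def isoforms(exons, optional):
    """Return every exon combination obtained by keeping or skipping optional exons."""
    choices = [(True, False) if e in optional else (True,) for e in exons]
    result = []
    for keep in product(*choices):
        result.append(tuple(e for e, k in zip(exons, keep) if k))
    return result


variants = isoforms(EXONS, OPTIONAL)
for v in variants:
    print("-".join(v))
print(len(variants), "transcripts from 1 gene")
```

With two skippable exons the count doubles twice, so even a handful of optional exons per gene is enough to push the number of proteins well past the number of genes.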
There seem to be multiple copies of some genes. It’s not clear whether you should count them multiple times, and you can’t count repetitions in DNA well because we sequence DNA via shotgun sequencing, which reassembles the genome from short fragments.
If you compare the gene for human insulin with the one for chimpanzee insulin, you have several options. You can count it as both species having insulin. You could use the match score between human insulin and chimpanzee insulin. Or you could say that it’s a different gene because it’s not exactly the same.
In the last case you have to think about what “the same” means. Is it enough that the same protein gets produced, or do you also want the exact same DNA? There are 64 different three-base combinations (codons) and only 20 (+1) different amino acids, so some amino acids are encoded by multiple codons. Swapping one such codon for another could, however, change the amount of protein that gets produced. When producing human insulin in the lab, for example, one switches those codons to maximize protein production.
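To make the codon arithmetic concrete, here is a minimal sketch that builds the standard genetic code table and translates two different DNA strings into the same peptide. The two example sequences are made up, and the `translate` helper is a simplification that ignores start codons and reading-frame issues.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...; "*" marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a DNA coding sequence codon by codon (no start/stop handling)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


codons_per_symbol = Counter(CODON_TABLE.values())
print(len(CODON_TABLE), "codons,", len(codons_per_symbol) - 1, "amino acids plus stop")

# Two made-up sequences that differ in their DNA but encode the same peptide.
seq_a = "CTGGCCTCA"
seq_b = "TTAGCTAGT"
print(translate(seq_a), translate(seq_b))
```

So "same gene" can mean "same DNA string" or "same protein", and the two definitions disagree on exactly this kind of pair.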
Lastly, it’s not quite clear which DNA sequences actually get translated into proteins. One test is to let yeast or another organism produce the protein based on the gene, and that’s expensive. It’s also possible that yeast simply lacks something needed to read that particular gene. In the absence of that proof, we have imperfect computer models that suggest which DNA sequences look like genes and which don’t.
The official protein database UniProt therefore has TrEMBL (uncurated data, with errors) and Swiss-Prot (curated data that’s supposed to be more trustworthy).
That uncertainty is high enough that the official number of human protein-coding genes still gets quoted as 20,000 to 25,000.
In addition to looking at the sequenced DNA, you can also look at single-nucleotide polymorphisms (SNPs). SNP chips test for a selection of specific mutations and could also be used as the basis for a number describing how much two organisms differ in their genes. At the moment the latest 23andMe chip looks at 577,382 atDNA SNPs.
99% of our genes have a chimp equivalent and vice versa.
95% of our genes with a random human
In 95% of something or other of their genome, two random humans have the exact same allele. In the other 5% they are no more similar than a human and a chimp are in the 99% that are shared between the species.
and 50% of our genes with a sibling
At 50% of the loci where humanity has different alleles, two full siblings have identical alleles. This is a theoretical number for average siblings in a population with no inbreeding or population structure; actual siblings tend to be much more similar.
Non-expert here, but here are my two cents:
If you sequence your DNA and the DNA of a random chimp, and consider only the substrings that can be identified as genes, and measure string similarity between them, you will get a number between 98% and 99%, depending on the choice of string similarity measure (there are many reasonable choices).
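To show how much the choice of similarity measure can matter, here is a small sketch comparing two reasonable measures on made-up sequences: a position-by-position identity and a similarity based on Levenshtein edit distance. The sequences and function names are invented for illustration, and real comparisons use proper alignment tools rather than this naive code.

```python
def positional_identity(a, b):
    """Fraction of positions with the same base when compared index by index."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def edit_distance(a, b):
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # match or substitute
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """1 minus the edit distance, normalised by the longer length."""
    return 1 - edit_distance(a, b) / max(len(a), len(b))


reference = "ACGTACGTAC"      # made-up sequence
substituted = "ACGTTCGTAC"    # one base changed
deleted = "AGTACGTAC"         # one base removed near the start

for name, other in [("substitution", substituted), ("deletion", deleted)]:
    print(name, positional_identity(reference, other), edit_similarity(reference, other))
```

A single substitution scores the same under both measures, but a single deleted base shifts everything after it, so the index-by-index measure collapses while the edit-based one barely moves. Whether insertions and deletions are counted is one reason why published human-chimp figures differ.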
95% of our genes with a random human
Never heard that before.
50% of our genes with a sibling
Suppose a unique id tag was attached to all the gene strings in the DNA of each of your parents. Even if the same gene appears in both of your parents, or even if it appears multiple times in the same parent, each instance gets a different id.
Then your parents mate and produce you and your sibling. On average, you and your sibling will share 50% of these gene ids.
Of course, many of these genes with different ids will be identical strings, hence the genetic similarity measured as in the human-chimp case will be > 99.9%.
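The id-tag thought experiment can be sketched as a small simulation. This is a simplified model under stated assumptions: every locus is inherited independently (real chromosomes are passed on in long linked blocks, so actual sibling pairs scatter around the average), and the tag names and the number of loci are arbitrary.

```python
import random


def make_child(rng, n_loci):
    """Pick, per locus, one of each parent's two tagged copies at random."""
    from_mother = [rng.choice(("mother_copy_1", "mother_copy_2")) for _ in range(n_loci)]
    from_father = [rng.choice(("father_copy_1", "father_copy_2")) for _ in range(n_loci)]
    return from_mother, from_father


def shared_fraction(child_a, child_b):
    """Fraction of inherited copies carrying the same id tag in both children."""
    shared = 0
    total = 0
    for copies_a, copies_b in zip(child_a, child_b):
        shared += sum(x == y for x, y in zip(copies_a, copies_b))
        total += len(copies_a)
    return shared / total


rng = random.Random(0)   # fixed seed so the run is repeatable
N_LOCI = 10_000          # arbitrary number of independent loci
you = make_child(rng, N_LOCI)
sibling = make_child(rng, N_LOCI)
print(f"shared id tags: {shared_fraction(you, sibling):.3f}")
```

The printed fraction should land close to one half. Note that it only counts matching tags; it says nothing about how many differently tagged copies happen to be identical strings.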
Saying this as a non-expert, the percentages are obviously taken over different gene pools (e.g. there is no reason to count genes in common with a chimpanzee when you are comparing two humans or two siblings.)
This confuses me. I find it highly unlikely that the average human shares more genes with a chimpanzee than with another human, and even more unlikely that siblings only share 50% of their genes.
probability estimates (statement is true):
99% genetic similarity to a chimpanzee = 75%
95% genetic similarity to a random human = a low nonzero number
50% genetic similarity to a sibling = 0%
95% genetic similarity to a random human given 99% genetic similarity to a chimp = 0%
I am going to research this.
EDIT: findings:
Researching an actual number is exceedingly difficult. About 50% of the pages are non-secular websites (this may be my non-optimized Google searching). The rest are a mix of technical articles and articles formatted for the average human (average meaning living in an English-speaking, developed nation).
99% genetic similarity to a chimpanzee
Mostly correct. Estimates range between 95%^[1] and 98.8%^[2]
95% genetic similarity to a random human
Incorrect. Estimates put the difference at about 0.1%^[1], which means roughly 99.9% similarity. I did not notice other numbers.
50% genetic similarity to a sibling
Incorrect as you stated it (comparing total gene dissimilarity). You might want to reword it, since you were probably comparing what percentage of genes can be attributed to a parent.
[1] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC129726/
[2] http://humanorigins.si.edu/evidence/genetics
It puzzles me as well. I believe the answer is that there are multiple concepts of “shared genes”, but I have never been clear what they are.
That depends on the meaning of “our”. A smaller and smaller subset of genes is being considered as you shift focus from chimp to human to sibling. In the chimp example, the statistic may as well have been made for your entire genome, including stuff like genes coding for the cell membrane (which don’t vary wildly with species/taxonomy and are more likely to vary with tissue type; don’t know, not a biologist). In the sibling example, you take for granted that the greatest part of your genome is going to be shared by virtue of both of you being human, exclude those genes, and only count the rest.
If you establish similarity/difference by counting the same set of genes (for instance all of them, like with chimps), the similarity between you and your sibling might fall only very, very slightly below 100%, and that’s not exactly telling us anything useful, is it?
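A rough back-of-the-envelope version of that point, as a sketch. Both inputs are illustrative rather than measured: the 0.1% figure is the human-to-human difference quoted elsewhere in this thread, 50% is the theoretical sharing between full siblings, and the model simply assumes that only copies not inherited in common can differ, and that they differ as much as those of two random humans.

```python
# Illustrative inputs, not measurements.
RANDOM_HUMAN_DIFFERENCE = 0.001   # 0.1% figure quoted elsewhere in this thread
SIBLING_SHARED_BY_DESCENT = 0.5   # theoretical expectation for full siblings


def sibling_similarity(random_difference, shared_by_descent):
    """Whole-genome similarity if only copies not shared by descent can differ."""
    return 1 - (1 - shared_by_descent) * random_difference


whole_genome = sibling_similarity(RANDOM_HUMAN_DIFFERENCE, SIBLING_SHARED_BY_DESCENT)
print(f"counting every position:        {whole_genome:.2%}")
print(f"counting only inherited copies: {SIBLING_SHARED_BY_DESCENT:.0%}")
```

Under these assumptions the same pair of siblings comes out at either 50% or 99.95%, depending purely on what goes in the denominator.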
At least this is how I understand it, and why that type of sentence doesn’t confuse me. Again, not a biologist, sorry for possible stupid mistakes/inaccuracies.
Disclaimer: Not remotely an expert at biology, but I will try to explain.
One can think of the word “gene” as having multiple related uses.
Use 1: “Genotype”. Even if we have different color hair, we likely both have the same “gene” for hair, which could be considered shared with chimpanzees. If you could rewrite DNA nucleobases, you could change your hair color without changing the gene itself; you would merely be changing the “gene encoding”. The word “genotype” refers to a “function” which takes in a “gene encoding” and outputs a “gene phenotype”.
Use 2: “Gene phenotype”. If we both have the same color hair, we would have the same “Gene phenotype”. Suppose the genotype for hair is a gene that uses simple dominance. In this case, we could have the same phenotype even with different gene encodings. Suppose you have the gene encoding “BB” whereas I have the gene encoding “Bb”. In this case, we could both have black hair, the same “Gene phenotype”, but have different “Gene encodings”.
Use 3: “Gene encoding”. If we have different color hair, then we have different gene encodings (but we have the same “genotype” as described in “Use 1”). This “gene encoding” is commonly not shared between siblings and less commonly shared between species.
So “we share 99% of our genes with a chimpanzee” likely refers to “Genotype”.
“95% of our genes with a random human” likely refers to “Gene phenotype”.
“50% of our genes with a sibling” likely refers to “Gene encoding”.