These seem pretty easy to answer even for a non-expert.
It is variously said that we share 99% of our genes with a chimpanzee, 95% of our genes with a random human, and 50% of our genes with a sibling. Explain how these can all be true statements.
Without wanting to claim complete coverage of the subject, let me talk about a few relevant issues:
Let’s look at what the word ‘gene’ is supposed to mean in the first place.
A while back the belief was that DNA mainly exists to be translated into proteins. A gene was supposed to be a sequence that gets translated into a protein.
Today we know that a lot of DNA is transcribed into RNA without ever producing a protein. Depending on the circumstances, you might or might not count RNA-producing DNA as genes.
When you take a stretch of DNA that can produce a protein, alternative splicing (cutting out the introns in different ways) can produce different proteins from it. Humans seem to have somewhere between 20,000 and 25,000 protein-coding genes but more than 100,000 proteins. That drastic difference in numbers was a surprise to everyone when we did the Human Genome Project.
There seem to be multiple copies of some genes. It’s not clear whether you should count them multiple times, and you can’t count repetitions in DNA well because we sequence DNA via shotgun sequencing.
If you compare the gene for human insulin with the one for chimpanzee insulin, you can count both species as having an insulin gene. You could use the match score between human insulin and chimpanzee insulin. Or you could say that it’s a different gene because it’s not exactly the same.
In the last case you have to think about what “the same” means. Is it enough that the same protein gets produced, or do you also want the exact same DNA? There are 64 different three-base codons and only 20 (+1) different amino acids, so some amino acids are encoded by multiple codons. Those synonymous changes can, however, change the amount of protein that gets produced. When producing human insulin in the lab, for example, one switches codons to maximize protein production.
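To make the codon point concrete, here is a minimal Python sketch. The codon table is only a small excerpt of the standard genetic code, and the two toy sequences are made up for illustration (they are not real insulin sequences). It shows two DNA strings that differ at several positions but still translate to the same protein.

```python
from itertools import product

# Excerpt of the standard genetic code (DNA codon -> one-letter amino acid).
# Only the codons used by the toy sequences below; "*" marks a stop codon.
CODON_TABLE = {
    "ATG": "M",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L", "TTA": "L", "TTG": "L",
    "TGG": "W",
    "TAA": "*", "TAG": "*", "TGA": "*",
}


def translate(dna):
    """Translate a DNA string codon by codon using the excerpt above."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def dna_identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# 4 bases in 3 positions give 4**3 = 64 possible codons.
N_CODONS = len(list(product("ACGT", repeat=3)))

# Hypothetical toy sequences that differ only by synonymous codon choices.
SEQ_A = "ATGGCTCTTTGGTAA"
SEQ_B = "ATGGCGTTGTGGTGA"

print(N_CODONS)
print(translate(SEQ_A), translate(SEQ_B))
print(dna_identity(SEQ_A, SEQ_B))
```

Whether you call these “the same gene” depends on whether you compare the protein strings or the DNA strings.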
Lastly, it’s not quite clear which DNA sequences actually get translated into proteins. One test is to try to let yeast or another organism produce the protein based on the gene, and that’s expensive. It’s also possible that yeast simply lacks something it needs to read that particular gene. In the absence of that proof we have imperfect computer models that suggest to us which DNA sequences look like genes and which don’t.
The official protein database UniProt therefore has TrEMBL (uncurated data, with errors) and Swiss-Prot (curated data that’s supposed to be more trustworthy).
That uncertainty is high enough that the official number of human protein-coding genes still gets quoted as 20,000-25,000.
In addition to looking at the sequenced DNA, you can also look at single-nucleotide polymorphisms (SNPs). SNP chips go for a selection of specific mutations and could also be used as the basis for a number describing how two organisms differ in their genes. At the moment the latest 23andMe chip looks at 577,382 atDNA SNPs.
With a chimp: 99% of our genes have a chimp equivalent and vice versa.
With a random human: in 95% of something or other of their genome, two random humans have the exact same allele. In the other 5% they are no more similar than a human and a chimp are in the 99% that are shared between the species.
With a sibling: at 50% of the loci where humanity has different alleles, two full siblings have identical alleles. This is a theoretical number for average siblings in a population with no inbreeding or population structure; actual siblings tend to be much more similar.
Non-expert here, but here are my two cents:
If you sequence your DNA and the DNA of a random chimp, and consider only the substrings that can be identified as genes, and measure string similarity between them, you will get a number between 98% and 99%, depending on the choice of string similarity measure (there are many reasonable choices).
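As a rough illustration of how much the choice of string similarity measure matters, here is a small Python sketch. The two sequences are made-up toy strings, and `difflib.SequenceMatcher` stands in for a real sequence aligner, which is an assumption made purely for convenience. A single inserted base ruins a position-by-position comparison but barely affects an alignment-style score.

```python
from difflib import SequenceMatcher


def positional_identity(a, b):
    """Compare position by position; any unmatched tail counts as mismatch."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def alignment_like_similarity(a, b):
    """difflib's ratio: 2 * matched characters / total characters."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Hypothetical toy sequences: the second has one extra G after the third base.
REFERENCE = "ACGTACGTACGTACGT"
WITH_INSERTION = "ACGGTACGTACGTACGT"

print(positional_identity(REFERENCE, WITH_INSERTION))
print(alignment_like_similarity(REFERENCE, WITH_INSERTION))
```

Both functions are reasonable ways to score similarity, yet they give very different numbers for the same pair of strings.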
As for sharing 95% of our genes with a random human: never heard that before.
As for sharing 50% of our genes with a sibling: suppose a unique ID tag was attached to all the gene strings in the DNA of each of your parents. Even if the same gene appears in both of your parents, or even if it appears multiple times in the same parent, each instance gets a different ID.
Then your parents mate and produce you and your sibling. On average, you and your sibling will share 50% of these gene ids.
Of course, many of these genes with different ids will be identical strings, hence the genetic similarity measured as in the human-chimp case will be > 99.9%.
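The ID-tag thought experiment can be checked with a quick simulation. This is a minimal sketch under simplifying assumptions that are mine, not part of the argument above: every locus is inherited independently (no linkage), each parent carries two tagged copies per locus, and the function name and parameters are invented for illustration.

```python
import random


def sibling_shared_fraction(n_loci=10_000, seed=0):
    """Estimate the fraction of ID-tagged gene copies two siblings share."""
    rng = random.Random(seed)
    shared = 0
    for locus in range(n_loci):
        # Each parent carries two copies per locus, each with its own ID tag.
        mother = [("mother", locus, 0), ("mother", locus, 1)]
        father = [("father", locus, 0), ("father", locus, 1)]
        # Each child receives one randomly chosen copy from each parent.
        child_1 = {rng.choice(mother), rng.choice(father)}
        child_2 = {rng.choice(mother), rng.choice(father)}
        shared += len(child_1 & child_2)
    # Each child holds two tagged copies per locus.
    return shared / (2 * n_loci)


print(sibling_shared_fraction())
```

The result should hover around one half, even though the tagged copies themselves may be identical strings in most cases.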
Saying this as a non-expert, the percentages are obviously taken over different gene pools (e.g. there is no reason to count genes in common with a chimpanzee when you are comparing two humans or two siblings.)
This confuses me. I find it highly unlikely the average human shares more genes with a chimpanzee than another human and even more unlikely that siblings only share 50% of their genes.
probability estimates (statement is true):
99% genetic similarity to a chimpanzee = 75%
95% genetic similarity to a random human = a low nonzero number
50% genetic similarity to a sibling = 0%
95% genetic similarity to a random human given 99% genetic similarity to a chimp = 0%
I am going to research this.
EDIT: findings:
Researching an actual number is exceedingly difficult. About 50% of the pages are non-secular websites (this may be my non-optimized Google searching). The rest are a mix of technical articles and articles written for the average human (average meaning living in an English-speaking, developed nation).
99% genetic similarity to a chimpanzee
Mostly correct. Estimates range between 95% [1] and 98.8% [2].
95% genetic similarity to a random human
Incorrect. Estimates put the difference at 0.1% [1], i.e. roughly 99.9% similarity. I did not notice other numbers.
50% genetic similarity to a sibling
Incorrect as you stated it (comparing total gene dissimilarity). You might want to reword it, since you were probably comparing what percentage of genes can be attributed to a parent.
[1] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC129726/
[2] http://humanorigins.si.edu/evidence/genetics
It puzzles me as well. I believe the answer is that there are multiple concepts of “shared genes”, but I have never been clear what they are.
That depends on the meaning of “our”. A smaller and smaller subset of genes is being considered, as you shift focus from chimp to human to sibling. In the chimp example, the statistic may as well have been made for your entire genome, including stuff like genes coding for cell membrane (which doesn’t vary wildly with species/taxonomy, more likely to vary with tissue type—don’t know, not a biologist). In the sibling example, you take for granted that the greatest part of your genome is going to be shared by virtue of both of you being human, exclude those genes, and only count the rest.
If you establish similarity/difference by counting the same set of genes (for instance all of them, like with chimps), the similarity between you and your sibling might only be very, very slightly below 100%, and that’s not exactly telling us anything useful, is it?
At least this is how I understand it, and why that type of sentence doesn’t confuse me. Again, not a biologist, sorry for possible stupid mistakes/inaccuracies.
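A back-of-the-envelope version of that point, as a hedged Python sketch. It assumes the roughly 0.1% difference between two random humans quoted elsewhere in this thread and the 50% sibling figure, and it uses a deliberately crude model in which the part siblings inherit in common is identical and the rest differs like two random humans. The function and parameter names are made up for illustration.

```python
def expected_sibling_identity(unrelated_difference=0.001,
                              fraction_inherited_in_common=0.5):
    """Crude estimate of whole-genome identity between two siblings.

    Assumption: the part inherited in common is identical, and the rest
    differs at the same rate as between two unrelated humans.
    """
    differing_part = 1.0 - fraction_inherited_in_common
    return 1.0 - differing_part * unrelated_difference


print(expected_sibling_identity())
```

Under those assumptions the whole-genome number ends up just below 100%, which is why the sibling statistic is usually quoted over the smaller set instead.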
Disclaimer: Not remotely an expert at biology, but I will try to explain.
One can think of the word “gene” as having multiple related uses.
Use 1: “Genotype”. Even if we have different color hair, we likely both have the same “gene” for hair, which could be considered shared with chimpanzees. If you could rewrite DNA nucleobases, you could change your hair color without changing the gene itself; you would merely be changing the “gene encoding”. The word “genotype” refers to a “function” which takes in a “gene encoding” and outputs a “gene phenotype”.
Use 2: “Gene phenotype”. If we both have the same color hair, we would have the same “Gene phenotype”. Suppose the genotype for hair is a gene that uses simple dominance. In this case, we could have the same phenotype even with different gene encodings. Suppose you have the gene encoding “BB” whereas I have the gene encoding “Bb”. In this case, we could both have black hair, the same “Gene phenotype”, but have different “Gene encodings”.
Use 3: “Gene encoding”. If we have different color hair, then we have different gene encodings (but we have the same “genotype” as described in “Use 1”). This “gene encoding” is commonly not shared between siblings and less commonly shared between species.
So “we share 99% of our genes with a chimpanzee” likely refers to “Genotype”.
“95% of our genes with a random human” likely refers to “Gene phenotype”.
“50% of our genes with a sibling” likely refers to “Gene encoding”.
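The “BB” versus “Bb” example from Use 2 can be written as a tiny Python sketch. The function name and the label for the recessive outcome are hypothetical; only the black-hair case comes from the example above, and simple dominance is assumed as stated there.

```python
def hair_phenotype(gene_encoding):
    """Simple dominance: a single "B" allele is enough for black hair."""
    return "black" if "B" in gene_encoding else "not black"


mine = "Bb"
yours = "BB"

# Same "gene phenotype" despite different "gene encodings".
print(hair_phenotype(mine), hair_phenotype(yours))
print(mine == yours)
```

So two people can count as “sharing” this gene under Use 2 while not sharing it under Use 3.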
Exactly. You have to be an expert in order to know about all of the edge-cases that make definitions difficult.
Try it—“species” is definitely messy.
Off the top of my head, “A collection of organisms which can interbreed with each other and produce fertile offspring”, for sexual organisms, and “what humans decide is a species” for asexual organisms. Would an expert be able to do better? The word seems too old and the concept too vague to have a tight definition.
“Species” is not a clean concept in a world with viruses, clines, and ring species.
More precisely, “species” is a map marker made by someone who likes discrete, mostly tree-like maps (legacy of Aristotle?)
“Species” is one rung on the phylogenetic ladder. Whether a given edge case should be classified as a species or as a subspecies can be debated, but in practical terms it is useful to have a tree-like map, because it allows you to assess the phylogenetic distance between two groups.
Also, compared to the range from class to genus, “species” is relatively clear-cut.
That works as long as a virus doesn’t transfer genes from one species to the next and thus invalidate the tree structure.
It depends on your goal. What a lot of non-biologists don’t realize is that the ladder keeps going after species down through subspecies and beyond. In terms of bacteria, which do undergo horizontal gene transfer, we generally refer to them by their strain in addition to their species. The strain tells you where you got the culture, and, in lab settings, what it’s used for. CAMP Staphylococcus aureus is used for the CAMP test, for example—because you know where the strain comes from, you can be reasonably confident that it will behave like other bacteria of that strain. If you have a different strain of Staphylococcus aureus, you expect that it would probably also work for this test, but by the time you get as far away as Staphylococcus epidermidis, it’s quite unlikely that you could use it successfully for the CAMP test.
In theory, you could do a DNA extraction and see if your organism has the right genes to do what you want. In practice, it’s usually cheaper and easier to use a strain that you know has the right characteristics—even among bacteria with 20 minute generation times, genetic drift is still pretty slow, and what little selective pressure there is is pushing for the strain to keep its useful properties (i.e. we throw away bad cultures).
The phylogenetic tree model is used because it makes useful predictions about the world, not because it represents the way the world actually is.
Yes. I’m not denying that such models do have use. But on the other hand people outside of biology do often consider them to represent the world as it actually is.
I think we’re in agreement here.
Only if you define the tree genetically and not via ancestorship. Trying to go from one approach to the other is bound to be messy.
In the age of DNA sequencing all the good maps are done based on genetic data.
All of the organisms descended from a most recent common ancestor; we pick the MRCA semi-arbitrarily based on criteria like “sexual compatibility of descendants”.
“I know we’re both humpback whales, but he’s nowhere near as adventurous as I’d like him to be...”
I think species can be paraphyletic. If we sent a family of llamas into outer space and they evolved into Space Llamas, there would be no common ancestor which included all terrestrial L. glama but excluded L. astrollama.
There are various genetic issues that make individuals sterile. We don’t say that they are suddenly another species just because they are sterile and thus not sexually compatible.
And there are groups (like oribatid mites) where parthenogenesis is very common. No sex at all, though males occur. (Here’s a challenge: you think of anything common among vertebrates, then look for invertebrates (including single-cellular animals) for whom it’s not common. The sea-dwellers are very good for this search.)
Some people would tell you that only Homo sapiens exists as a species. Suppose a ‘species’ exists as a set of disjointed populations, which will never meet each other (or the probability of it happening is so much smaller than of them going extinct)...
No, that’s a clade or a monophyletic taxon. Most species are clades, but as solipsist points out not all species are necessarily clades, and most clades are not species.
No, it’s more specific because it is based on criteria like “sexual compatibility of descendants”.