I didn’t find your figures in the (enormous) referenced papers. Without them, it isn’t clear what the figures mean or where the 73% comes from.
Most proteins have not been discovered—and there is probably a bias towards discovering the ones that are shared with eukaryotes—which would distort the figures in favour of finding older genes.
Also, life started around 3.7 billion years ago. And it seems rather dubious to measure the rate of information change within evolution as the rate of information change within bacterial genomes; that doesn’t account for the information present in the diversity of life.
In the Levitt paper, 64% is the fraction of single-domain-architecture (SDA) protein families found in at least two of the three groups: viruses, prokaryotes, and eukaryotes (Figure 3). This is my (very close) approximation for the fraction of families in eukaryotes or prokaryotes found in both eukaryotes and prokaryotes, which isn’t reported. 84% is computed from that information, plus the caption of Figure 3 saying that prokaryotes contain 88% of SDA families. 73% is computed from all of that information.
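For concreteness, here is one way the two derived figures can be reproduced from the reported numbers. It assumes every SDA family occurs in prokaryotes, eukaryotes, or both (my simplification, not something the paper states), so treat it as a sketch of the arithmetic rather than the paper’s own calculation:

```python
# A sketch, not the paper's own calculation: assume every SDA family occurs in
# prokaryotes, eukaryotes, or both (viruses ignored), so |P ∪ E| = 100%.
shared = 0.64    # families found in both prokaryotes and eukaryotes (approximated above)
in_prok = 0.88   # Figure 3 caption: prokaryotes contain 88% of SDA families

in_euk = 1.0 - in_prok + shared   # inclusion-exclusion: ≈ 76%

print(f"prokaryote families also in eukaryotes: {shared / in_prok:.0%}")   # ≈ 73%
print(f"eukaryote families also in prokaryotes: {shared / in_euk:.0%}")    # ≈ 84%
```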
There is no bias towards discovering genes shared with eukaryotes in ordinary sequencing. We sequence complete genomes. Almost all of the bacterial genes known come from these whole-genome projects. We’ve sequenced many more bacteria than eukaryotes. Bacterial genomes don’t contain much repetitive intergenic DNA, so you get nice complete genome assemblies.
Life starting 3.7 billion years ago—could be. Google’s top ten results show claims ranging from 2.7 to 4.4 billion years ago. Adding that 0.7 billion years could make the information-growth curve more linear, and remove one exponentiation in my analysis.
Let’s just say I’m measuring the information in DNA. Information in “the diversity of life” is too vague. I don’t want to measure any information that an organism or an ecosystem gains from the environment by expressing those genetic codes.
Out of curiosity, has the protein problem yet been mathematically formalized so that it can be handed over to computers? That is, do we understand molecular dynamics well enough to automatically discover all possible proteins, starting from the known ones?
We could list them in order, if that’s what you mean. It would be a Library of Babel. Could we determine their structures? No. Need much much more computational power, for starters.
That’s what I intended to refer to—the tertiary structure. Have they thrown the full mathematical toolbox at it? I’ve heard that predicting it is NP-complete, in that it’s order-dependent, etc., but have they thrown all the data-mining tricks at it to find out whatever additional regularity constrains the folding so as to make it less computationally intensive?
The reason I don’t immediately assume the best minds have already considered this is that there seems to be some disconnect between the biological and mathematical communities—just recently I heard that they got around to using the method of adjacency-matrix eigenvectors (i.e. what Google’s PageRank uses) for identifying critical species in ecosystems.
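For what it’s worth, that eigenvector method is ordinary eigenvector centrality, computable by power iteration. A minimal sketch with a made-up four-species interaction matrix (both the matrix and the resulting scores are purely illustrative):

```python
import numpy as np

# Toy eigenvector-centrality calculation (the PageRank-flavoured idea above).
# The 4-species interaction matrix is invented purely for illustration:
# a 1 means the two species interact.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

v = np.ones(4)
for _ in range(100):          # power iteration converges to the principal eigenvector
    v = A @ v
    v /= np.linalg.norm(v)

print(np.round(v, 3))         # larger entries = more "central" (critical) species
```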
Predicting the ground state of a protein is NP-hard. But nature can’t solve NP-hard problems either, so predicting what actually happens when a protein folds is merely in BQP.
I would expect most proteins found in natural organisms to be in some sense easy instances of the protein folding problem, i.e. that the BQP method finds the ground state. Because the alternative is getting stuck in local minima, which probably means it doesn’t fold consistently to the same shape, which is probably an evolutionary disadvantage. But if there are any remaining differences, then for the purpose of protein structure prediction it’s actually the local minimum that’s the right answer, and the NP problem that doesn’t get solved is irrelevant.
And yes there are quantum simulation people hard at work on the problem, so it’s not just biologists. But I don’t know enough of the details to say whether they’ve exhausted the conventional toolbox of heavy-duty math yet.
This is a nice insight.
That explains why I’ve seen descriptions of folding prediction algorithms that run in polynomial time, on the order of n^5 or less with n = number of amino acids in primary chain.
I wanted to add that many proteins found in nature require chaperones to fold correctly. These can be any other molecules—usually proteins, ribozymes, or lipids—that influence the folding process to either assist or prevent certain configurations. They can even form temporary covalent bonds with the protein being folded. (Or permanent ones; some working proteins have attached sugars, metals, other proteins, etc.) And the protein-making machinery in the ribosomes has a lot of complexity as well—amino acid chains don’t just suddenly appear and start folding.
All this makes it much harder to predict the folding and action of a protein in a real cell environment. In vivo experiments can’t be replaced by calculations without simulating a big chunk of the whole cell on a molecular level.
Why do you think nature can’t solve NP-hard problems? When you dip a twisted wire with 3D structure into a dish of liquid soap and water, and pull it out and get a soap film, didn’t it just solve an NP problem?
All of the bonds and atoms in a protein are “computing” simultaneously, so the fact that the problem is NP in terms of the number of molecules isn’t a problem. I don’t understand BQP, and so can’t comment on that.
Incidentally, your observation about consistent folding is often right, but some proteins have functions that depend on them folding into different shapes under different conditions. Usually these shapes are similar. I don’t know if any proteins routinely fold into two very different shapes.
Oh no. Ohhhhh no. Not somebody trying to claim that nature solves problems in NP in polynomial time because of bubble shape minimization again!
Like Cyan just said, nature does not solve the NP-hard problem of finding the globally optimal configuration. It just finds a local optimum, which is already known to be computationally easy! Here’s a reference list.
The more convoluted the wire structure, the more likely the soap film is to be in a stable sub-optimal configuration.
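To make the local-versus-global point concrete with a toy example (this is not soap-film physics, just an illustration of getting stuck): greedy descent on a simple one-dimensional “energy” ends up in whichever basin it starts in.

```python
# Toy illustration only (nothing to do with real soap-film physics):
# a 1-D "energy" with a shallow minimum near x ≈ +1.9 and a deeper one near x ≈ -2.1.
def energy(x):
    return 0.25 * x**4 - 2 * x**2 + 0.5 * x

def descend(x, step=1e-3, iters=20_000):
    for _ in range(iters):
        grad = (energy(x + 1e-6) - energy(x - 1e-6)) / 2e-6   # numerical gradient
        x -= step * grad
    return x

print(descend(3.0))    # gets stuck near the local minimum, x ≈ 1.93
print(descend(-3.0))   # finds the global minimum, x ≈ -2.06
```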
And there are at most N^2 of them, so that doesn’t transform exponential into tractable. It’s not even a Grover speedup (2^N → 2^(N/2)), which we do know how to get out of a quantum computer.
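Rough numbers on that point, since dividing an exponential search by N^2 (or even taking a Grover-style square root) still leaves an exponential:

```python
# Back-of-the-envelope comparison for a chain of N units.
for N in (20, 50, 100):
    brute = 2 ** N              # exhaustive configuration search
    parallel = brute / N ** 2   # N^2 things "computing" at once
    grover = 2 ** (N / 2)       # Grover-style quadratic speedup
    print(f"N={N:3d}   2^N={brute:.1e}   /N^2={parallel:.1e}   2^(N/2)={grover:.1e}")
```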
And, interestingly enough, Slashdot just ran a story on progress in protein folding in the nucleus of a cell.
So: with two data-points and a back-of-the-envelope calculation, you conclude that DNA-evolution has been slowing down?
It seems like pretty feeble evidence to me :-(
I should add that—conventionally—evolutionary rates are measured in ‘darwins’, and are based on trait variation (not variation in the underlying genotypes) because of how evolution is defined.
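For readers who haven’t met the unit: a darwin measures proportional change in a trait per million years. A minimal sketch of the standard definition (the example numbers are made up):

```python
import math

def rate_in_darwins(x_start, x_end, million_years):
    """One darwin = a trait changing by a factor of e per million years (standard definition)."""
    return (math.log(x_end) - math.log(x_start)) / million_years

# e.g. a hypothetical tooth dimension going from 10 mm to 12 mm in 2 million years:
print(rate_in_darwins(10.0, 12.0, 2.0))   # ≈ 0.09 darwins
```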
Dude. It’s an idea. I said not to take my conclusions too seriously. This is not a refereed journal. Why do you think your job is only to find fault with ideas, and never to play with them, try them out, or look for other evidence?
Different people measure evolutionary rate or distance differently, depending on what their data is. People studying genetic evolution never use darwins. The reason for bringing up genomes at all in this post is to look at the shape of the relationship between genome information and phenotypic complexity; so starting by measuring only phenotypes would get you nowhere.
Inaccurate premise: I don’t think my job is “only to find fault with ideas”. When I do that, it’s often because that is the simplest and fastest way to contribute. Destruction is easier than construction—but it is pretty helpful nonetheless. Critics have saved me endless hours of frustration pursuing bad ideas. I wish to pass some of that on.
In this particular sub-thread, my behavior is actually fairly selfish: if there’s reasonable evidence that DNA-evolution has been slowing down, I would be interested in hearing about it. However, I’m not going to find such evidence in this thread if people get the idea that this point has already been established.
I don’t have strong evidence that DNA evolution has been slowing down in bacteria. I presented both evidence and an explanation for why it has been slowing down in eukaryotes. That is all that matters for this post, because the point of referring to DNA evolution here has to do with how efficiently evolution uses information in the production of intelligence. Eukaryotes are more intelligent than bacteria.
So I’ve read the paper. According to it (and it seems very plausible to me), we have some reason to suspect that we seriously underestimate the number of SDA families; the most widely distributed SDA families are the most likely to be known (those often happen to occur in multiple groups), and the less widely distributed families are the least likely to be known (those often happen to occur in only one group).
The actual percentage of shared SDA families is almost certainly lower than what we can naively estimate from current data. I don’t know how much lower. Maybe just a few percent, maybe a lot.
Not mentioned in the paper, but quite obvious, is the huge amount of horizontal gene transfer happening on evolutionary timescales like these (especially with viruses). It also increases apparent sharing and makes families appear older than they really are.
A third effect is that an SDA family that diverged a long time ago might be unrecognizable as a single family, while one that developed more recently is still recognizable as such. This can only increase the apparent age of SDA families.
So there are at least three effects of unknown magnitude but known direction. If any of them is strong enough, it invalidates your hypothesis. If all of them are weak, your hypothesis still relies a lot on the dating of the eukaryote-prokaryote split.
Imagine you have 100 related organisms in a bag with PRO written on it. You take 10 and put them in a bag with EUC written on it. Then you sequence everything—and calculate what fraction of the total number of genes are found in both bags—and you come back with 64%.
That this is less than 100% doesn’t represent the genomes in the EUC bag changing. It just means that you selected a small sample from the PRO bag. Had you chosen 5, the figure would have been smaller; had you chosen 50, it would have been bigger.
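A quick simulation of the bag analogy. The gene-sharing model below is invented purely for illustration; the only point is that the fraction of families found in both bags moves with the sample size:

```python
import random

random.seed(0)
N_ORG, N_FAM = 100, 1000

# Invented model: each gene family is carried by a random subset of the 100 organisms.
carriers = [set(random.sample(range(N_ORG), random.randint(1, N_ORG)))
            for _ in range(N_FAM)]

def shared_fraction(euc_size):
    """Move euc_size organisms into the EUC bag; count families present in both bags."""
    euc = set(random.sample(range(N_ORG), euc_size))
    pro = set(range(N_ORG)) - euc
    in_both = sum(1 for fam in carriers if fam & euc and fam & pro)
    return in_both / N_FAM

for size in (5, 10, 50):
    print(size, round(shared_fraction(size), 2))
```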
At least 3.7 billion years ago, then.
I too was talking about information in DNA. The number of species influences the quantity of information present in the DNA of an ecosystem—just as rolling a die 100 times supplies more information than rolling it once.
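In Shannon’s terms (a standard calculation, not a figure from this thread), each fair roll contributes log2(6) bits and independent rolls add:

```python
import math

bits_per_roll = math.log2(6)                # ≈ 2.58 bits of information per fair roll
print(bits_per_roll, 100 * bits_per_roll)   # one roll vs. 100 independent rolls (≈ 258.5 bits)
```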