It looks like this was the straw that broke my back—my first post ever on this site after lurking occasionally for upwards of 1.5 years. The explicit plea combined with a number of cringe-inducing misunderstandings of genetics / molecular biology in a bunch of previous posts finally got to me (I’m a grad student studying basic eukaryotic cell biology).
Here’s my thirty-thousand-foot take on the matter: you cannot in good conscience treat genetic information like a computer program, and this is where most misunderstandings and problematic logical leaps occur.
It is true that protein-coding gene expression is a process that kind of resembles an algorithm. You have pieces of the DNA, roughly analogous to long-term storage in this context, under particular circumstances getting transcribed into an RNA copy, roughly analogous to memory. Then you have a subset of that RNA that gets ‘read’ in three-base chunks with particular meanings: START-alanine-tryptophan-asparagine-glycine...arginine-STOP. The proteins made this way fold up and do whatever they do.
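(To make the resemblance concrete, here is a minimal sketch of that ‘reading’ step in Python. This is an illustration, not biology: the codon table is abbreviated to just the codons used in the invented example transcript, and everything the ribosome, tRNAs, and folding machinery actually do is ignored.)

```python
# Toy sketch of translation-as-algorithm: scan an mRNA string in 3-base
# chunks from the start codon until a stop codon. The codon table below is
# deliberately abbreviated to the handful of codons used in the example.
CODON_TABLE = {
    "AUG": "Met(START)", "GCU": "Ala", "UGG": "Trp", "AAU": "Asn",
    "GGU": "Gly", "CGU": "Arg", "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}

def translate(mrna: str) -> list[str]:
    """Return the amino acid sequence encoded between START and STOP."""
    start = mrna.find("AUG")
    if start == -1:
        return []  # no start codon: nothing gets made
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE.get(mrna[i:i + 3], "???")
        if residue == "STOP":
            break
        peptide.append(residue)
    return peptide

print(translate("GGAUGGCUUGGAAUGGUCGUUAAGG"))
# ['Met(START)', 'Ala', 'Trp', 'Asn', 'Gly', 'Arg']
```

Notice that the genetic code lives entirely in the CODON_TABLE data structure; nothing in the string itself forces that mapping, which foreshadows the point about the code’s arbitrariness below.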
There are four big problems with this way of thinking, though. The first is that coding for proteins is not all that DNA and other nucleic acids do, by a long shot. There are genes that never make a protein but make functional RNAs that tether things together. Others make regulatory RNAs that go on to affect gene expression. Other DNA that never gets ‘read’ by anything binds proteins and other complexes for all kinds of purposes I will get into below. Still other DNA consists of selfish replicating elements that exist in vast quantities.
The second is that DNA and RNA are not just information, a string of bits. They are physical objects moving around at dozens of meters per second, hitting other molecules, and these physical interactions are what drive their activity. It is not logical operations being performed on a bit string, it is chemical reactions and catalysis in actual three-dimensional space. DNA is full of functional elements that have nothing to do with coding for a protein: promoter elements with the right charge structure (a function of sequence, yes, but decidedly a physical attribute) to stick to the transcription factors and polymerases needed to pry apart the strands and synthesize RNA; attachment points for the fibers that pull freshly replicated DNA into daughter cells; regions of loose base pairing needed to first pry the strands apart and begin replication; and extra binding sites for transcription factors, away from genes, which soak up surplus molecules and keep them inactive. RNA molecules also have widely varying stability and half-lives before breaking down, again dependent on their shape and sequence in ways that are often the opposite of straightforward. This is all very physical, and depends on the interaction of the shape and charge of the nucleic acid molecules with the rest of the contents of the cell.
The third is that all this information content is completely context-dependent. Yes, almost everything on Earth has compatible genetic codes (animal mitochondria being a notable exception, incidentally—there is very nearly nothing in biology that doesn’t have some exception somewhere in the slew of diversity that exists). But if you put an entire yeast genome straight into a human cell, it wouldn’t be active and would make no protein—not even the selfish parasitic elements mooching along inside it would turn on. The presence of a gene reading frame is useless without the correct functional DNA elements next to it that, when colliding with the correct proteins and complexes, are able to properly stick to all the pre-existing machinery required to catalyze the production of other molecules from that template. While in any one branch of life the DNA elements and catalytic machinery have evolved together, they drift over evolutionary time. To make matters even worse, it appears that the genetic code itself is pretty arbitrary. There’s nothing chemical in the structure of a gene to tie it to a particular amino acid sequence, other than the code itself, which is entirely mediated by proteins (the aminoacyl-tRNA synthetases) that tie together free amino acids and transfer RNAs, and that are themselves made by genes according to the code. The code does not exist without the proteins.
EDIT: I have to amend this. I went looking through the literature, and while I could not find a reference to normal human promoters working in yeast or vice versa, I did find references to a very strong promoter from a human mononucleosis-causing virus that in baker’s yeast produces detectable but biologically insignificant amounts of protein, and works a bit better in a second species, ‘fission yeast’. It looks like it is sometimes possible for human and yeast promoters to cross-talk, but it is rare and insignificant compared to normal expression.
Fourth, there are features of organisms that have nothing to do with their genomes. Nothing in its genome tells a gram-negative bacterium to have two nested cell membranes. Instead, its particular complement of proteins allows that second membrane to be perpetuated and split along with the rest of the cell when it divides. Nothing in a human cell (we think) tells it that its internal membrane system should have particular proteins in it—instead, the functional internal membrane system pulls particular freshly synthesized proteins with particular features into itself as they are made. The more widely known epigenetic state is another example of this.
Altogether, this makes me extremely wary of any attempt to even talk about the ‘complexity’ of an organism based only on its genome. The size of the genome puts an upper bound on some sorts of complexity—the number of distinct proteins an organism can make, for example. But physical interactions, the presence of non-DNA molecules, and the previous shape/state of the organism are integral, carry vast quantities of information, and are the context that makes the DNA represent information in the first place, rather than just being an unstable polymer.
Your comment is really interesting, but I question this “vast quantities of information”. Suppose I want to implement a cell as a physics simulation and watch it live on in my computer.
The relevant laws of physics don’t take a huge number of bits to describe or implement on a computer.
If we take all the non-DNA molecules (excluding the proteins and RNA which are already coded for in the DNA), are there any really complex ones that would take a lot of bits to specify (compared to the DNA)? Or are there that many different kinds of molecules?
The exact shape and state of a cell do contain a lot of information, but surely most of that is irrelevant. For example, I can probably get away with approximating the initial shape of the simulated cell as a sphere or ellipsoid (or some other simple geometric shape), which takes many fewer bits to specify than its actual shape. Same thing with the distributions of molecules inside the cell. I probably don’t need information about the exact location of each molecule, but can approximate the distributions using relatively simple density gradients.
So even after reading your comment, I think it’s likely that most of the complexity (in the sense of number of bits needed to implement a simulation) of a cell is in its genome. Am I wrong in any of my guesses above, or missing something else?
You can implement A simulation. But that simulation having anything to do with any particular thing that has existed in the real world is harder.
Physics itself is not hard. Applying it to large numbers of particles is hard.
As for non-DNA molecules, there are all kinds of small-molecule metabolites that are constantly being converted back and forth, some of which are very important (they bind to the big molecules, are part of metabolism, and I have seen some brand-new research about particular proteins that only fold properly around a ‘nucleus’ formed by a particular six-carbon molecule). But the main point I was trying to make was more along the lines of the following (addressing your third point):
Shape is more detailed than general cell shape. There is fine structure in terms of internal fibers, distributions of molecules, impermeable barriers that segregate things, etc. Some of this (like the aforementioned membranes in bacteria, the self-perpetuating but never-made-from-scratch compartments that distill their components out of the general cell milieu) doesn’t necessarily have the DNA as a determinant, but rather as something that sets up the circumstances in which it is stable. Other things, like the amounts and simple distributions of molecules, all come from previous states, and most possible distributions don’t correspond to any real state (though doubtless many would be unstable and collapse down to one attractor or another that normally exists, once you instantiated them).
I have a hard time trying to think of the nature of the correspondence between these things and bits for a simulation besides positions of molecules, and I’m not sure in what context those bits are specified. A little help?
What you do is write a program that generates a set of particles and places them into the simulated cell, such that the resulting cell is viable and functionally equivalent to the original cell. Take the program and count its length in bits. If you haven’t programmed before you may not have much intuition about this. In that case think of it this way: if you have to describe the shape/internal structure/distributions (ETA: and structures) of molecules, in natural language and/or mathematical notation, in sufficient detail that someone else could create a physics simulation of the cell based on your description, how many bits would that take, and what fraction of those would be taken up by the DNA sequences?
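(As a toy version of that counting exercise, here is a sketch in which every number is an invented placeholder rather than a measurement: it assumes 2 bits per DNA base, and assumes the non-DNA state really can be summarized by a handful of parametric models of the sort guessed at above.)

```python
# Toy description-length comparison. All quantities are invented
# placeholders, not measurements. Assumes the genome costs 2 bits per base
# and each non-DNA feature is summarized by a parametric model whose
# parameters are stored at 32 bits apiece.
GENOME_LENGTH_BASES = 12_000_000   # roughly a yeast-scale genome
BITS_PER_BASE = 2                  # four possible bases: A, C, G, T

genome_bits = GENOME_LENGTH_BASES * BITS_PER_BASE

# Hypothetical parametric summaries of the non-DNA context, as
# (description, parameter count) pairs. An ellipsoid needs 3 axes; a
# density gradient might need a few parameters per molecular species.
non_dna_features = [
    ("cell shape as ellipsoid", 3),
    ("membrane composition fractions", 20),
    ("density gradients, 1000 species x 5 params", 5_000),
    ("metabolite concentrations, 1000 species", 1_000),
]
PARAM_BITS = 32

context_bits = sum(count for _, count in non_dna_features) * PARAM_BITS

print(f"genome:  {genome_bits:>12,} bits")   # 24,000,000 bits
print(f"context: {context_bits:>12,} bits")  # 192,736 bits
print(f"ratio:   {context_bits / genome_bits:.4f}")
```

Under those (very contestable) assumptions the context description is a rounding error next to the genome; the whole disagreement is over whether such parametric summaries can actually specify a viable cell state.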
Seconding RichardKennaway. Thanks for writing that! It would be great to have more scientifically literate people posting here. Like Ilya Shpitser, a student of Judea Pearl who was attracted by Eliezer’s writings about causality and is trying to give a more informed perspective.
Do you know of any good tutorials on DNA structure covering the elements you describe?
I’m a member of a forum for Jaguar owners. Somebody wrote a book, “XJ6, Bumper to Bumper”, where he walks you through the entire car, every major part, from front to back.
That kind of thing for DNA would be extremely helpful. Something that shows all the different kinds of functional blocks within DNA, and how they’re all pieced together. When they say a gene, is it a pattern of DNA starting at base pair X? Starting at functional block X? How does copy number variation work? Are all the copies in the same place?
Your post had a bunch more types of functional units. I’d love to have a nice PBS video with pretty colors and a soothing narrator giving me the tour, and pointing out those functional units and others as we stroll down a DNA strand.
I’m not ignorant on purpose. I’ve tried to look for this kind of information, but I’ve never been satisfied with the material I’ve found. I haven’t found anything really tied to DNA structure, only a “this, that, and the other” verbal listing.
A biology textbook? Perhaps that big brown one that says “Biology” on the front.
Textbooks are great!
Molecular Biology of the Cell (Alberts) is a cell biology textbook that gets a lot of recommendations; when you go to a talk where someone has figured out a wrinkle or complication in some process, they will often call the general consensus “the Alberts version”. It has all kinds of information on general cell biology, with the material on DNA coding and function spread across a couple of chapters.
On the rather more technical end, I loved “The Origins of Genome Architecture” by Michael Lynch. Half of it is a massive review article on the features found in eukaryotic genome structure and how they vary across the tree of life, with a major focus on evolutionary biology and the mechanisms by which they change over time, and with a bibliography that must be a fifth of the book. Its other half is basically the author pushing his idea that, for eukaryotes and multicellular organisms in particular, most genome architecture got the way it is via non-adaptive processes that have more to do with the sorts of mutation that are possible than with their end fitness results. It makes a compelling argument and shows all kinds of details of DNA function I did not know about until I read it, but it gets very technical very fast and is written in many parts like a scientific paper, with biologists in mind.
An intro biology textbook will not cover DNA in the detail CellBioGuy touched on. You’d need to read an intro Bio book and then maybe intro Molecular Biology, and then find a book focusing on DNA.
Even better.
Thank you for that, it deserves an award for the most useful first post ever.
So, as a non-biologist who just has a layman’s knowledge of these things, the idea of reconstructing an organism from nothing but its DNA would be like reconstructing a car engine from the crankshaft? The causal chains that make an engine go all pass through the crankshaft: it moves the pistons up and down, drives the fuel pump to bring petrol to the cylinders, drives the camshafts that open and close the valves at the right time, and receives the energy of the power stroke and passes it to the transmission. So obviously the crankshaft is how the whole thing works, and complete knowledge of the crankshaft must in principle let us reconstruct the whole engine.
(I am indebted to my colleague William Powers for the crankshaft as an illustration of how to misunderstand cyclic causal chains.)
Interesting analogy. I like it.
It is worth noting that you can bootstrap a genome with non-DNA machinery that is merely good enough, and get it to work and eventually equilibrate as it makes its own native machinery. That’s how Craig Venter and company ‘created’ their ‘artificial’ bacterium a while back. They chemically synthesized a near-exact copy of one of the smallest and simplest known bacterial genomes (no small feat, by the way; chemical synthesis of DNA is expensive and error-prone during the assembly of smaller pieces into the full chromosome), with a few watermarks to show it was theirs, and shoved it into a related bacterial species (again, no small feat). They were close enough that the other species’ proteins were able to run the new genome’s genes, getting a causal cycle going, and after a few dozen generations all the old proteins/metabolites/etc. had been diluted out and replaced with those made by the artificial genome. This of course basically requires a related living system, on top of something like the basic definition of the genetic code.
The fact that nongenetic information is perpetuated in the form of epigenetic state, physical shape and arrangement, etc. adds another loop, one that often doesn’t feed through the genome at all, but rather relies on the genome and the proteins it codes for to set up a landscape of attractors: which states these factors can occupy, and which states can lead to which others along paths through which the cell stays alive.
If you could somehow synthesize compatible non-DNA machinery capable of interacting with a particular genome, you might be able to get a cell going. But how do you do that given just the genome and the genetic code to go on? Once you do, does the system fall into any of the normal stable attractors? And how much of that organism consists of things like the double membrane of a gram-negative bacterium, passed on from its parents, with only its capability for replication really represented by the genome or its products? All membranes come from previously existing membranes, things like that. Animal cells, with all the fun attractors they fall into, from every different cell type to cancer to all of their intercellular interactions, are examples of cells that have their history as a vital component of their identity.
How do most of these objections not apply also to computer programs? Computer programs are physical objects, and what the program actually does depends entirely on the physical machinery that runs it.
I would say the main difference is that computer systems work to embody the same bit string in widely varying substrates and perform the same logical operations on it. It doesn’t matter if a program is stored on magnetic domains in a tape drive and executed in vacuum tubes, or if it is stored in electrons trapped in flash memory and executed in a 22 nanometer process CPU, the end result of a given set of logical operations is the same. In biology though there really isn’t a message or program you can abstract away from the molecules bouncing around, there is only one level of abstraction. You cannot separate ‘hardware’ and ‘software’.
Assuming “bit string” means “machine code”, this isn’t true. The same machine code will not result in the same logical operations being performed on all computers. It may not correspond to any logical operations at all on other computers. And what logical operations are carried out depends entirely on “the molecules bouncing around” in the computer. You aren’t making DNA sound different from machine code at all.
Good point regarding machine code, I wasn’t thinking at that level of detail. But at this point the similarity is metaphorical at best.
The metaphor fails to go far enough, I think, because the non-DNA context and the DNA are working at the same level. Both are objects with shapes that interact. Yes, software is always embodied in matter, and yes, processing happens in matter rather than in some kind of abstract logical space, but part of the point of most computers is that the same matter can carry all kinds of different patterns. In biology the DNA builds the context and the context builds the DNA, and each alters the other when they interact. The interactions also produce effects that much more closely resemble physics-model-type actions—of the sort that can be modeled via differential equations once you are at a large enough scale that the single-molecule variances average out—rather than really embodying particular operations or logics.
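(A minimal sketch of the differential-equation picture meant here: deterministic mass-action dynamics for one gene’s mRNA and protein, with invented rate constants, valid only at copy numbers high enough that single-molecule noise averages out.)

```python
import numpy as np
from scipy.integrate import solve_ivp

# One gene, two variables: mRNA (m) and protein (p). Rate constants are
# invented for illustration; this is the large-number regime where the
# single-molecule variances mentioned above have averaged out.
k_tx, k_tl = 2.0, 10.0   # transcription and translation rates
g_m, g_p = 0.5, 0.1      # mRNA and protein degradation rates

def dynamics(t, y):
    m, p = y
    dm = k_tx - g_m * m       # transcribe, degrade mRNA
    dp = k_tl * m - g_p * p   # translate from mRNA, degrade protein
    return [dm, dp]

sol = solve_ivp(dynamics, t_span=(0.0, 50.0), y0=[0.0, 0.0],
                t_eval=np.linspace(0.0, 50.0, 6))
for t, m, p in zip(sol.t, sol.y[0], sol.y[1]):
    print(f"t={t:5.1f}  mRNA={m:6.2f}  protein={p:7.2f}")
# Approaches the steady state m* = k_tx/g_m = 4, p* = k_tl*m*/g_p = 400.
```

There is no instruction pointer anywhere in that description, just rates and concentrations relaxing toward attractors, which is the sense in which the behavior resembles physics more than program execution.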
I get the feeling there is an inferential gap happening here...
It is true that protein-coding gene expression is a process that kind of resembles an algorithm.

That’s because it is an algorithm. What else would it be?
It is not logical operations being performed on a bit string, it is chemical reactions and catalysis in actual three-dimensional space.

Of course those are logical operations being performed on a bit string; again, what else would they be? Magical uncomputable non-functions?
Your major point—which I agree with—seems to be that there are a lot of hard-to-quantify factors and influences that go into determining the result, and that a focus on just DNA does not capture those interactions. But that just means a (mathematical/algorithmic) description that focuses merely on DNA would be on some levels inadequate, and that an actually complete description of a cell may take additional information. However, that doesn’t mean the more thorough description wouldn’t also be an algorithm, a program that the cell executes / that describes the cell’s behavior. It would merely be a more complex one. Even that is debatable:
I agree that a long inert DNA polymer wouldn’t do much on its own, just as the same information on some HDD wouldn’t. However, from an informational perspective, if you found a complete set of a mammoth’s DNA and assorted “codings”, would that be enough for a resourceful agent to reconstitute a mammoth specimen? If the agent is also aware of some basic data about cellular structures—certainly. But I’d reckon that given enough time, even an agent with little such knowledge would figure out much of the DNA’s purpose, and eventually be able to recreate a mammoth oocyte to fill with the DNA. If that were indeed so, that would mean that from an informational perspective, the overhead was not shown to be strictly necessary to encode the “mammoth essence”—at least most of it.
I think you could reword my point to be something like: by the time you are doing something algorithmically/computationally that really recapitulates the important things happening in a cell, you are doing something more akin to a physics simulation than to running Turing machines on DNA tape. At that point, when your ‘decompression algorithm’ is physics itself, calling it an algorithm seems a little out of place.
In another post just now I wrote that a genome and its products also define a whole landscape of states, not one particular organism. I can’t help but wonder just how huge that space of living states is, and how many of them correspond to normal cell types or cell states in such a mammoth, and how intractable it would be to find that one set of states that corresponds to ‘mammoth oocyte’ and produces a self-perpetuating multicellular system.
In your opinion, are there any physical processes which are not algorithms?
(Qualifier so I’m not drowning in some reality fluid) I’d say that there aren’t any physical processes that cannot be completely described as algorithms.
I can’t help but wonder just how huge that space of living states is, and how many of them correspond to normal cell types or cell states in such a mammoth, and how intractable it would be to find that one set of states that corresponds to ‘mammoth oocyte’ and produces a self-perpetuating multicellular system.

I wouldn’t overestimate its additional complexity, especially given that most of it ultimately derives from the relationships between different areas of the DNA sequence itself. For the predictability of the results of varying slightly different states, consider e.g. the success and predictability of results (on a viral level, not a clinical-results level) in manipulating lentiviruses and AAVs; see for example this NEJM paper.
No physics-level simulation needed to accurately predict what a cell will do when switching out parts of its genome.
If it were otherwise (if you think about it), the whole natural virus ecosystem itself would break down.
EDIT: A different example that comes to mind: insulin, which is synthesized in a laboratory strain of Escherichia coli bacteria that has been genetically altered with recombinant DNA to produce biosynthetic human insulin. No surprises there either.
My main response there is that in those situations, you are making one small change to a pre-existing system using elements that have previously been qualitatively characterized. In the case of the viral gene therapy, you are adding to the cell a DNA construct consisting of a crippled virus that can’t actually replicate in normal cells (for insertion purposes), a promoter element that turns on an adjacent reading frame in any human cellular context, and the reading frame for the gene in question, which has had all the frills and splice sites removed. In the case of insulin in bacteria, you are adding to the bacteria the human insulin reading frame and a few processing-enzyme reading frames, each attached to a constantly-on bacterial promoter element. The overall systems of the cells are left intact, and you are just piggybacking on them.
You can do things like this because in many living systems you have elements that have been isolated and tested, of which you can say “if I stick this in, it will do X”. That has, for the most part, long been figured out empirically, by putting elements into cells whole, truncated, or mutated in some way, and seeing which ones work and which ones don’t. These days, examining their chemical structures, we have physical and chemical explanations for a bunch of them and how they work, and we are starting to get better at predicting them in particular organismal contexts, though it’s still much, much harder in multicellular creatures with huge genomes than in those with compact genomes*.
When I said that physics-like things were needed, I was referring more to a situation in which you do not have a pre-existing living thing and are trying to work out what an element does from its sequence alone. When you can test things in the correct context, you can start figuring out which proteins and DNA elements are important for what, and can leap over this and tell what matters even before you really understand the physical reasons. If you were starting from just the DNA sequence and didn’t really understand what the non-DNA context for it was, or possibly even how the DNA helps produce the non-DNA context, you would have a much less tractable problem.
*(It’s worth noting that the ease of analysis of noncoding elements differs wildly between organisms. Bacteria and yeast have compact promoter elements with DNA sequences of dozens to hundreds of base pairs each, often with easily identifiable protein binding sites, while in animals a promoter element can be strewn in chunks across hundreds of kilobases (though several kilobases is more typical) and is usually defined as ‘the smallest piece of DNA we could include and still get it to express properly’, with only a subset of computationally predicted protein binding sites actually turning out to be functionally important. A yeast centromere element, for fiber attachment to chromosomes during cell division, is a precisely defined 125-base-pair sequence that assembles a complex of anchoring proteins on itself, while a human centromere can be the size of an entire yeast genome and is a huge array of short repeats that might just bind the fiber-anchoring proteins a little bit better than random DNA does. Noncoding elements get larger and less straightforward much faster than coding elements as genome size increases.)
EDIT: As for viral ecosystems, viruses can hop from species to species because related species share a lot of cellular machinery, even when the splits happened hundreds of millions of years ago, and the virus just has to work well enough (it will immediately start adapting to its new host). Seeing as life is more than three gigayears old, though, there are indeed barriers that viruses cannot cross. You will not find a virus that can infect both a bacterium and a mammal, or a mammal and a plant. When they hop from species to species or population to population, the differences can render some species resistant or change the end behavior of the virus, and you get things like the simian immunodeficiency virus hardly affecting chimps, while HIV, separated from it by only a century, kills its human host.
If you were starting from just the DNA sequence and didn’t really understand what the non-DNA context for it was, or possibly even how the DNA helps produce the non-DNA context, you would have a much less tractable problem.

I can’t help but feel this is related to (what I perceive as) a vast overrating of the plausibility of uploading from cryonically preserved brain remnants. It’s late at night and I’m still woozy from finals, but it feels like someone who’s discovered they enjoy, say, classical music, without much grasp of music theory or even the knowledge of how to play any instruments, figuring it can’t be too hard to just brute-force a piano riff of, say, the fourth movement of Beethoven’s 9th if they just figure out by listening which notes to play. The mistake being made is a subtler and yet more important one than simply underestimating the algorithmic complexity of the desired output.