Oversimplification when generalizing from DNA?
I changed the old topic because it was misleading and did not convey the questioning intention of this post. Sorry about that.
The point of this post is to examine the proposition that people underestimate the complexity of living beings when they judge it from the complexity of the functional DNA in the genome alone. I don’t have sufficient information to answer the question, but I have just about enough information to ask it, so if you can do a better job drawing a conclusion, that would be great. Pointing out technical errors would be appreciated too.
Genome
The genome contains the DNA, which contains each individual gene and serves as the currency of the organism’s inherited qualities. That is, evolutionary theories calculate around the frequency of genes, creating formalisms, mathematical laws and so forth to predict and understand the phenomenon of natural selection and reproduction.
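To make “calculating around the frequency of genes” concrete, here is a minimal sketch of a haploid selection model. The fitness values are made up purely for illustration; this is the simplest textbook-style recurrence, not any particular published model:

```python
# Toy haploid selection model: track the frequency of one allele over
# generations. w_a and w_b are hypothetical relative fitnesses.
def next_freq(p, w_a=1.05, w_b=1.00):
    """One generation of selection: returns the new frequency of allele A."""
    mean_w = p * w_a + (1 - p) * w_b
    return p * w_a / mean_w

p = 0.01  # allele A starts rare
for generation in range(200):
    p = next_freq(p)
print(round(p, 3))  # a 5% fitness advantage drives A close to fixation
```

The point is only that once you phrase inheritance in terms of gene frequencies, predictions fall out of very small formulas.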
Nothing wrong with this so far. But when it comes to actually thinking about the genes and the protein sequences, it seems to me that it is often forgotten that the entire cell, which contains the DNA, the mitochondrial DNA and the intracellular machinery, is part of this replicatory system.
To draw an unreliable surface analogy, you could compare the replicatory process to a cellular automaton: think of the system as a generator which accepts a string of numbers to operate on. In this surface analogy the entire system is the final organism, the product of the automaton; the individual genes are the fed-in string of numbers; and the other parts of the cell, DNA excluded, function as the generator which accepts the string. This analogy is poor because the distinction isn’t real, but it serves to illustrate a point: if you have just the string of the genome that is contained in the DNA of a human being, you cannot make a human being. Something is missing: the machinery inside the cells, the mitochondrial DNA, the initial position, which is a fertilized ovum in a suitable environment like the womb.
The point of the post and the proposition is the following:
The genome (mathematically) contains a smaller amount of data than is actually required for an organism as complex as the phenotype produced with its help to develop. To illustrate with the previous surface analogy of a generator and a feed: the complexity of the generator contributes, along with the fed string, to the complexity of the final product. And this leads to cognitively oversimplifying the complexity of an organism. But that analogy is inaccurate, and this proposition could be too.
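One way to make the generator/feed intuition concrete is with description lengths: a fixed generator can turn a short input into a far larger output, so the output’s size tells you little on its own; what matters is (generator + input). A toy sketch, using a substitution rule I made up for illustration (nothing biological about it):

```python
# A tiny 'generator': a fixed rewrite rule applied repeatedly to a seed string.
# Each step replaces every 'a' with 'ab' and every 'b' with 'ba'.
def generator(seed: str, steps: int = 8) -> str:
    s = seed
    for _ in range(steps):
        s = "".join({"a": "ab", "b": "ba"}[c] for c in s)
    return s

seed = "a"
out = generator(seed)
print(len(seed), len(out))  # input is 1 symbol; output is 2**8 = 256 symbols
```

The 256-symbol output is not “256 symbols complex”: everything needed to reproduce it is the one-symbol seed plus the few lines of generator. The analogous question for organisms is how much of their apparent complexity lives in the “generator” (the cell) versus the “feed” (the genome).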
So can you tell if this proposition is correct or incorrect?
I don’t have sufficient knowledge of biology, evolutionary theory or mathematics, and I pretty much can’t tell if this is true or completely false, but I’m intuitively anticipating a systematic underestimation of the complexity of organisms relative to the complexity of their genomes on these grounds. Note, however, that I’m not saying people think organisms are less complicated than they are; rather, when extrapolating mathematically from genetic complexity, they would underestimate in their predictions. So what do you think?
It looks like this was the straw that broke my back—my first post ever on this site after lurking occasionally for upwards of 1.5 years. The explicit plea combined with a number of cringe-inducing misunderstandings of genetics / molecular biology in a bunch of previous posts finally got to me (I’m a grad student studying basic eukaryotic cell biology).
Here’s my thirty-thousand-foot overall take on the matter: you cannot in good conscience treat genetic information like a computer program, and this is where most misunderstandings and problematic logical leaps occur.
It is true that protein-coding gene expression is a process that kind of resembles an algorithm. You have pieces of the DNA, roughly analogous to long-term storage in this context, under particular circumstances getting transcribed into an RNA copy, roughly analogous to memory. Then you have a subset of that RNA that gets ‘read’ in 3-base chunks with particular meanings: START-alanine-tryptophan-asparagine-glycine...arginine-STOP. The proteins made this way fold up and do whatever they do.
There are four big problems with thinking from this approach though. The first being that coding for proteins is not all that DNA and other nucleic acids do by a long shot. There are genes that never make a protein but make functional RNAs that tether things together. Others make regulatory RNAs that go on to affect gene expression. Other DNA that never gets ‘read’ by anything binds proteins and other complexes for all kinds of purposes I will get into below. Still other DNA consists of selfish replicating elements that exist in vast quantities.
The second is that DNA and RNA are not just information, a string of bits. They are physical objects that are moving around at dozens of meters per second, hitting other molecules, and these physical interactions are what drive their activity. It is not logical operations being performed on a bit string; it is chemical reactions and catalysis in actual three-dimensional space. DNA is full of functional elements that have nothing to do with coding for a protein, from promoter elements that have the right charge structure (a function of sequence yes, but decidedly a physical attribute) to stick to the transcription factors and polymerases needed to pry apart the strands and synthesize RNA, to attachment points for fibers that pull freshly replicated DNA into daughter cells, to areas of loose base pairing needed to first pry the strands apart and begin replication, to extra binding sites for transcription factors away from genes which soak up extra molecules and keep them inactive. RNA molecules also have widely varying stability and half-lives before breaking down, again dependent upon their shape and sequence in ways that are often the opposite of straightforward. This is all very physical and depends on the interaction of the shape and charge of the nucleic acid molecules with the rest of the contents of the cell.
The third is that all this information content is completely context-dependent. Yes, almost everything on Earth has compatible genetic codes (animal mitochondria being a notable exception, incidentally—there is very nearly nothing in biology that doesn’t have some exception somewhere in the slew of diversity that exists). But if you put an entire yeast genome straight into a human cell, it wouldn’t be active and would make no protein—not even the selfish parasitic elements mooching along inside it would turn on. The presence of a gene reading frame is useless without the correct functional DNA elements next to it that, when colliding with the correct proteins and complexes, are able to properly stick to all the pre-existing machinery that is required to catalyze the production of other molecules from that template. While in any one branch of life the DNA elements and catalytic machinery have evolved together, they drift over evolutionary time. To make matters even worse, it appears that the genetic code itself is pretty arbitrary. There’s nothing chemical in the structure of a gene to tie it to a particular amino acid sequence, other than the code itself which is entirely mediated by proteins* which tie together free amino acids and transfer-RNAs and that are themselves made by genes according to the code. The code does not exist without the proteins.
EDIT: I have to amend this. I went looking through the literature and while I could not find reference to normal human promoters working in yeast or vice versa I did find references to a very strong promoter from a human mononucleosis-causing virus that in bread yeast produces detectable but biologically insignificant amounts of protein, and works a bit better in a second ‘fission yeast’ species. Looks like it is sometimes possible for human and yeast promoters to cross-talk, but it is rare and insignificant compared to normal expression.
Fourth, there are features of organisms that have nothing to do with their genomes. Nothing in its genome tells a gram-negative bacterium to have two nested cell membranes. Instead, its particular complement of proteins allows that second membrane to be perpetuated and split along with the rest of the cell when it divides. Nothing in a human cell (we think) tells it that its internal membrane system should have particular proteins in it—instead the functional internal membrane system pulls particular freshly-synthesized proteins with particular features into itself as they are made. The more widely-thought-of epigenetic state is another example of this.
All together, this makes me extremely wary of any attempt to even talk about the ‘complexity’ of an organism based only on its genome. The size of the genome puts an upper bound on some sorts of complexity—the number of proteins that an organism can make, for example. But physical interactions, the presence of non-DNA molecules, and the previous shape/state of the organism are integral, carry vast quantities of information, and are the context that makes the DNA represent information in the first place rather than just being an unstable polymer.
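As a back-of-envelope illustration of “the size of the genome puts an upper bound on some sorts of complexity”: the figures below are rough, commonly cited estimates, used only to get a sense of scale.

```python
import math

# Rough scale figures (approximate, for illustration only):
haploid_bases = 3.1e9         # ~3.1 billion base pairs in a human haploid genome
bits_per_base = math.log2(4)  # 4 possible bases -> 2 bits each, before compression

total_bits = haploid_bases * bits_per_base
print(f"~{total_bits / 8 / 1e6:.0f} MB upper bound on raw genomic information")
```

That works out to under a gigabyte of raw sequence, which is exactly why the question of how much information lives in the non-DNA context matters so much.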
Your comment is really interesting, but I question this “vast quantities of information”. Suppose I want to implement a cell as a physics simulation and watch it live on in my computer.
The relevant laws of physics don’t take a huge number of bits to describe or implement on a computer.
If we take all the non-DNA molecules (excluding the proteins and RNA which are already coded for in the DNA), are there any really complex ones that would take a lot of bits to specify (compared to the DNA)? Or are there that many different kinds of molecules?
The exact shape and state of a cell do contain a lot of information, but surely most of that is irrelevant. For example, I can probably get away with approximating the initial shape of the simulated cell as a sphere or ellipsoid (or some other simple geometric shape), which takes many fewer bits to specify than its actual shape. Same thing with the distributions of molecules inside the cell. I probably don’t need information about the exact location of each molecule, but can approximate the distributions using relatively simple density gradients.
So even after reading your comment, I think it’s likely that most of the complexity (in the sense of number of bits needed to implement a simulation) of a cell is in its genome. Am I wrong in any of my guesses above, or missing something else?
You can implement A simulation. But making that simulation correspond to any particular thing that has existed in the real world is harder.
Physics itself is not hard. Applying it to large numbers of particles is hard.
As for non-DNA molecules, there are all kinds of small-molecule metabolites which are constantly being converted back and forth, some of which are very important (they bind to the big molecules, are part of metabolism, and I have seen some brand-new research about particular proteins that only fold properly around a ‘nucleus’ formed by a particular 6-carbon molecule). But the main point I was trying to make was more along these lines (addressing the third point):
Shape is more detailed than general cell shape. There is fine structure in terms of internal fibers, distributions of molecules, impermeable barriers that segregate things, etc. Some of this, like the aforementioned membranes in bacteria (self-perpetuating but never-made-from-scratch compartments that distill their components out of the general cell milieu), doesn’t necessarily have the DNA as a determinant but rather as something that sets up the circumstances in which it is stable. Other things, like the amounts and simple distributions of molecules, all come from previous states, and most possible distributions don’t correspond to any real state (though doubtless many of them would be unstable and collapse down to one attractor or another that normally exists once you instantiate them).
I have a hard time trying to think of the nature of the correspondence between these things and bits for a simulation besides positions of molecules, and I’m not sure in what context those bits are specified. A little help?
What you do is write a program that generates a set of particles and places them into the simulated cell, such that the resulting cell is viable and functionally equivalent to the original cell. Take the program and count its length in bits. If you haven’t programmed before you may not have much intuition about this. In that case think of it this way: if you have to describe the shape/internal structure/distributions (ETA: and structures) of molecules, in natural language and/or mathematical notation, in sufficient detail that someone else could create a physics simulation of the cell based on your description, how many bits would that take, and what fraction of those would be taken up by the DNA sequences?
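A crude way to compare those description lengths is to use a compressor as a stand-in for program length (a standard but loose proxy for Kolmogorov complexity). The numbers and names below are placeholders I invented for illustration; the random digits stand in for “explicit coordinates of every molecule”:

```python
import random
import zlib

random.seed(0)

# Description A: explicit coordinates for thousands of molecules,
# simulated here as a long run of random digits (hard to compress).
coords = "".join(f"{random.random():.6f}" for _ in range(30000))

# Description B: a parametric stand-in ('spherical cell of radius r with an
# exponential density gradient') -- just a handful of numbers.
params = "sphere r=5.0e-6 gradient=exp scale=1.2e-6 n=10000"

print(len(zlib.compress(coords.encode())),
      len(zlib.compress(params.encode())))
```

Even after compression, the explicit-coordinates description stays enormous while the parametric one stays tiny, which is the sense in which “sphere plus density gradient” may capture most of what matters in far fewer bits.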
Seconding RichardKennaway. Thanks for writing that! It would be great to have more scientifically literate people posting here. Like Ilya Shpitser, a student of Judea Pearl who was attracted by Eliezer’s writings about causality and is trying to give a more informed perspective.
Do you know of any good tutorials of DNA structure covering the elements you describe?
I’m a member of a forum for Jaguar owners. Somebody wrote a book, XJ6, Bumper to Bumper, where he walks you through the entire car, every major part, from front to back.
That kind of thing for DNA would be extremely helpful. Something that shows all the different kinds of functional blocks within DNA, and how they’re all pieced together. When they say a gene, is it a pattern of DNA starting at base pair X? Starting at functional block X? How does copy number variation work? Are all the copies in the same place?
Your post had a bunch more types of functional units. I’d love to have a nice PBS video with pretty colors and a soothing narrator giving me the tour, and pointing out those functional units and others as we stroll down a DNA strand.
I’m not ignorant on purpose. I’ve tried to look for this kind of information, but I’ve never been satisfied with the material I’ve found. I haven’t found anything really tied to DNA structure, only a “this, that, and the other” verbal listing.
A biology textbook? Perhaps that big brown one that says “Biology” on the front.
Textbooks are great!
Molecular Biology of the Cell (Alberts) is a textbook of cell biology that gets a lot of recommendations; when you go to a talk where someone is investigating something and figures out a wrinkle or complication to some process they will often call the general consensus “the Alberts version”. Has all kinds of information on general cell biology, with the information on DNA coding and function spread in a couple of chapters.
On the rather more technical end, I loved the book “The Origins of Genome Architecture” by Dr. Michael Lynch. Half of it is a massive review of the features you find in eukaryotic genome structure and how they vary across the tree of life, with a major focus on evolutionary biology and the mechanisms of how they change over time, and with a bibliography that must be a fifth of the book. The other half is basically the author pushing his idea that for eukaryotes, and multicellular organisms in particular, most of their genome architecture got the way it is via non-adaptive processes that have more to do with the sorts of mutation that are possible than with their end fitness results. It makes a compelling argument and shows all kinds of details of DNA function I did not know about until I read it, but it gets very technical very fast and is written in many parts like a scientific paper with biologists in mind.
An intro biology textbook will not cover DNA in the detail CellBioGuy touched on. You’d need to read an intro Bio book and then maybe intro Molecular Biology, and then find a book focusing on DNA.
Even better.
Thank you for that, it deserves an award for the most useful first post ever.
So, as a non-biologist who just has a layman’s knowledge of these things, the idea of reconstructing an organism from nothing but its DNA would be like reconstructing a car engine from the crankshaft? The causal chains that make an engine go all pass through the crankshaft: it moves the pistons up and down, drives the fuel pump to bring petrol to the cylinders, drives the camshafts that open and close the valves at the right time, and receives the energy of the power stroke and passes it to the transmission. So obviously the crankshaft is how the whole thing works, and complete knowledge of the crankshaft must in principle let us reconstruct the whole engine.
(I am indebted to my colleague William Powers for the crankshaft as an illustration of how to misunderstand cyclic causal chains.)
Interesting analogy. I like it.
It is worth noting that you can bootstrap a genome with non-DNA machinery which is good enough, and get it to work and eventually equilibrate as it makes its own native machinery. That’s how Craig Venter and company ‘created’ their ‘artificial’ bacterium a while back—they chemically synthesized a near-exact copy of one of the smallest and simplest known bacterial genomes (no small feat by the way, chemical synthesis of DNA is expensive and error-prone during the assembly steps of smaller pieces into the full chromosome) with a few watermarks to show it was theirs, and shoved it into another related bacterial species (again no small feat). They were close enough that that other species’ proteins were able to run that new genome’s genes, getting a causal cycle going, and after a few dozen generations all the old proteins/metabolites/etc had been diluted out and replaced with those made by the artificial genome. This of course basically requires a related living system, on top of something like the basic definition of the genetic code.
The fact that nongenetic information is perpetuated in the form of epigenetic state, physical shape and arrangement, etc. adds another loop that many times doesn’t even feed through the genome at all, but rather relies on the genome and the proteins it codes for to set up a landscape of attractors that these factors are capable of occupying, and which ones can lead to which others along paths through which the cell can stay alive.
If you could somehow synthesize compatible non-DNA machinery capable of interacting with a particular genome, you might be able to get a cell going. But the questions are: how do you do that given just the genome and the genetic code to go on; once you do, does the system fall into any of the normal stable attractors; and how much of that organism consists of things like the double membrane of a gram-negative bacterium, passed on from its parents, with only its capability for replication really represented by the genome or its products? All membranes come from previously existing membranes, things like that. Animal cells, with all the fun attractors they fall into, from every different cell type to cancer to all of their intercellular interactions, are examples of cells that have their history as a vital component of their identity.
How do most of these objections not apply also to computer programs? Computer programs are physical objects, and what the program actually does depends entirely on the physical machinery that runs it.
I would say the main difference is that computer systems work to embody the same bit string in widely varying substrates and perform the same logical operations on it. It doesn’t matter if a program is stored on magnetic domains in a tape drive and executed in vacuum tubes, or if it is stored in electrons trapped in flash memory and executed in a 22 nanometer process CPU, the end result of a given set of logical operations is the same. In biology though there really isn’t a message or program you can abstract away from the molecules bouncing around, there is only one level of abstraction. You cannot separate ‘hardware’ and ‘software’.
Assuming “bit string” means “machine code”, this isn’t true. The same machine code will not result in the same logical operations being performed on all computers. It may not correspond to any logical operations at all on other computers. And what logical operations are carried out depends entirely on “the molecules bouncing around” in the computer. You aren’t making DNA sound different from machine code at all.
Good point regarding machine code, I wasn’t thinking at that level of detail. But at this point the similarity is metaphorical at best.
The metaphor fails to go far enough, I think, because the non-DNA context and the DNA are working at the same level. Both are objects with shapes that interact. Yes, software is always embodied in matter, and yes, processing happens in matter rather than in some kind of abstract logical space, but part of the point of most computers is that the same matter can carry all kinds of different patterns. In biology the DNA builds the context and the context builds the DNA, and both alter each other when they interact. The interactions also produce effects that tend to much more closely resemble physics-model-type actions, of the sort that can be modeled via differential equations once you are at a large enough scale that the single-molecule variances average out, rather than really embodying particular operations or logics.
I get the feeling there is an inferential gap happening here...
That’s because it is an algorithm. What else would it be?
Of course those are logical operations being performed on a bit string, again, what else would they be? Magical uncomputable non-functions?
Your major point—which I agree with—seems to be that there are a lot of hard-to-quantify factors and influences that go into determining the result, and that a focus on just DNA does not capture those interactions. But that just means a (mathematical/algorithmic) description that merely focuses on DNA would be on some levels inadequate, that an actual complete description of a cell may take additional information. However, that doesn’t mean that more thorough description wouldn’t also be an algorithm, a program that the cell executes / that describes the cell’s behavior. It would merely be a more complex one. Even that is debatable:
I agree that a long inert DNA polymer wouldn’t do much on its own, just as the same information on some HDD wouldn’t. However, from an informational perspective, if you found a complete set of a mammoth’s DNA and assorted “codings”, would that be enough for a resourceful agent to reconstitute a mammoth specimen? If the agent is also aware of some basic data about cellular structures—certainly. But I’d reckon that given enough time, even an agent with little such knowledge would figure out much of the DNA’s purpose, and eventually be able to recreate a mammoth oocyte to fill with the DNA. If that were indeed so, that would mean that from an informational perspective, the overhead was not shown to be strictly necessary to encode the “mammoth essence”—at least most of it.
I think you could reword my point to be something like: by the time you are doing something algorithmically/computationally that really recapitulates the important things happening in a cell, you are doing something more akin to a physics simulation than to running Turing machines on DNA tape. At that point, when your ‘decompression algorithm’ is physics itself, calling it an algorithm seems a little out of place.
In another post just now I wrote that a genome and its products also define a whole landscape of states, not one particular organism. I can’t help but wonder just how huge that space of living states is, and how many of them correspond to normal cell types or cell states in such a mammoth, and how intractable it would be to find that one set of states that corresponds to ‘mammoth oocyte’ and produces a self-perpetuating multicellular system.
In your opinion, are there any physical processes which are not algorithms?
(Qualifier so I’m not drowning in some reality fluid) I’d say that there aren’t any physical processes that cannot be completely described as algorithms.
double post
I wouldn’t overestimate its additional complexity, especially given that most of it ultimately derives from the relationships between different areas of the DNA sequence itself. For the predictability of results across slightly different states, consider the success and predictability of results (on a viral level, not a clinical-results level) from manipulating lentiviruses and AAVs; see for example this NEJM paper.
No physics-level simulation needed to accurately predict what a cell will do when switching out parts of its genome.
If it were otherwise (if you think about it), the whole natural virus ecosystem itself would break down.
EDIT: A different example that comes to mind: insulin, which is synthesized in a laboratory strain of Escherichia coli bacteria that has been genetically altered with recombinant DNA to produce biosynthetic human insulin. No surprises there either.
My main response there is that in those situations, you are making one small change to a pre-existing system using elements that have previously been qualitatively characterized. In the case of the viral gene therapy, it’s a case of adding to the cell a DNA construct consisting of a crippled virus that can’t actually replicate in normal cells for insertion purposes, and a promoter element that turns on a reading frame next to it in any human cellular context along with the reading frame for the gene in question which has had all the frills and splice sites removed. In the case of insulin in bacteria, it’s a case of adding to the bacteria the human insulin reading frame and a few processing enzyme reading frames, each attached to a constantly-on bacterial promoter element. The overall systems of the cells are left intact, and you are just piggybacking on them.
You can do things like this because in many living systems you have elements that have been isolated and tested, of which you can say “if I stick this in, it will do X”. That has for the most part been figured out empirically over a long time, by putting into cells elements that are whole, truncated, or mutated in some way and seeing which ones work and which ones don’t. These days, examining their chemical structures, we have physical and chemical explanations for a bunch of them and how they work, and we are starting to get better at predicting them in particular organismal contexts, though it’s still much, much harder in multicellular creatures with huge genomes than in those with compact genomes*.
When I was saying that physics-like things were needed, I was referring more to a situation in which you do not have a pre-existing living thing and are trying to work out what an element does from its sequence alone. When you can test things in the correct context and start figuring out which proteins and DNA elements are important for what, you can leap over this and tell what is important for what even before you really understand the physical reasons. If you were starting from just the DNA sequence and didn’t really understand what the non-DNA context for it was, or possibly even how the DNA helps produce the non-DNA context, you get a much less tractable problem.
*(It’s worth noting that the ease of analysis of noncoding elements is wildly different in different organisms. Bacteria and yeast have compact promoter elements with DNA sequences of dozens to hundreds of base pairs each, often with easily identifiable protein binding sites, while in animals a promoter element can be in chunks strewn across hundreds of kilobases (though several kilobases is more typical) and is usually defined as ‘this is the smallest piece of DNA we could include and still get it to express properly’, with only a subset of computationally predicted protein binding sites actually turning out to be functionally important. A yeast centromere element for fiber attachment to chromosomes during cell division is a precisely defined 125-base-pair sequence that assembles a complex of anchoring proteins on itself, while a human centromere can be the size of an entire yeast genome and is a huge array of short repeats that might just bind the fiber-anchoring proteins a little bit better than random DNA. Noncoding elements get larger and less straightforward much faster than coding elements as genome size increases.)
EDIT: As for viral ecosystems: viruses can hop from species to species because related species share a lot of cellular machinery, even when the splits happened hundreds of millions of years ago, and the virus just has to work well enough (and will immediately start adapting to its new host). Seeing as life is more than three gigayears old, though, there are indeed barriers that viruses cannot cross. You will not find a virus that can infect both a bacterium and a mammal, or a mammal and a plant. When they hop from species to species or population to population, the differences can render some species resistant or change the end behavior of the virus, and you get things like the simian immunodeficiency virus hardly affecting chimps while HIV, separated from it by only a century, kills its human host.
I can’t help but feel this is related to (what I perceive as) a vast overrating of the plausibility of uploading from cryonically-preserved brain remnants. It’s late at night and I’m still woozy from finals, but it feels like someone who’s discovered they enjoy, say, classical music without much grasp of music theory or even the knowledge of how to play any instruments figuring it can’t be too hard to just brute-force a piano riff of, say, the fourth movement of Beethoven’s 9th if they just figure out by listening which notes to play. The mistake being made is a subtler and yet more important one than simply underestimating the algorithmic complexity of the desired output.
As always, there are a lot of angles and subtle points where the answer to your proposition stands and falls with slight interpretational variances of what you mean exactly.
DNA on its own cannot reproduce, however that is not a function of missing data so much as of missing the actual apparatus (the “data” of which is already encoded in the DNA). Note: I’m lumping epigenetic and assorted information storage together with “DNA”.
In terms of “data”, the question you should ask yourself is this: Could an alien with unlimited resources reconstitute a human being from solely its genetic code? How much additional information would be necessary? This doesn’t refer just to the cellular details so much as to the environmental habitat as well, from pressure levels in the macroenvironment to the general cytokine soup in the microenvironment. Much of that is shared across species and organisms, does that count?
In short, are you interested in the complexity of an organism given an empty tape on a UTM, or of an organism given the biosphere of the planet earth (minus that organism/species) already provided?
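The “empty tape on a UTM vs. given the biosphere” distinction is essentially plain versus conditional description length, K(x) versus K(x|y). A toy illustration, using zlib’s preset-dictionary feature as a stand-in for “already having the shared context” (the sequences below are random placeholders, not real genomes):

```python
import random
import zlib

random.seed(1)

# A stand-in 'biosphere': a long stretch of shared context.
shared_context = bytes(random.choice(b"ACGT") for _ in range(20000))

# The 'organism': mostly material drawn from that shared context,
# plus a little novelty of its own.
organism = shared_context[500:2500] + b"NOVELBIT"

# Plain description length: compress with no prior context.
plain = len(zlib.compress(organism))

# Conditional description length: compress *given* the shared context,
# supplied as a preset dictionary.
c = zlib.compressobj(zdict=shared_context)
conditional = len(c.compress(organism) + c.flush())

print(plain, conditional)  # the conditional description is much shorter
```

With the context already in hand, the organism costs only a few back-references to describe; from an empty tape, you pay for all of it. That is the gap between the two versions of the question.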
Also, DNA is not well compressed at all. Refer to the codon table. With 4 to the 3rd power (=64) different possible states per triplet, those map to only about 21 distinct results (20 amino acids plus a stop signal). Talk about redundancy! Not to mention that you referred to the genome, not merely the exome (which has a higher information density). Although non-transcribed areas also retain some functionality, it could probably be functionally losslessly compressed even more effectively.
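The redundancy claim can be put in numbers. Counting 20 standard amino acids plus a stop signal as 21 distinct meanings (counts vary slightly if you include rare assignments like selenocysteine):

```python
import math

codons = 4 ** 3    # 64 possible triplets of the 4 bases
meanings = 21      # 20 standard amino acids + stop signal

bits_stored = math.log2(codons)    # 6 bits per codon as physically encoded
bits_used = math.log2(meanings)    # ~4.39 bits of actual distinction carried

print(f"{bits_stored:.2f} bits stored vs {bits_used:.2f} bits used per codon")
```

So roughly a quarter of each codon’s raw capacity is redundancy, before even considering larger-scale repeats in the genome.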
Does “its” genetic code include, say, mitochondrial DNA?
Er, in the presence of mutation, you want redundancy to continue functioning, which is relevant to whether or not it’s “well” compressed.
In context, “well compressed” meant “highly compressed”. It’s true there’s a reason for DNA not being highly compressed, but that doesn’t change the fact that it isn’t.
Yes, this is pretty much what I was trying to find out. I was also asking whether people are aware of this information gap, and whether that gap exists at all.
Pretty sure that’s a no. Source: I’m a Plant Biology grad student. CellBioGuy could probably be more definitive.
Anyone know of a decent tutorial on chromosome structure?
People like to simplify physical structures, to the point where, when I think about them, I start seeing gaps in the simplified model.
Certain genes are on certain chromosomes, right? But do the genes on a chromosome always appear in the same order? How does copy number variation work? Do multiple copies always lie together, or can they be strewn about? And if there is copy number variation in a gene, how does that interact with SNPs? Are all the copies really identical, or can I have two different versions of the same gene? Is copy number variation handed down as well?
It's a string of molecules, but I've never seen a full description going from the string to the genes. Walk me down a chromosome and point out what we see as we move along, base pair by base pair.
To try and really quickly answer some of these:
A chromosome is a single long DNA molecule and its associated bound proteins (the blobs you see as chromosomes during cell division are well over half protein). When a cell is not dividing they are decondensed blobs, but they fold up neatly into those cords you see in micrographs for segregation during division. A chromosome needs a centromere (the attachment point for the fibers that pull it into daughter cells after replication), origins of replication (regions where weak base pairing is pulled apart once per cell division to allow replication to begin), and, if it is a linear chromosome, telomeres (easy-to-rebuild-after-shortening repeats at the ends), because the ends of linear DNA molecules are difficult to replicate all the way out to the end.
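To make the anatomy described above concrete, here is a toy map of a linear chromosome with the features named. The coordinates and sizes are invented purely for illustration, not drawn from any real chromosome:

```python
# Hypothetical feature map of a made-up linear chromosome.
# (start, end) positions are in base pairs and purely illustrative.
features = [
    (0,       10_000,  "telomere (repeat array, rebuilt after shortening)"),
    (50_000,  51_000,  "origin of replication"),
    (120_000, 125_000, "centromere (spindle-fiber attachment point)"),
    (200_000, 201_000, "origin of replication"),
    (290_000, 300_000, "telomere"),
]

# "Walk down" the chromosome, printing each landmark in order.
for start, end, label in features:
    print(f"{start:>8,}-{end:<8,}  {label}")
```

Real chromosomes have many origins of replication scattered along their length, and the space between these landmarks is filled with genes, regulatory sequences, and repetitive DNA.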
In a given species the order of genes (and everything else) will be the same on a given chromosome, as within a cell one chromosome is often repaired using the other copy as a template, and when generating gametes for sex the two chromosome copies exchange fragments. Individuals can have differences in the order of genes without having physical problems, provided the cut/paste points don't fall in the middle of a functional element, though there may be less efficient reproduction when mating with those carrying the 'normal' arrangement. Over evolutionary time chromosome fragments do break and get shuffled around, but you can see between, say, mice and humans that our chromosomes consist of chunks of an apparent ancestral set of chromosomes that have been differently cut and pasted back together in the two lineages. Sometimes individual genes move, but this is rare.
Asexual species can have their genomes shuffle around a lot faster because they don't have to stay roughly compatible with their mating partners.
Almost all the time when you have copy number variation of genes, it is due to there being multiple copies of the gene lying right next to each other in tandem. Only a small subset of genes usually have copy number variation, but it CAN exist in a lot of places where it doesn't TYPICALLY exist. These blocks of tandem gene repeats are usually handed down unchanged, but they do change now and then when gametes are being made for sex. When one chromosome breaks and switches pieces with another, if a breakpoint happens inside one of the repeated genes, it could attach itself into any of the multiple repeats with similar sequence, allowing one gamete to gain repeats at the expense of another produced by the same parent cell (2 → 3 + 1, say). As such, these repeats are unstable and can expand or contract quickly over evolutionary time. I don't know how much research has been done into genes whose multiple copies carry different versions of the sequence.
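The 2 → 3 + 1 outcome described above can be sketched as a toy model of unequal crossing-over. The function and repeat names are invented for illustration; the only point is that a misaligned breakpoint inside a tandem repeat array lets one gamete gain a copy at the other's expense while the total is conserved:

```python
import random

def unequal_crossover(a, b, offset=1):
    """Toy recombination of two tandem-repeat arrays whose pairing is
    misaligned by `offset` repeats, so the two cut points differ."""
    cut = random.randrange(1, min(len(a), len(b)))
    # Chromosome b is paired shifted by `offset` repeats relative to a.
    return a[:cut + offset] + b[cut:], b[:cut] + a[cut + offset:]

random.seed(0)
parent1 = ["geneA"] * 2   # two tandem copies on one homolog
parent2 = ["geneA"] * 2   # two tandem copies on the other
g1, g2 = unequal_crossover(parent1, parent2)
print(len(g1), len(g2))   # copy numbers now differ: one gamete gained,
                          # the other lost, but 3 + 1 = 2 + 2
```

Run repeatedly with different array lengths, the same mechanism lets repeat blocks expand or contract across generations.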
Of course the description of the genome presented here is an over-simplification. The genome and the cell are much more complicated than a simple Turing machine. However, I have to emphasize three main ideas:
First, saying that the biological machinery is more complex doesn't mean that you cannot evaluate the complexity of an organism from its genome. It just means that you need to make your model more complex, taking the necessary parameters such as RNA interference, epigenetics, and cellular activity into account. The technical difficulty of taking one organism's DNA and transferring it to another (changing the program) really just says something about our current technology, not about any theoretical impossibility of doing so. I believe these kinds of experiments will be entirely possible in a metazoan cell within the next decade or two. Time will tell.
Second, in a deep sense, every aspect of an organism is encoded in the genome. The organelles and membrane structure, the biochemical activity, the systems biology, the development of a multicellular organism, the epigenetics itself: all of it eventually traces back to specific genomic sequences. These can be regulatory sequences, epigenetics-related sequences, protein-coding genes, RNA genes, etc. Even though you cannot directly associate every cellular function with the knowledge we currently have about the genome, that does not mean it isn't there. It has to be. The reason is that the "programmer" of the genome, namely evolution, works primarily by changing the genome. The genetic code is the information that lasts across the ages, and it has to encode all the programming necessary to make an organism, including the entire cellular and organismal biochemistry and epigenetics.
Third, if you want to estimate the true complexity of the genome, you need to appreciate the algorithms it actually encodes. The genome encodes thousands of proteins that form a computational system in their own right. For example, it encodes proteins that can sense the environment, evaluate a huge number of possible states by constructing large networks of protein interactions and signaling, and eventually change the entire biochemistry of the cell, and even the genome itself. So the real complexity of the cell is not hidden in the number of proteins or RNAs it encodes. The true complexity is in the regulatory system that determines which protein to express when, and how to react to environmental cues. In multicellular organisms it is even more intricate: each cell (with the same genome) has a different function, determined by the states of protein networks that responded to complicated signaling and internal states during development. The genome patterns a remarkably complex organism without actually encoding all the developmental stages, just the mechanisms that can compute and respond to the environment. Another example is the immune system: instead of encoding all possible responses to all possible pathogens (infinite possibilities) inside the genome, you encode a learning system that can gradually learn to react to different threats as needed. So again, the genome encodes another computational system. One final example is the animal brain: the genome does not encode all the information and processing power of the brain. Instead, it encodes the basic developmental and cell-biology processes needed to develop a brain, which can then learn, compute, and react to a changing environment.
So, if you want to generalize the complexity of an organism from its DNA, you have to account for the way this DNA actually compresses its algorithms into systems that can learn, compute, and create an endless number of possible outcomes by themselves. By analogy, it would be like gauging the complexity of a computer program that writes new programs, each of which can respond to the environment and change itself accordingly, with all of these programs interconnected into one big system. To add more complexity, these programs run simultaneously, change one another, and eventually even change the basic code that encoded them in the first place. That is how complex the DNA program is.
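The "encode the generator, not the output" idea above can be illustrated with a deliberately non-biological toy: an elementary cellular automaton, where an 8-bit rule (standing in for the "genome") plus a fixed update mechanism produces a pattern far larger and more intricate than the rule itself. Rule 30 is an arbitrary choice here:

```python
def step(cells, rule=30):
    """One update of an elementary cellular automaton on a ring:
    each cell's next state is looked up from the `rule` bits using
    its left neighbor, itself, and its right neighbor."""
    n = len(cells)
    return [
        (rule >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

row = [0] * 31
row[15] = 1               # a single "seed" cell
history = [row]
for _ in range(15):
    row = step(row)
    history.append(row)

for r in history:
    print("".join("#" if c else "." for c in r))

# 8 bits of "rule" have specified a 16x31 pattern: the complexity lives
# in the generative process, not in a literal blueprint of the output.
```

The analogy is loose, as the poster's own automaton analogy in the original question was, but it makes the compression argument tangible: a short program plus a deterministic mechanism can stand in for an output it never spells out.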
Sure. This is well known. See here, where it says:
Evolution gives humans roughly 25 megabytes of optimization. If aliens understand evolution well enough to know what it's optimizing us for, then, given a list of 2^2000000 possible reproducing organisms with human DNA, they'd have a 50:50 chance of spotting the human. It would take vastly more than 25 megabytes of code to specify a specific human cell minus DNA, or even one that passes for human. However, if you're limited to organisms that could actually reproduce, and you ignore unreasonably complex organisms, I think it would be enough, and you'd still have enough left over to specify how the womb works. You could even specify evolution, so the aliens would only have to know it's the result of some optimization process. As such, aliens, given only our DNA, ungodly computing power, and the knowledge that they were handed more than random bits, would be able to work out what we are.
See the part on inner message/outer message and DNA in Douglas Hofstadter’s Godel Escher Bach.