What we are trying to do is infer the evolutionary mechanisms by which organisms evolve increasingly complex systems of interacting proteins. A protein interaction network is a map of the organism’s proteins and their interactions in the form of a graph: we represent each protein with a node, and draw an edge between two proteins if they can form a complex.
One of the first observations made about protein networks is that the degree distributions of their graphs seem to obey a power law: most proteins have only a few interaction partners, while a few nodes have a huge number of partners and act as 'hubs' for the network. Furthermore, hubs are less likely to interact with other hubs.
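To make the degree distribution concrete, here is a minimal sketch in Python, run on a toy edge list rather than any real interaction map, of how one would tally degrees and pick out a hub:

from collections import Counter

# Toy edge list standing in for a real protein interaction map.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]

degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# How many proteins have each degree; a power law means P(k) ~ k^(-gamma),
# i.e. most proteins have small degree and a few 'hubs' have large degree.
degree_distribution = Counter(degree.values())
print(degree_distribution)      # Counter({2: 3, 3: 1, 1: 1})
print(degree.most_common(1))    # [('A', 3)] -- the hub of this toy network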
One evolutionary mechanism proposed to explain this power law is the preferential attachment model. In this model, the interaction network grows by gaining new proteins one at a time: each added protein forms a single edge with an existing protein in the network, and it is disproportionately likely to form that edge with a 'hub.' (Technically speaking, the probability of choosing a particular existing node is proportional to that node's degree.)
The preferential attachment model explains the power law distribution, but is too simplistic by itself: since every new protein arrives with exactly one edge, it can only produce networks with tree-like structures.
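Here is a minimal sketch of this growth rule, assuming we start from a single interacting pair; the function name and parameters are illustrative, not from any published implementation:

import random

def grow_preferential_attachment(n_proteins, seed=None):
    rng = random.Random(seed)
    # Start from a single interacting pair so every protein has degree >= 1.
    edges = [(0, 1)]
    degree = {0: 1, 1: 1}
    for new in range(2, n_proteins):
        # Listing each existing protein once per unit of degree makes a
        # uniform draw equivalent to choosing a partner with probability
        # proportional to its degree.
        candidates = [node for node, d in degree.items() for _ in range(d)]
        partner = rng.choice(candidates)
        edges.append((new, partner))
        degree[new] = 1
        degree[partner] += 1
    return edges

print(grow_preferential_attachment(10, seed=0))

Because each new protein contributes exactly one edge, the output is always a tree, which is the limitation noted above.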
Another model for network growth is DMC (duplication-mutation-complementation), also called DDA (duplication-divergence-attachment), a model based on the biological phenomenon of gene duplication.
In DMC, there is a duplication step and a mutation step. In the duplication step, we duplicate one of the existing proteins at random. Say protein O is the existing protein and protein P is its duplicate. The duplicate interacts with all of the interaction partners of the original protein: if protein O formed complexes OE, OF, OG, then the duplicate forms complexes PE, PF, PG.
In the mutation step, the original and the duplicate degenerate in a complementary fashion. Each can lose the ability to form certain complexes, so long as the other copy still provides a backup for that particular complex. For example, protein O can lose the ability to form complex OE, but as long as its copy P can still form PE, the organism remains viable. Thus an organism might go from having complexes {OE, OF, OG, PE, PF, PG} after the duplication step to having {PE, OF, OG} after the mutation step.
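A minimal sketch of one DMC round under these rules, assuming a single loss-rate parameter q for the mutation step (the names and exact bookkeeping are illustrative):

import random

def dmc_step(adjacency, q_loss, rng):
    # adjacency: dict mapping each protein to the set of its interaction partners.
    # Duplication: pick an existing protein at random and create a copy that
    # initially shares all of the original's interaction partners.
    original = rng.choice(list(adjacency))
    duplicate = original + "_dup"
    adjacency[duplicate] = set(adjacency[original])
    for partner in adjacency[duplicate]:
        adjacency[partner].add(duplicate)

    # Mutation: for each shared partner, with probability q_loss one of the
    # two copies (chosen at random) loses that interaction. Complementarity
    # means the interaction is never lost from both copies in the same round.
    for partner in list(adjacency[original]):
        if rng.random() < q_loss:
            loser = rng.choice([original, duplicate])
            adjacency[loser].discard(partner)
            adjacency[partner].discard(loser)
    return adjacency

network = {"O": {"E", "F", "G"}, "E": {"O"}, "F": {"O"}, "G": {"O"}}
print(dmc_step(network, q_loss=0.5, rng=random.Random(1)))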
We can propose models for network generation based on preferential attachment, duplication-mutation-complementation, or a mixture of the two, as was proposed in Ratmann (2007).
For any particular model we want to see how well it explains the observed data. Given models M1, M2, M3, we can compare the probability of observing the data given the models, P(D|M1), P(D|M2), P(D|M3). We say that the model with the highest probability is the model which best explains the data.
However, a particular model might have a parameter q. For example, in the duplication-mutation-complementation model, we might want to vary the rate at which complexes are lost in the mutation step. For a model M with a parameter q, the probability of observing the data will depend on the parameter. For a certain parameter value, we write P(D|M,q) for this probability.
Now there are two different approaches to calculating P(D|M) in this situation. The Bayesian approach is to assume a prior distribution (such as a uniform distribution) on the parameter, P(q), and then compute P(D|M) = Integral[P(D|M,q)P(q) dq] over the parameter space.
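As a sketch of that calculation, assuming a uniform prior on q over [0, 1] and a toy stand-in for P(D|M,q) (computing the real likelihood for a network model is the hard part and is not shown here):

import math

def marginal_likelihood(likelihood, n_points=200):
    # Uniform prior on [0, 1] has density P(q) = 1, so the integral
    # P(D|M) = Integral[P(D|M,q) P(q) dq] reduces to the average of the
    # likelihood over the interval; approximate it with a midpoint sum.
    dq = 1.0 / n_points
    return sum(likelihood((i + 0.5) * dq) for i in range(n_points)) * dq

# Toy stand-in for P(D|M,q), peaked at q = 0.3, purely for illustration.
toy_likelihood = lambda q: math.exp(-50.0 * (q - 0.3) ** 2)
print(marginal_likelihood(toy_likelihood))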
The classical approach of maximum likelihood is simply to take P(D|M) = max[P(D|M,q)] over the parameter space. Though if we are comparing models, we will want to apply a penalty factor for the complexity of the model, so in fact we will take P(D|M) = c max[P(D|M,q)], where c is a complexity penalty (e.g. according to the Akaike Information Criterion).
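Concretely, AIC = 2k - 2 ln(max likelihood) for a model with k free parameters, and we prefer the model with the smallest AIC, which amounts to taking c = e^(-k). A small sketch with placeholder numbers, not results from fitting any real network:

def aic(max_log_likelihood, n_params):
    # Akaike Information Criterion: 2k - 2 ln(max likelihood); lower is better.
    return 2 * n_params - 2 * max_log_likelihood

# Placeholder (max log-likelihood, number of parameters) pairs, purely for
# illustration of how the penalty trades off against fit.
models = {
    "preferential attachment": (-120.0, 0),
    "DMC": (-112.0, 1),
    "PA/DMC mixture": (-111.5, 2),
}

scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
print(scores)
print("best model:", min(scores, key=scores.get))

In these made-up numbers the mixture fits slightly better than DMC alone, but not by enough to justify its extra parameter, so the penalized comparison favors DMC.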