I have a program that estimates the probability that one gene has the same function as another gene, based on their similarity. The estimate uses the % identity of amino acids between the proteins, and the % of the larger protein that is covered by an alignment with the shorter protein.

For various reasons, this is done by breaking %id and %len into bins, e.g. 20-30% id, 30-40% id, 40-50% id, … 30-40% len, 40-50% len, …, and estimating, for each bin, the probability that two proteins matched in that way have the same function.
What I want to do is to reduce the number of bins, so there are only 3 bins for %ID and 3 bins for %len, and 9 bins for their cross-product.
I can gather a bunch of statistics on matches where we think we know the answer. The frequentist statistician could take, say for %id, each adjacent pair of the original 10 bins, do an ANOVA and look at the F-statistic, then retain the 2 boundaries with the largest F-statistics.
What would the Bayesian do?
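(For concreteness, here is a minimal sketch of the frequentist procedure I have in mind, in Python. The inputs pct_id and same_function, the use of scipy, and the 10 equal-width bins are my own illustrative assumptions, not the actual program.)

```python
# Sketch: score each interior boundary of the original 10 %id bins by a one-way
# ANOVA between the two adjacent bins' same/different labels, then keep the 2
# boundaries with the largest F-statistics.
import numpy as np
from scipy.stats import f_oneway

def top_boundaries(pct_id, same_function, n_bins=10, keep=2):
    """pct_id: array of % identities (0-100); same_function: array of 0/1 labels."""
    pct_id = np.asarray(pct_id)
    same_function = np.asarray(same_function, dtype=float)
    edges = np.linspace(0, 100, n_bins + 1)          # original equal-width bin edges
    which = np.digitize(pct_id, edges[1:-1])         # bin index (0..n_bins-1) per pair
    scores = []
    for b in range(1, n_bins):                       # interior boundaries only
        left = same_function[which == b - 1]
        right = same_function[which == b]
        if len(left) > 1 and len(right) > 1:
            f_stat, _ = f_oneway(left, right)        # ANOVA on the 0/1 labels
            scores.append((f_stat, edges[b]))
    scores.sort(reverse=True)                        # largest F first
    return sorted(edge for _, edge in scores[:keep]) # the 2 strongest cut points
```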
To make sure I’m interpreting this correctly: the calibration data is a list of pairs of genes, along with their %id and %len, each pair tagged as either “same function” or “different function”? And currently these are binned, and the probabilities are estimated from the statistics within each bin?
You want to change this, in particular to reduce the number of bins. Before we get to “how”, may I ask why you want to do this? It doesn’t seem as if it would reduce the computational cost. It would increase the number of samples per bin and possibly give better discrimination, but at the same time it spreads the gene pairs being compared against over larger regions of parameter space, meaning your inference is now based more on gene pairs that should have less relevance to your case...
To make sure I’m interpreting this correctly: the calibration data is a list of pairs of genes, along with their %id and %len, each pair tagged as either “same function” or “different function”?

Yes.

Before we get to “how”, may I ask why you want to do this?

Not enough samples for a large number of bins.
Ideally I’d use a different method that would use the numbers directly, in regression or some machine-learning technique. I may do that someday. But there are institutional barriers to doing that.
So would I, but that would be a research project.
There is no direct Bayesian prescription for the best way of binning, though the motto of “use every scrap of information: throw nothing away” implies to me that the proper thing to do is to minimize the information left out once we know the bin. A bin is most informative if the statistics of the bin have the least entropy. So select a binning that does this and obeys whatever other reasonable constraints you want, such as the bins being contiguous, or dividing directly into 9 bins as the cross-product of 3 on each axis.
A natural measure of the entropy is just -p log p - (1-p) log(1-p), where p is the observed frequency in the bin, but it’s not the right one. I’m going to argue that instead we want a different measure of entropy: that of the underlying posterior distribution. This is essentially the information we are still lacking once we know the bin.
With no prior information, and count data, this is a Beta distribution with parameters (number in the bin judged to be “same”) + 1 and (number judged to be “different”) + 1. There is an entropy formula in the Wikipedia article. EDIT: be careful about signs, though; the article currently appears to give the negative of the entropy.
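(A minimal sketch of that per-bin entropy, assuming counts n_same and n_diff and a flat prior; the function name is mine, and scipy.stats.beta(a, b).entropy() gives the same number.)

```python
# Differential entropy of the Beta(n_same + 1, n_diff + 1) posterior for one bin,
#   H(Beta(a, b)) = ln B(a, b) - (a-1)*psi(a) - (b-1)*psi(b) + (a+b-2)*psi(a+b),
# with the sign convention that a broader (less informative) posterior scores higher.
from scipy.special import betaln, digamma

def beta_posterior_entropy(n_same, n_diff):
    a, b = n_same + 1.0, n_diff + 1.0      # flat prior -> add 1 to each count
    return (betaln(a, b)
            - (a - 1) * digamma(a)
            - (b - 1) * digamma(b)
            + (a + b - 2) * digamma(a + b))
```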
Because we’re concerned about the gain per gene pair, naturally each bin’s entropy should be weighted by how often it comes up—that is, the number of samples in the bin (perhaps +1).
Does this seem like a reasonable procedure? Note that it doesn’t directly push differing bins toward differing predictions; instead it minimizes the uncertainty within each bin. In practice, I believe it will have the same effect. A slightly more ad-hoc thing to try would be minimizing the variance in each bin, rather than the entropy.
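(Putting the pieces together, a rough sketch of the selection itself, reusing beta_posterior_entropy from the sketch above: score a candidate 3-bin merge of the original 10 bins by count-weighted posterior entropy and try every choice of 2 interior boundaries. The input format, a list of per-bin (n_same, n_diff) counts, is an assumption of mine; the 2-D %id x %len case would score the 3 x 3 cross-product of cuts the same way, searching the two axes’ boundaries jointly.)

```python
# Sketch: pick 2 of the 9 interior boundaries of the original 10 bins so that the
# count-weighted Beta-posterior entropy of the 3 merged bins is minimal.
from itertools import combinations

def binning_score(merged_bins):
    """merged_bins: list of (n_same, n_diff) tuples, one per merged bin."""
    total = 0.0
    for n_same, n_diff in merged_bins:
        weight = n_same + n_diff + 1                 # how often the bin comes up (+1)
        total += weight * beta_posterior_entropy(n_same, n_diff)
    return total

def best_3_bins(counts):
    """counts: list of 10 (n_same, n_diff) tuples, one per original bin."""
    best = None
    for lo, hi in combinations(range(1, len(counts)), 2):    # 2 interior cut points
        groups = [counts[:lo], counts[lo:hi], counts[hi:]]
        merged = [tuple(map(sum, zip(*g))) for g in groups]  # pool counts per group
        score = binning_score(merged)
        if best is None or score < best[0]:
            best = (score, (lo, hi))
    return best   # (score, (index of 1st boundary, index of 2nd boundary))
```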
You know what’s funny: my bosses have a “research project bad” reaction. If I say that fixing a problem requires finding a new solution, they usually say, “That would be a research project”, and nix it.
But if I say, “Fixing this would require changing the underlying database from Sybase to SQLite”, or, “Fixing this would require using NCBI’s NRAA database instead of the PANDA database”, that’s easier for people to accept, even if it requires ten times as much work.
Doing some simulations on a similar problem (1-d, with p(x) = x), I’m getting results indicating that this isn’t working well at all. Reducing the entropy by piling a large number of samples into one bin seems to override the reduction in entropy from the probabilities being more skewed, at least for this case. I am surprised, and a bit perplexed.
EDIT: I was hoping that a better measure, such as the mutual information I(X; Y) between the bin (X) and the probability of being the same function (Y), would work. But this boils down to the measure I suggested last time: I(X; Y) = H(Y) - H(Y|X). H(Y) is fixed just by the overall distribution of same vs. not, and H(Y|X) = - sum_x p(x) ∫ p(y|x) log p(y|x) dy, so maximizing the mutual information is the same as minimizing the measure I suggested last time.
What’s the similar problem? “1-d, with p(x) = x” doesn’t mean much to me. It sounds like you’re looking for bins on the region 1 to d, with p(x) = x. I think that if you used the -p log p - q log q entropy, it would work fine.
“1-d” meaning one-dimensional: n bins between 0 and 1, samples drawn uniformly from X = [0, 1], each with probability p(x) = x of being considered the same.
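(For what it’s worth, a minimal sketch of that toy experiment, again reusing beta_posterior_entropy from above, so the behaviour described a few comments up can be checked; the sample size and the two candidate binnings are arbitrary choices of mine.)

```python
# Toy 1-d experiment: x ~ Uniform(0, 1), label "same" with probability p(x) = x,
# then compare candidate binnings of [0, 1] by count-weighted Beta-posterior entropy.
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=10_000):
    x = rng.uniform(0.0, 1.0, n)
    same = rng.uniform(0.0, 1.0, n) < x              # P(same | x) = x
    return x, same

def weighted_entropy(x, same, edges):
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (x >= lo) & (x < hi)
        n_same = int(same[in_bin].sum())
        n_diff = int(in_bin.sum()) - n_same
        total += (n_same + n_diff + 1) * beta_posterior_entropy(n_same, n_diff)
    return total

x, same = simulate()
even = np.linspace(0.0, 1.0, 4)                      # three equal-width bins
lopsided = np.array([0.0, 0.90, 0.95, 1.0])          # one huge bin plus two slivers
print(weighted_entropy(x, same, even), weighted_entropy(x, same, lopsided))
```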
That’s a good idea.
I’m glad you said that, since that was what I immediately thought of doing. I’ll read up on the beta distribution, thanks!
I still think it’s not a great choice, though clearly my other choices haven’t worked well. But please do try it.
Given that the probability is a continuous distribution, the Fisher information might instead be a reasonable thing to look at. For a single distribution, maximizing it corresponds to minimizing the variance, so my suggestion for that wasn’t as ad-hoc as I thought. I’m not sure the equivalence holds for multiple distributions.
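(A quick check of the single-distribution case, taking each bin’s model to be Bernoulli(p), which is my reading of “a single distribution”: the Fisher information about p from one observation is I(p) = 1/(p(1 - p)), while the per-observation variance is p(1 - p), so maximizing the one is exactly minimizing the other. Whether that carries over to choosing boundaries across several bins at once is the part that remains unclear.)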