So, if there are two start sites, you’ll have Q(1) and Q(2) and those will add up to 1? Or are Q(1) and Q(2) decoupled (they could both be start sites, or neither)? O(1) would then be the likelihood ratio that the start is at 1 rather than 2?
Part of my confusion is that you state the priors are 1/numStarts, but then you talk about aggregating experiments where numStarts has different values. This could cause major problems: if you thought that it was either 1 or 2 (that is, putting a prior of 0 on 3 and 4), but measured 1, 2, 3, and 4, then you won’t be aggregating your data very well (since your 0 prior will eat any observations).
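To make the zero-prior worry concrete, here is a minimal sketch. It assumes a single binary proposition updated by one likelihood ratio; the function name `update` and the ratio of 100 are made up for illustration and are not taken from the original setup.

```python
from fractions import Fraction

def update(prior, likelihood_ratio):
    """Bayes update of P(Q) given LR = P(obs | Q) / P(obs | not Q).

    Hypothetical helper for illustration only.
    """
    numerator = prior * likelihood_ratio
    return numerator / (numerator + (1 - prior))

# Assumed, illustrative evidence: an observation 100x more likely if Q is true.
strong_evidence = Fraction(100)

zero_prior_posterior = update(Fraction(0), strong_evidence)
quarter_prior_posterior = update(Fraction(1, 4), strong_evidence)

print(zero_prior_posterior)     # 0
print(quarter_prior_posterior)  # 100/103
```

However strong the evidence, a prior of exactly 0 stays at 0, while a nonzero prior moves a long way.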
It may be easier to express your experiments in the form of P(observation|Q(s)) and then work out what multiple observations mean from the atomic probabilities, rather than starting from a higher level of abstraction. (At the very least, it’ll be easier for us to figure out what’s going on!)
Part of my confusion is that you state the priors are 1/numStarts, but then you talk about aggregating experiments where numStarts has different values.
That’s the source of the difficulty. The priors are different for each sample.
It will confuse you to try to assign a number to s. Each ‘s’ is one start site. It’s an atom that you make propositions about. It doesn’t have content. Biologically, it has a position on the genome, which I use to compute the values of propositions about it.
This is my current model of your problem:
You have a set S of start sites, each of which we can make propositions about. Each one of them has some position on the genome.
You’re interested in looking at each of the start sites and assessing some property: “does this start site overlap the previous gene’s stop site?” If that’s true for the particular start site s, we say Q(s)=1; otherwise, Q(s)=0 (using 0 and 1 as synonymous with true and false). This is unknown, so we refer to our uncertainty as P(Q(s)), which might start off as 1/|S| for all s, or might vary with the start site. Knowing P(Q(i)) doesn’t tell us anything about P(Q(j)).
When we do an experiment, we get back an observation about s; suppose it signals either “heads” or “tails,” which I’ll shorten to H or T. We can calculate P(O(s)=H|Q(s)=1) and P(O(s)=H|Q(s)=0), and their ratio is the likelihood ratio used to update P(Q(s)). Note that the likelihood ratio depends only on those two conditional probabilities of H, and thus is totally independent of the prior probability on Q(s).
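As a rough sketch of that calculation: the two conditional probabilities below are made-up numbers, and `likelihood_ratio` is a hypothetical helper, not anything from your actual experiment. The point is only that the prior never appears.

```python
from fractions import Fraction

def likelihood_ratio(p_h_given_true, p_h_given_false, observation):
    """Return P(obs | Q(s)=1) / P(obs | Q(s)=0) for obs in {'H', 'T'}."""
    if observation == "H":
        return p_h_given_true / p_h_given_false
    if observation == "T":
        return (1 - p_h_given_true) / (1 - p_h_given_false)
    raise ValueError("observation must be 'H' or 'T'")

# Assumed, illustrative values for P(O(s)=H | Q(s)=1) and P(O(s)=H | Q(s)=0).
p_h_given_true = Fraction(2, 3)
p_h_given_false = Fraction(1, 3)

lr_heads = likelihood_ratio(p_h_given_true, p_h_given_false, "H")
lr_tails = likelihood_ratio(p_h_given_true, p_h_given_false, "T")

print(lr_heads)  # 2
print(lr_tails)  # 1/2
```

With these assumed probabilities, a heads reading doubles the odds on Q(s) and a tails reading halves them, whatever the prior was.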
We can do the experiments in batches: on, say, four sites at once. It will give an H or T reading for each start site, and the probabilities may depend on the number of sites measured at once. Thus, the likelihood ratio will differ with the batch size. Suppose we write H2 for a “heads” result when 2 sites are tested at once, H4 for “heads” when 4 sites are tested, and likewise T2 and T4 for “tails.”
Thus, we want to figure out, say, P(Q(s)|&, H2,T4,H4,H4). “&” stands for “all background knowledge,” which will basically be our prior. Assuming the experiments are independent conditioned on Q(s), we can just multiply the odds contributed by the prior and each of the tests to get one final estimate for Q(s). Assuming we started off with 1:20, H4 contributes 2:1, T4 contributes 1:2, and H2 contributes 4:1, we end up with 1*4*1*2*2:20*1*2*1*1=16:40=2:5, and so P(Q(s))=2/7.
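The arithmetic can be checked with a short sketch. The prior odds and per-observation ratios are the illustrative figures from the paragraph above; the function name `posterior_from_odds` is my own, and the code assumes, as stated, that the experiments are independent conditioned on Q(s).

```python
from fractions import Fraction

def posterior_from_odds(prior_odds, likelihood_ratios):
    """Multiply prior odds by each likelihood ratio.

    Returns (posterior odds, posterior probability).
    """
    odds = prior_odds
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds, odds / (1 + odds)

# Illustrative numbers from the example: prior odds 1:20 and the
# ratio contributed by each kind of reading.
prior_odds = Fraction(1, 20)
ratio_for = {
    "H2": Fraction(4, 1),
    "T4": Fraction(1, 2),
    "H4": Fraction(2, 1),
}
observations = ["H2", "T4", "H4", "H4"]

odds, probability = posterior_from_odds(
    prior_odds, [ratio_for[obs] for obs in observations]
)

print(odds)         # 2/5
print(probability)  # 2/7
```

Working in odds keeps each update a single multiplication; the conversion back to a probability, odds/(1+odds), happens once at the end.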