That is as concrete as I can make it, unless you want me to write out an algorithm for Gibbs sampling and explain why it produces priors that maximize the posterior. Or give an example where I used it to do so. I can do that: I had a set of about 8 different databases I was using to assign functions to known proteins. I wanted to estimate the reliability of each database, as a probability that its annotation was correct. This set of 8 probabilities was the set of priors I sought. I had a set of about a hundred thousand annotated proteins, and given a set of priors, I could produce the probability of the given set of 100,000 annotations. I used that dataset plus Gibbs sampling to produce those 8 priors. And it worked extraordinarily well.
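A minimal sketch of the kind of sampler described here, under simplifying assumptions of my own (binary annotations, one latent true label per protein, uniform Beta(1,1) priors on each reliability); all names and sizes are illustrative, not the actual code referred to above:

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the setup described above: D databases each emit a
# binary annotation per protein, and an annotation matches the latent
# true label with database-specific reliability p_d.
D, N = 8, 1000
true_p = rng.uniform(0.6, 0.95, size=D)       # hidden reliabilities
truth = rng.integers(0, 2, size=N)            # hidden true labels
flip = rng.random((D, N)) >= true_p[:, None]
ann = np.where(flip, 1 - truth, truth)        # observed D x N annotations

p = np.full(D, 0.7)   # start above 0.5 to pick the identifiable mode
draws = []
for sweep in range(2000):
    # Step 1: resample each latent label given the current reliabilities.
    ll1 = np.where(ann == 1, np.log(p)[:, None], np.log1p(-p)[:, None]).sum(axis=0)
    ll0 = np.where(ann == 0, np.log(p)[:, None], np.log1p(-p)[:, None]).sum(axis=0)
    z = (rng.random(N) < 1.0 / (1.0 + np.exp(ll0 - ll1))).astype(int)
    # Step 2: resample each reliability from its Beta full conditional
    # (uniform Beta(1,1) prior on each p_d).
    agree = (ann == z).sum(axis=1)
    p = rng.beta(1 + agree, 1 + N - agree)
    if sweep >= 500:   # discard burn-in
        draws.append(p.copy())

print("posterior means:", np.round(np.mean(draws, axis=0), 3))
print("ground truth:   ", np.round(true_p, 3))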
Oh man, you’re not doing yourself any favors in trying to shift my understanding of you. Not that I doubt that your algorithm worked well! Let me explain.
You’ve used a multilevel modelling scheme in which the estimands are the eight proportions. In general, in any multilevel model, the parameters at a given level determine the prior probabilities for the variables at the level immediately below. In your specific context, i.e., estimating these proportions, a fully Bayesian multilevel model would also have a prior distribution on those proportions (a so-called “hyperprior”, terrible name).
If you didn’t use one, your algorithm can be viewed as a fully Bayesian analysis that implicitly used a constant prior density for the proportions, and this will indeed work well given enough information in the data. Alternatively, one could view the algorithm as a (randomized) type II maximum likelihood estimator, also known as “empirical Bayes”.
In a fully Bayesian analysis, there will always be a top-level prior that is chosen only on the basis of prior information, not data. Any approach that uses the data to set the prior at the top level is an empirical Bayes approach. (These are definitions, by the way.) When you speak of “estimating the prior probabilities”, you’re taking an empirical Bayes point of view, but you’re not well-informed enough to be aware that “Bayesian” and “empirical Bayes” are not the same thing.
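Schematically, in notation I'm adding for concreteness (here \( \theta \) stands for the eight proportions, \( \eta \) for the hyperparameters; none of this is quoted from the thread):

\[
\textbf{Fully Bayesian:}\quad p(\theta, \eta \mid y) \;\propto\; p(y \mid \theta)\, p(\theta \mid \eta)\, p(\eta), \qquad p(\eta)\ \text{fixed from prior information alone;}
\]
\[
\textbf{Empirical Bayes (type II ML):}\quad \hat{\eta} \;=\; \arg\max_{\eta} \int p(y \mid \theta)\, p(\theta \mid \eta)\, d\theta, \qquad \text{then infer with}\ p(\theta \mid y, \hat{\eta}).
\]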
The kinds of prior distributions with which I was concerned in my posts are those top-level prior distributions that don’t come from data. Now, my pair of posts were terrible—they basically dropped all of the readers into the inferential gap. But smart mathy guy cousin_it was intrigued enough to do his own reading and wrote some follow-up posts, and these serve as an existence proof that it was possible for someone with enough background to understand what I was talking about.
On the other hand, you didn’t know what I was talking about, but you thought you did, and you offered questions and comments that apparently you still believe are relevant to the topic I addressed in my posts. To me, it really does look like—in this context, at least—you are laboring under a “cognitive bias in which unskilled individuals suffer from illusory superiority, mistakenly rating their ability much higher than is accurate”.
So now I’ll review my understanding of you:
Smart? Yes.
Not as smart as you think you are? Yes.
High intelligence is a core part of your self-image? Well, you did find my claim “not as smart as you think you are” irritating enough to respond to; you touted your math degree, teaching experience, and success in data analysis. So: yes.
Posting on LW is often unrewarding for you because of the above three traits? Hmm… well, that has the same answer as this question: have you found our current exchange unrewarding? (Absent further info, I’m assuming the answer is “yes”.)
To claim evidence that I’m overconfident, you have to show me asserting something that is wrong, and then failing to update when you provide evidence that it’s wrong.
In the thread which you referenced, I asked you questions, and the only thing I asserted was that EM and Gibbs sampling find priors which will result in computed posteriors being well-calibrated to the data. You did not provide, and still have not provided, evidence that that statement was wrong. Therefore I did not exhibit a failure to update.
I might be using different terminology than you—by “priors” I meant the values that I’m going to use as priors when my program runs on new data to transfer function annotations, and by “posteriors” I meant the posterior probability it will compute for a given annotation, given those “priors”. I didn’t claim to know what the standard terminology is. The only thing I claimed was that Gibbs sampling & EM did something that, using my terminology, could be described as setting priors so they gave calibrated results.
If you had corrected my terminology, and I’d ignored you, that would have been a failure to update. If you’d explained that I misunderstand Gibbs sampling, that would have been a failure to update. You didn’t.
Relevant to your post? I don’t know. I didn’t assert that that particular fact was relevant to your post. I don’t know if I even read your post. I responded to your comment, “seek a prior that guarantees posterior calibration,” very likely in an attempt to understand your post.
you didn’t know what I was talking about, but you thought you did
Again, what are you talking about? I asked you questions. The only thing I claimed to know was about the subject that I brought up, which was EM and Gibbs sampling.
As far as I can see, I didn’t say anything confidently, I didn’t say anything that was incorrect AFAIK, I didn’t claim you had made a mistake, and I didn’t fail to update on any evidence that something I’d said was wrong. So all these words of yours are not evidence for my overconfidence.
Even now, after writing paragraphs on the subject, you haven’t tried to take anything I claimed and explain why it is wrong!
Try this approach: Look over the comments that you provided as evidence of my overconfidence. Say what I would have written differently if I were not overconfident.
In a fully Bayesian analysis, there will always be a top-level prior that is chosen only on the basis of prior information, not data. Any approach that uses the data to set the prior at the top level is an empirical Bayes approach.
I don’t see how this distinction makes sense for Gibbs sampling or EM. They are iterative procedures that take your initial (top-level) prior, and then converge on a posterior-to-the-data value (which I called the prior, as it is plugged into my operating program as a prior). It doesn’t matter how you choose your initial prior; the algorithm will converge onto the same final result, unless there is some difficulty converging. That’s why these algorithms exist—they spare you from having to choose a prior, if the data is strong enough that the choice makes no difference.
If you’d explained that I misunderstand Gibbs sampling, that would have been a failure to update. You didn’t.
I wrote a comment that was so discordant with your understanding of Gibbs sampling and EM that it should have been a red flag that one or the other of us was misunderstanding something. Instead you put forth a claim stating your understanding, and it fell to me to take note of the discrepancy and ask for clarification. This failure to update is the exact event which prompted me to attach “Dunning-Kruger” to my understanding of you.
I don’t see how this distinction makes sense for Gibbs sampling or EM… That’s why these algorithms exist—they spare you from having to choose a prior, if the data is strong enough that the choice makes no difference.
The way in which the ideas you have about EM and Gibbs sampling are wrong isn’t easily fixable in a comment thread. We could do a Google Hangout at some point; if you’re interested, PM me.
I believe my ideas about Gibbs sampling are correct, as demonstrated by my correct choice and implementation of it to solve a difficult problem. My terminology may be non-standard.
Here is what I believe happened in that referenced exchange: You wrote a comment that was difficult to comprehend, and I didn’t see how it related to my question. I explained why I asked the question, hoping for clarification. That’s a failure to communicate, not a failure to update.
Here is what I believe happened in that referenced exchange: You wrote a comment that was difficult to comprehend, and I didn’t see how it related to my question. I explained why I asked the question, hoping for clarification. That’s a failure to communicate, not a failure to update.
My interpretation, having read this comment thread and then the original: Cyan brought up a subtle point about statistics, explained in a non-obvious way. (This comment seemed about as informative to me as the entire post.) You asked “don’t statistical procedures X and Y solve this problem?”, to which Cyan responded that they weren’t relevant, and then you repeated that they did.
Here, my takeaway would be that Cyan is likely a theory guy, and you’re likely an applications guy. (I got what I think Cyan’s point was on my first read, but it was a slow read and my “not my area of expertise” alarms were sounding.) It is evidence of overconfidence when people don’t know what they don’t know (heck, that might even be a good definition of overconfidence).
Say what I would have written differently if I were not overconfident.
After Cyan’s response that Gibbs and EM weren’t relevant, I would have written something like “If Gibbs and EM aren’t relevant to the ideas of this post, then I don’t think I understand the ideas of this post. Can you try to summarize those as clearly as possible?”
That’s a failure to communicate, not a failure to update.
Okay, fair enough. I’ll give it a shot, and then I’m bowing out.
Let me explain the problem with
That’s why these algorithms exist—they spare you from having to choose a prior, if the data is strong enough that the choice makes no difference.
This is not why these algorithms exist. EM isn’t really an algorithm per se; it’s a recipe for building an optimization algorithm for an objective function with the form given in equation 1.1 of the seminal paper on the topic. Likewise, Gibbs sampling is a recipe for constructing a certain type of Markov chain Monte Carlo algorithm for a given target distribution.
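For concreteness, the two recipes in standard notation (my summary, not a quotation from either source):

\[
\textbf{EM:}\quad Q(\phi \mid \phi^{(t)}) \;=\; \mathbb{E}\!\left[\log f(x \mid \phi) \,\middle|\, y, \phi^{(t)}\right], \qquad \phi^{(t+1)} \;=\; \arg\max_{\phi} Q(\phi \mid \phi^{(t)}),
\]
which climbs the incomplete-data likelihood \( g(y \mid \phi) = \int_{\mathcal{X}(y)} f(x \mid \phi)\, dx \) (the objective referred to above).

\[
\textbf{Gibbs:}\quad x_j^{(t+1)} \;\sim\; \pi\!\left(x_j \,\middle|\, x_1^{(t+1)}, \ldots, x_{j-1}^{(t+1)}, x_{j+1}^{(t)}, \ldots, x_k^{(t)}\right), \quad j = 1, \ldots, k,
\]
a Markov chain whose stationary distribution is the target \( \pi \); nothing requires \( \pi \) to be a posterior.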
If you read the source material I’ve linked, you’ll notice that the EM paper gives many examples in which nothing like what you call a prior (actually a proportion) is present, e.g., sections 4.1.3, 4.6. Something like what you call priors are present in the example of section 4.3, although those models don’t really match the problem you solved. (To see why I brought up empirical Bayes in the context of your problem, read section 4.5.)
You’ll also notice that the Wikipedia article on MCMC does not mention priors in either your sense or my sense at all. That is because such notions only arise in specific applications; a true grokking of MCMC in general and Gibbs sampling in particular does not require the notion of a prior in either sense.
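To make that concrete, here is a minimal sketch of Gibbs sampling applied to a plain target distribution, with no prior or posterior anywhere in sight (a toy example of my own, assuming only NumPy):

import numpy as np

# Gibbs sampling from a standard bivariate normal with correlation rho.
# The target is just a distribution; no prior, no posterior, no data.
rng = np.random.default_rng(0)
rho = 0.8
x, y = 0.0, 0.0
samples = []
for t in range(10000):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # draw x given y
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # draw y given x
    samples.append((x, y))
print(np.corrcoef(np.array(samples)[1000:].T))     # correlation is ~0.8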
You’ve understood how to use the Gibbs sampling technology to solve a problem; that does not mean you understand the key ideas underlying the technology. Your problem was in the space of problems addressed by the technology, but that space is much larger, and the key ideas much more general, than you have as yet appreciated.
Not to be a jerk, but your ideas about Gibbs and EM seem very wrong to me too, for exactly the reasons that Cyan describes below.
Because of that, I was surprised that you said you had used Gibbs in a statistical application with great success. Perhaps you were using a stats package that relied on Gibbs sampling internally, rather than implementing Gibbs sampling yourself?