What math is essential to the art of rationality?
I have started to put together a sort of curriculum for learning the subjects that lend themselves to rationality. It includes things like experimental methodology and cognitive psychology (obviously), along with “support disciplines” like computer science and economics. I think (though maybe I’m wrong) that mathematics is one of the most important things to understand.
Eliezer said in The Simple Math of Everything:
It seems to me that there’s a substantial advantage in knowing the drop-dead basic fundamental embarrassingly simple mathematics in as many different subjects as you can manage. Not, necessarily, the high-falutin’ complicated damn math that appears in the latest journal articles. Not unless you plan to become a professional in the field. But for people who can read calculus, and sometimes just plain algebra, the drop-dead basic mathematics of a field may not take that long to learn. And it’s likely to change your outlook on life more than the math-free popularizations or the highly technical math.
I want to have access to outlook-changing insights. So, what math do I need to know? What are the generally applicable mathematical principles that are most worth learning? The above quote seems to indicate at least calculus, and everyone is a fan of Bayesian statistics (which I know little about).
Secondarily, what are some of the most important examples of that “drop-dead basic fundamental embarrassingly simple mathematics” from different fields? What fields are mathematically based, other than physics, evolutionary biology, and economics?
What is the most important math for an educated person to be familiar with?
As someone who took an honors calculus class in high school, liked it, and did alright in the class, but who has probably forgotten most of it by now and needs to relearn it, how should I go about learning that math?
I would look into very basic texts (doesn’t even have to be a full book) on what a proof is and how proofs work, e.g.:
http://math.berkeley.edu/~hutching/teach/proofs.pdf
I would also learn enough causal inference stuff to recognize when the {press|people on the internet} are talking out of their asses about an empirical result. Usually this is of the form [policy prescription based on observational data], e.g. “scientists find wine is correlated with life expectancy, so drink wine to live longer!” People, even otherwise very smart people, get this wrong surprisingly often.
But I would say that!
If you know a bit of math that comes up often, you can use it as a sanity check for how careful people are about things you may not know about. That is, if they screw up what you know, that means they probably screw up other stuff.
“Insight porn” is not how real intellectual growth happens, at least in my experience. New insight feels nice, but lasting behavioral change isn’t sudden. In our society, learning new stuff generally outpaces the ability to act on that knowledge consistently.
With regards to insight porn, I was actually a bit surprised to see EY say “change your outlook on life”, which seems very strong. (He did say, “more than” the alternatives, so perhaps it’s a bit uncharitable to critique that.)
Acknowledging that it’s not a substitute for real understanding, I like insight. There’s no reason why I can’t have both.
Also, I’m not sure it is always true that cheap, quick insights aren’t the way intellectual growth happens. There have been many little realizations (and even just exposures to new ideas or topics) that, taken together, made for a more intellectually competent me. Sure, it’s harder to “act on that knowledge in our society” (that takes self-discipline), but I consider that separate from “intellectual growth.”
I guess I don’t view “intellectual growth” separately from “personal growth” (perhaps I should?). And I view personal growth as a kind of chemical reaction, where the ingredient in shortest supply limits how far the reaction goes. In (modern, Western, internet-enabled) society, intellectual insight/knowledge is usually not the limiting ingredient. The limiting ingredient is generally the motivation to get work done. Without it, the standard failure mode for “too much insight” is online wankery, basically.
You gotta be kidding. You don’t need to learn how to make tools when you only need to use them.
How do you know what a tool is?
Do you know how to use a hammer? Do you know how to make it? Does not knowing how to make it prevent you from using it effectively?
I hope not, because then there must not be even a single tool I know how to use.
http://www.econlib.org/library/Essays/rdPncl1.html
I am not sure about the prerequisites you need for “rationality” but take a look at the following courses:
(1) Schaum’s Outline of Probability, Random Variables, and Random Processes
(2) Udacity’s Intro to Artificial Intelligence
(3) Udacity’s Machine Learning: Supervised Learning
My suggestion is to use khanacademy.org in the following order: Precalculus->Differential calculus->Integral calculus->Linear Algebra->Multivariable calculus->Differential equations->Probability->Statistics.
If you prefer books:
Free precalculus book
The Calculus Lifesaver
A First Course in Linear Algebra (is free and also teaches proof techniques)
Calculus On Manifolds: A Modern Approach To Classical Theorems Of Advanced Calculus
Ordinary Differential Equations (Dover Books on Mathematics)
Schaum’s Outline of Probability, Random Variables, and Random Processes
Discovering Statistics Using R
Statistics comes last; here is why. Take, for example, the derivation of the least-squares regression line: to follow it you need at least partial derivatives and systems of equations.
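As a concrete illustration (a sketch of the standard simple-linear-regression derivation, not anything specific to the books above): to fit $y = a + bx$, minimize the squared error $S(a,b) = \sum_i (y_i - a - b x_i)^2$ by setting both partial derivatives to zero,

$$\frac{\partial S}{\partial a} = -2\sum_i (y_i - a - b x_i) = 0, \qquad \frac{\partial S}{\partial b} = -2\sum_i x_i (y_i - a - b x_i) = 0,$$

which is a two-equation linear system with solution

$$b = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{\sum_i x_i^2 - n\bar{x}^2}, \qquad a = \bar{y} - b\bar{x}.$$

Every step is calculus or algebra; the statistics only enters when interpreting the result.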
(Note: Books 4-7 are based on my personal research on what to read. I haven’t personally read those particular books yet. But they are praised a lot and relatively cheap and concise.)
Jaynes’ draft of “Probability Theory: The Logic of Science”. http://www-biba.inrialpes.fr/Jaynes/prob.html
Bretthorst’s slightly edited version. http://thiqaruni.org/mathpdf9/(86).pdf
EDIT: If anyone knows how to fix that link, please ping me with a solution.
It’s not just that the link is broken; the file doesn’t exist anymore. According to comments here, Jaynes’ publishers took the book down. For your edification, one would normally escape parentheses with “URL encoding” by replacing the opening parenthesis with %28 and the closing parenthesis with %29.
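If you’d rather not do that encoding by hand, a minimal sketch in Python (using the standard library’s urllib.parse; the filename is just the one from this thread):

```python
from urllib.parse import quote

# quote() percent-encodes characters outside the URL-safe set;
# "(" becomes %28 and ")" becomes %29.
path = "(86).pdf"
print(quote(path))  # -> %2886%29.pdf
```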
http://thiqaruni.org/mathpdf9/%2886%29.pdf
That seems rather tedious for blog comments.
Why doesn’t \ work as an escape character here?
This section in the FAQ about escapes and links just doesn’t seem to work as advertised: http://wiki.lesswrong.com/wiki/FAQ#How_do_I_make_a_comment.3F
http://thiqaruni.org/mathpdf9/\(86\).pdf
http://thiqaruni.org/mathpdf9/(86\).pdf
[http://thiqaruni.org/mathpdf9/(86\).pdf] ( http://thiqaruni.org/mathpdf9/(86\).pdf)
Maybe if I put some text in between? The first one below is the same as the previous one above, where the text and url have a carriage return between them. The second removes the carriage return so it is all on one line. The third removes the backslash in the url portion.
[http://thiqaruni.org/mathpdf9/(86\).pdf] ( http://thiqaruni.org/mathpdf9/(86\).pdf)
http://thiqaruni.org/mathpdf9/(86).pdf
http://thiqaruni.org/mathpdf9/(86).pdf
Yes, that does seem to work for me. I think you must be typing it wrong somehow. Or it’s some crazy bug.
Edit: I see, the problem is when it looks for links to automatically promote them if you don’t use Markdown syntax. I guess escaping the parenthesis doesn’t work there.
Should look like this:
http://thiqaruni.org/mathpdf9/(86).pdf
You don’t actually have to escape the close parenthesis in the “link text” part of the Markdown (in square brackets), only in the URL part (in parentheses).
http://thiqaruni.org/mathpdf9/(86\).pdf
Thank you. Having a ready-made “course sequence” that I can then adapt is really helpful.
I appreciate having Khan Academy for looking up math concepts on which I need a refresher, but I’ve heard (or maybe just assumed?) that the higher-level teaching was a bit mediocre. You disagree? I’m fully prepared to update on the estimates of people here.
What’s the value of taking classes in math vs. teaching myself (or maybe teaching myself with the occasional help of a tutor)?
Comparing Khan Academy’s linear algebra course to the free book that I recommended, I believe that Khan Academy will be more difficult to understand if you don’t already have some background knowledge of linear algebra. This is not true for the calculus course though. Comparing both calculus and linear algebra to the books I recommend, I believe that Khan Academy only provides a rough sketch of the topics with much less rigor than can be found in books.
Regarding the quality of Khan Academy: I believe it varies between excellent and mediocre. But I haven’t read enough rigorous material to judge this confidently.
The advantage of Khan Academy is that you get a quick and useful overview. There are books that are also concise and provide an overview, often in the form of so-called lecture notes. But they are incredibly difficult to understand (they assume a lot of prerequisites).
As a more rigorous alternative to Khan Academy try coursera.org.
I’ve never attended a class or gotten the help of a tutor. I think you can do just fine without one if you use Google and test your knowledge by buying books of solved problems. There are a lot of such books:
Probability Problems and Solutions
Fifty Challenging Problems in Probability with Solutions
The Humongous Book of Statistics Problems
Probability and Statistics with Applications : A Problem Solving Text
102 Combinatorial Problems
Schaum’s 3,000 Solved Problems in Calculus
3,000 Solved Problems in Linear Algebra
Challenging Problems in Geometry
The Humongous Book of Trigonometry Problems
Some massive open online courses now offer personal tutors if you pay a monthly fee. udacity.com is one example here.
I also want to add the following recommendations to my original sequence, since you specifically asked about Bayesian statistics:
Bayes’ Rule: A Tutorial Introduction to Bayesian Analysis
Doing Bayesian Data Analysis: A Tutorial with R and BUGS (new version will be released in November)
Thanks!
Frankly, I think moderate statistical literacy (being able to reasonably evaluate statistics and charts you see in the news, or know your approximate risk of facing a given common medical or criminal problem) and the ability to correctly apply arithmetic to your budget give overwhelmingly more rationality-for-effort than any other type of math.
Having the math to manage a budget and not be bamboozled by media or advertising radically improves your life.
After that I think having an intuitive sense of the way statistical quantities tend to result from highly confounded factors can really give deeper insights into all sorts of economic and sociological results. Statistical methods of clustering or factor analysis are a concrete way to look at this.
I would definitely recommend learning the basics of algorithms, computational feasibility (P vs. NP), and even computability (the halting problem, Gödel’s incompleteness theorems, etc.). They will change your worldview significantly.
CLRS is a good entry point. After that, perhaps Sipser for some more depth.
Seconded. P versus NP is the most important piece of the basic math of computer science, and a basic notion of algorithms is a bonus. The related broader theory which nonetheless still counts as basic math is algorithmic complexity and the notion of computability.
Jaynes’ interpretation of probability theory as an extension of logic, which calculates the probability of propositions conditioned on other propositions, is much more straightforward than the usual set-theory formulation, and the notation he uses is enormously helpful.
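For readers who haven’t seen it, a sketch of the notation in question (the idea is Jaynes’ own; the exact symbols here are my choice): every probability is a proposition conditioned on other propositions, and everything is built from the product and sum rules,

$$P(AB \mid C) = P(A \mid BC)\,P(B \mid C), \qquad P(A \mid C) + P(\bar{A} \mid C) = 1,$$

from which Bayes’ theorem falls out by applying the product rule in both orders:

$$P(A \mid BC) = \frac{P(B \mid AC)\,P(A \mid C)}{P(B \mid C)}.$$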
Alfred Korzybski, of “the map is not the territory” fame, had some chapters on the concepts of differential calculus in Science and Sanity, which, way back when, I found provided key insights (as all the General Semantics literature does, and it should get more attention here).
Judea Pearl is the main game in town for a clear notation for the mathematical analysis of causality.
David Wolpert’s theoretical framework for analyzing generalization algorithms (in which he produced Stacked Generalization) is a little more obscure, but very useful for learning theory.
My document of life-lessons spits out this (it has a focus on teaching children, but it aims high):
The idea is to see the patterns behind the patterns (link in Einstein’s Speed).
This is really good and impressive. Do you have such a list for statistics?
My main aha moment in statistics occurred when I encountered the Lebesgue integral. Integrals suddenly generalized a lot. Lebesgue integration also allows a lot more nifty but intuitive integral transformations. And of course it is needed for dealing cleanly with probability densities.
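To gesture at that generalization (a sketch, with my notation rather than the commenter’s): with the Lebesgue integral, expectations over discrete and continuous distributions become the same object, an integral against a probability measure $P$,

$$E[f(X)] = \int f \, dP = \begin{cases} \sum_x f(x)\,p(x) & \text{(counting measure: the discrete case)} \\ \int f(x)\,p(x)\,dx & \text{(Lebesgue measure with density } p\text{),} \end{cases}$$

and a density is itself just a Radon-Nikodym derivative of one measure with respect to another, which is what makes probability densities clean to work with.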
Causal networks, despite needing tricky rules, follow from the other points on my list (trees and probability measures).
The two examples you give (Bayesian statistics and calculus) are very good ones; I would definitely recommend becoming familiar with these. I am not sure how much is covered by the ‘calculus’ label, but I would recommend trying to understand on a gut level what a differential equation means (this is simpler than it might sound; solving them, on the other hand, is hard and often tedious). I believe vector calculus (linear algebra) and the combination with differential equations (linear ODEs of dimension at least two) are also covered by ‘calculus’? Again, the ability to solve them isn’t that important in most fields (in my limited experience), but grasping what exactly is happening is very valuable.
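As one picture of what that gut-level meaning might look like computationally (a minimal sketch; the equation, step count, and numbers are my own illustrative choices):

```python
import math

# Gut-level reading of dy/dt = k*y: the current value sets the
# current rate of change. Follow that rate in small steps
# (forward Euler) and exponential growth falls out.
k, y = 0.5, 1.0
steps, t_end = 200, 2.0
dt = t_end / steps
for _ in range(steps):          # integrate from t=0 to t=2
    y += k * y * dt             # step along the local slope
print(y, math.exp(k * t_end))   # Euler estimate vs. the exact e^(kt)
```

No solving technique is needed to see what the equation says: the loop is the meaning, and the closed-form solution is just its limit.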
If you are wholly unfamiliar with statistics then I would also advise looking into frequentist statistics after having studied Bayesian statistics—frequentist tools provide very accurate and easily computable approximations to Bayesian inference, and being able to recognise/use these is useful in most sciences (from social science all the way to theoretical physics).
I would advise looking into frequentist statistics before studying Bayesian statistics. Inference done under Bayesian statistics is curiously silent about anything besides the posterior probability, including whether the model makes sense for the data, whether the knowledge gained about the model is likely to match reality, etc. Frequentist concepts like consistency, coverage probability, ancillarity, model checking, etc., don’t just apply to frequentist estimation; they can be used to assess and justify Bayesian procedures.
If anything, Bayesian statistics should just be treated as a factory that churns out estimation procedures. By a corollary of the complete class theorem, this is also the only way you can get good estimation procedures.
ETA: Can I get comments in addition to (or instead of) down votes here? This is a topic I don’t want to be mistaken about, so please tell me if I’m getting something wrong. Or rather if my comment is coming across as “boo Bayes”, which calls out for punishment.
Actually, if you have the necessary math background, it will probably be useful to start by looking at why and how the frequentists and the Bayesians differ.
Some good starting points, in addition to Bayes, are Fisher information and Neyman-Pearson hypothesis testing. This paper by Gelman and Shalizi could be interesting as well.
Thanks for pointing out the Gelman and Shalizi paper. Just skimmed it so far, but it looks like it really captures the zeitgeist of what reasonably thoughtful statisticians think of the framework they’re in the business of developing and using.
Plus, their final footnote, describing their misgivings about elevating Bayesianism beyond a tool in the hypothetico-deductive toolbox, is great:
I’m afraid I don’t understand. (Theoretical) Bayesian statistics is the study of probability flows under minimal assumptions—any quantity that behaves like we want a probability to behave can be described by Bayesian statistics. Therefore learning this general framework is useful when later looking at applications and most notably approximations. For what reasons do you suggest studying the approximation algorithms before studying the underlying framework?
Also, you mention ‘Bayesian procedures’; I would like to clarify that I wasn’t referring to any particular Bayesian algorithm but to the complete study of (uncomputable) ideal Bayesian statistics.
But nobody, least of all Bayesian statistical practitioners, does this. They encounter data, get familiar with it, pick/invent a model, pick/invent a prior, run (possibly approximate) inference of the model against the data, verify if inference is doing something reasonable, and jump back to an earlier step and change something if it doesn’t. After however long this takes (if they don’t give up), they might make some decision based on the (possibly approximate) posterior distribution they end up with. This decision might involve taking some actions in the wider world and/or writing a paper.
This is essentially the same workflow a frequentist statistician would use, and it’s only reasonable that a lot of the ideas that work in one of these settings would be useful, if not obvious or well-motivated, in the other.
I know that philosophical underpinnings and underlying frameworks matter, but to quote from a recent review article by Reid and Cox (2014):
Well, obviously. The same goes for physicists: nobody (other than some highly specialised teams working at particle accelerators) uses the standard model to compute the predictions of their models. Or for computer science—most computer scientists don’t write code at the binary level, or explicitly give commands to individual transistors. Or chemists—just how many of the reaction equations do you think are being checked by solving the quantum mechanics? But just because the underlying theory doesn’t give as good a results-vs.-time trade-off as some simplified model does not mean that the underlying theory can be ignored altogether (in my particular examples above, note that the respective researchers do study the fundamentals, but then hardly ever need to apply them!). By studying the underlying (often mathematically elegant) theory first, one can later look at the messy real-world examples through the lens of this theory, and see how the tricks used in practice mostly make use of, but often partly disagree with, the overarching theory. This is why studying theoretical Bayesian statistics is a good investment of time—after this, all other parts of statistics become more accessible and intuitive, as the specific methods can be fitted into the overarching theory.
Of course if you actually want to apply statistical methods to a real-world problem I think that the frequentist toolbox is one of the best options available (in terms of results vs. effort). But it becomes easier to understand these algorithms (where they make which assumptions, where they use shortcuts/substitutions to approximate for the sake of computation, exactly where, how and why they might fail etc.) if you become familiar with the minimal consistent framework for statistics, which to the best of my knowledge is Bayesian statistics.
Have you seen the series of blog posts by Robins and Wasserman that starts here? In problems like the one discussed there (such as the high-dimensional ones that are commonly seen these days), Bayesian procedures, and more broadly any procedures that satisfy the likelihood principle, just don’t work. The procedures that do work, according to frequentist criteria, do not arise from the likelihood so it’s hard to see how they could be approximations to a Bayesian solution.
You can also see this situation in the (frequentist) classic Theory of Point Estimation written by Lehmann and Casella. The text has four central chapters: “Unbiasedness”, “Equivariance”, “Average Risk Optimality”, and “Minimaxity and Admissibility”. Each of these introduces a principle for the design of estimators and then shows where this principle leads. “Average Risk Optimality” leads to Bayesian inference, but also Bayes-lite methods like empirical Bayes. But each of the other three chapters leads to its own theory, with its own collection of methods that are optimal under that theory. Bayesian statistics is an important and substantial part of the story told in that book, but it’s not the whole story. Said differently, Bayesian statistics may be a framework for Bayesian procedures and a useful way of analyzing non-Bayesian statistics, but it is not the framework for all of statistics.
That’s an interesting example, thanks for linking it. I read it carefully, along with some of the Robins/Ritov CODA paper:
http://www.biostat.harvard.edu/robins/coda.pdf
and I think I get it. The example is phrased in the language of sampling/missing data, but for those in the audience familiar w/ Pearl, we can rephrase it as a causal inference problem. After all, causal inference is just another type of missing data problem.
We have a treatment A (a drug), and an outcome Y (death). Doctors assign A to some patients, but not others, based on their baseline covariates C. Then some patients die. The resulting data is an observational study, and we want to infer from it the effect of drug on survival, which we can obtain from p(Y | do(A=yes)).
We know in this case that p(Y | do(A=yes)) = sum_C p(Y | A=yes, C) p(C) (this is just what “adjusting for confounders” means).
If we then had a parametric model for E[Y | A=yes,C], we could just fit that model and average (this is “likelihood based inference.”) Larry and Jamie are worried about the (admittedly adversarial) situation where maybe the relationship between Y and A and C is really complicated, and any specific parametric model we might conceivably use will be wrong, while non-parametric methods may have issues due to the curse of dimensionality in moderate samples. But of course the way we specified the problem, we know p(A | C) exactly, because doctors told us the rule by which they assign treatments.
Something like the Horvitz/Thompson estimator which uses this (correct) model only, or other estimators which address issues with the H/T estimator by also using the conditional model for Y, may have better behavior in such settings. But importantly, these methods are exploiting a part of the model we technically do not need (p(A | C) does not appear in the above “adjustment for confounders” expression anywhere), because in this particular setting it happens to be specified exactly, while the parts of the models we do technically need for likelihood based inference to work are really complicated and hard to get right at moderate samples.
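For readers following along, here is a sketch of the kind of estimator being discussed (the standard inverse-probability-weighting form of the Horvitz/Thompson estimator; the notation is mine, not the commenter’s). Given the known assignment probabilities $p(A = \text{yes} \mid C)$, the mean outcome under treatment is estimated as

$$\hat{E}[Y \mid do(A{=}\text{yes})] = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathbb{1}[A_i = \text{yes}]\; Y_i}{p(A_i = \text{yes} \mid C_i)},$$

which is unbiased because the weights exactly undo the doctors’ assignment rule, and which never requires modeling the complicated relationship E[Y | A, C] at all.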
But these kinds of estimators are not Bayesian. Of course arguably this entire setting is one Bayesians don’t worry about (but maybe they should? These settings do come up).
The CODA paper apparently stimulated some subsequent Bayesian activity, e.g.:
http://www.is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/techreport2007_6326%5B0%5D.pdf
So, things are working as intended :).
You’re welcome for the link, and it’s more than repaid by your causal inference restatement of the Robins-Ritov problem.
Yeah, I think this is the heart of the confusion. When you encounter a problem, you can turn the Bayesian crank and it will always do the Right thing, but it won’t always do the right thing. What I find disconcerting (as a Bayesian drifting towards frequentism) is that it’s not obvious how to assess the adequacy of a Bayesian analysis from within the Bayesian framework. In principle, you can do this mindlessly by marginalizing over all the model classes that might apply, maybe? But in practice, a single model class usually gets picked by non-Bayesian criteria like “does the posterior depend on the data in the right way?” or “does the posterior capture the ‘true model’ from simulated data?”. Or a Bayesian may (rightly or wrongly) decide that a Bayesian analysis is not appropriate in that setting.
As I’ve mentioned several times above, Bayesian statistics is not just a set of estimators to be used on problems; it is the minimal framework of probability that satisfies Cox’s theorem. This means that any algorithm that isn’t even approximately Bayesian will spit out something other than (an approximation of) the posterior probability. In other words, in order to get any sort of answer that can reasonably be used for further computation there has to be a Bayesian explanation; otherwise what your algorithm is doing just doesn’t have anything to do with statistics. This does not mean that the only useful algorithms are those crafted by trying to compute the likelihood ratio, nor does it mean that there is always a simple algorithm that would be classified as a ‘Bayesian algorithm’. It merely means that to do probability you have to do Bayes, and then maybe some more.
Did you actually read and understand the linked example? The entire point of it is that unless you basically craft your prior to mirror the frequentist behavior, your posterior will center on the truth super slowly. And the setting is not very artificial, exposure/outcome relationships w/ baseline covariates often are complicated, and we often do know randomization probabilities in trials.
Why would I want to approximate your posterior if it has this shitty behavior?
Can you elaborate on this? I don’t think that’s how most people understand Bayesian statistics.
I will give it a shot (I recall reading a well-written explanation elsewhere on LW, and I don’t expect to be as clear as what I read there).
In any estimation or prediction setting we are interested in making accurate probabilistic claims about the behaviour of our system of study. In particular we would like to give a description of how the system will behave in the future (for example: ‘this drug cures patients 30% of the time’). This is captured by the posterior probability distribution.
Now if we have any algorithm whatsoever that makes statements about our situation, and reliably gives the right answers (for example: most frequentist algorithms), then that algorithm must do something to convert data into predictions, i.e. it must spit out something that can be rewritten into an (ideal) posterior distribution (which captures all our knowledge and therefore predictions).
Now Bayes’ Law gives a formula for this posterior distribution: having observed data D, the posterior probability of any hypothesis A is equal to P(A|D) = P(D|A)P(A)/P(D) (here P(D) is often written as sum(P(D|B)P(B)), where B runs over a complete set of mutually exclusive hypotheses). In order to give any accurate results whatsoever, an algorithm has to produce results that can be interpreted as the quantity above, P(A|D). This is starting to look Bayesian.
But now one can raise the (valid) point that even though we might want to produce the distribution above, we don’t have to explicitly use the equation above to determine it—and indeed this is what gives birth to the wide range of statistical algorithms (this, along with the fact that the formula above is impossible to compute—there are simply too many hypotheses. So we have to make approximations anyway, and different approximations lead to different algorithms). There are two important observations about this point:
1) There are many, many theoretically possible algorithms (by which I mean computer programs, not just those limited to statistics). If you were to create an algorithm at random, it would most likely not produce (approximations of) posterior distributions in a reliable and efficient way. So to produce any statistical algorithm at all you have to somehow make use of the mathematics of probabilities, and a good starting point is to use Bayes’ Theorem. Unless you know, or can come up with, a different formula for the l.h.s. (or at least a good new approximation), you have no choice but to approximate the r.h.s. (note that most of the time P(D|B) is known, so several of the terms on the r.h.s. are already known). This is part of the reason why many statistical algorithms can easily be interpreted and understood with Bayes’ Theorem—quite a few of them are derived from Bayes’ Theorem (for example, for the classical testing of a hypothesis we assume that we initially have a set of hypotheses that all have exactly the same probability, and furthermore assume that some hypothesis neatly fits the data (P(D|B) ~ 1 for some B). Then if for our favourite hypothesis H0 we find a small p-value P(D|H0), i.e. the model did not predict the data, we find a posterior of P(H0|D) = P(D|H0) / sum(P(D|B)), which is also very small (the denominator is order 1, the numerator is small). Viewed from a Bayesian framework it is clear that it will therefore often suffice to compute P(D|H) rather than go for P(H|D), unless the data really is inexplicable by our hypotheses (P(D|B) is small for all B we consider) or our hypotheses do not have equal (or almost equal) initial likelihood (for example, the theory of general relativity is supported by quite a bit of data, so if a single experiment were to disagree with the theory, even with an intimidatingly low p-value, the posterior probability of the theory being right might still be large). By considering our algorithm with Bayes’ Law in mind at all times we not only understand why the algorithm (p-values) works but also in exactly which cases it breaks down, and are even presented with a way to improve the algorithm in those cases).
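Pulling the p-value argument from that parenthetical into symbols (the same assumptions as the comment: mutually exclusive hypotheses $B$ with equal priors, so the priors cancel):

$$P(H_0 \mid D) = \frac{P(D \mid H_0)\,P(H_0)}{\sum_B P(D \mid B)\,P(B)} = \frac{P(D \mid H_0)}{\sum_B P(D \mid B)},$$

so whenever the denominator is of order 1 (some hypothesis fits the data well), a small $P(D \mid H_0)$ directly forces a small posterior for $H_0$; the two ways the approximation fails are exactly the two caveats listed above: no hypothesis explains the data, or the priors are far from equal.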
2) The more important reason that the formula above is important is that it is true, even if you decide to use a radically different way of summarising your information to make predictions. If an ideal Bayesian reasoner shows up and computes the posterior probability, and then proceeds to make predictions, then most of the time your algorithm and the reasoner are going to agree (since most of the time you both make valid predictions). So there has to be some part of Bayes’ law that reflects what your algorithm does—if your algorithm works reliably, and Bayes’ law gives a mathematical formula for what you should predict, then for some unknown reason Bayes’ law can describe your algorithm (that is to say, if your algorithm predicts ‘Hypothesis H is likely/true, now that we have seen data D’ then there must be some part of the formula P(H|D) that makes it large, otherwise you two would not agree!). This is the most valuable insight into Bayesian statistics that I know—if anybody computes anything at all that acts like a prediction/probability and reliably gets correct results then in Bayes’ theorem the terms must happen to line up just right to give similar predictions, so we can describe the algorithm (and maybe even improve it) by making precise which terms act how. By looking at exactly what predictions a statistical algorithm will make under which conditions we can find out how the unknown terms on the r.h.s. of Bayes’ Theorem should act to produce similar results (the predictions from our non-Bayesian algorithm were correct most of the time, so there has to be some behaviour of these terms that will let us mimic the results), which explains the overt/hidden assumptions and thereby limitations of the model, provides understanding and more insight (I hope that my example above of p-values will provide an example of this insight—I certainly learned more about why people weren’t concerned about 5-sigma disagreements with GR by studying Bayes) and sometimes even improvements. This is what I was trying to say (but I was lazy so I tried to use fewer words. Clearly this backfired) earlier, when I said things like “look at the messy real-world examples through the lens of [Bayesian statistics] and see how the tricks that are used in practice are mostly making use of but often partly disagree with the overarching theory”.
Lastly, I would like to remark that in this post so far there has been no mention whatsoever of computational efficiency, and it should not be surprising that actual Bayesian algorithms are therefore often slower (or even uncomputable) when compared to other options (like I said above, there are many possible algorithms; there is no reason why the correct one and the fast one should be the same). Therefore for real-world applications a non-Bayesian algorithm is often your best bet. But to study and understand probabilities and general algorithms, the Bayesian approach is superior—all because Bayes’ Theorem is a theorem.
Finally, I would like to remark that people (including me, above) often write as if Bayesian statistics and frequentist statistics are in active conflict, which is mostly but not fully correct. Frequentist tests (and other tests; I tend to forget that there are more than two schools of statistics) are often superior in practice, and should most definitely be used in practice. But given an algorithm, some peculiar black box that gives the right answers, or given a philosophical conundrum about probabilities, the correct thing to do is write down Bayes’ Theorem and figure out how the unknowns should act to give the results we are faced with, i.e. figure out how an ideal Bayesian reasoner could come to the same conclusion. And it really is a shame that this doesn’t always lead to quick algorithms in practice.
I hope this has cleared up some misunderstandings.
Thanks for the write-up. I read it as you arguing that most any prediction can be interpreted in the Bayesian framework, which, I think, is a weaker claim.
However, there are issues with treating it as the only right way, for it leaves a number of important questions unanswered. For example, how do you pick the prior? How do you assemble your set of possible outcomes (= hypotheses)? What happens if your forecast influences the result?
I also think that being uncomputable is a bigger deal than you make it to be.
I think the claim that any prediction can be interpreted in this minimal and consistent framework, without exceptions whatsoever, is already a rather strong claim; I don’t think I want to claim much more than that (although I do want to add that if we have such a unique framework that is both minimal and complete when it comes to making predictions, then that seems like a very natural choice for Statistics with a capital S).
I don’t think we’re going to agree about the importance of computability without more context. I agree that every time I try to build myself a nice Bayesian algorithm I run into the problem of uncomputability, but personally I consider Bayesian statistics to be more of a method of evaluating algorithms than a method for creating them (although Bayesian statistics is by no means limited to this!).
As for your other questions: it is important to note that your issues are issues with Bayesian statistics as much as they are issues with any other form of prediction-making. To pick a frequentist algorithm is to pick a prior with a set of hypotheses, i.e. to make Bayes’ Theorem computable and provide the unknowns on the r.h.s. above (as mentioned earlier, you can in theory extract the prior and set of hypotheses from an algorithm by considering which outcome your algorithm would give when it saw a certain set of data, and then inverting Bayes’ Theorem to find the unknowns. At least, I think this is possible; it has worked so far). And indeed picking the prior and set of hypotheses is not an easy task—this is precisely what leads to different competing algorithms in the field of statistics.
Okay, this is the last thing I’ll say here until/unless you engage with the Robins and Wasserman post that IlyaShpitser and I have been suggesting you look at. You can indeed pick a prior and hypotheses (and I guess a way to go from posterior to point estimation, e.g., MAP, posterior mean, etc.) so that your Bayesian procedure does the same thing as your non-Bayesian procedure for any realization of the data. The problem is that in the Robins-Ritov example, your prior may need to depend on the data to do this! Mechanically, this is no problem; philosophically, you’re updating on the data twice and it’s hard to argue that doing this is unproblematic. In other situations, you may need to do other unsavory things with your prior. If the non-Bayesian procedure that works well looks like a Bayesian procedure that makes insane assumptions, why should we look to Bayesian as a foundation for statistics?
(I may be willing to bite the bullet of poor frequentist performance in some cases for philosophical purity, but I damn well want to make sure I understand what I’m giving up. It is supremely dishonest to pretend there’s no trade-off present in this situation. And a Bayes-first education doesn’t even give you the concepts to see what you gain and what you lose by being a Bayesian.)
Bayes’ rule by itself is not a framework. It’s just a particular statistical operation, useful no doubt, but hardly rising to the level of a framework.
The claim that you can interpret any prediction as forecasting a particular probability distribution has nothing to do with Bayes. For example, let’s say that an analyst predicts the average growth in the GDP of China for the next five years to be 5%. If we dig and poke we can re-express this as a forecast of something like a normal distribution centered at 5% and with some width which corresponds to the expected error—so there is your forecast probability distribution. But is there a particular prior here? Any specific pieces of evidence on which the analyst updated the prior? Um, not really.
An understanding of the insights behind math is essential. But I wonder: To improve your rationality, how often do you really solve an equation, arithmetically (as opposed to just going by feel) calculate probabilities from Bayes Rule, or derive a formal proof?
The concrete practice is an indispensable way of arriving at the insight. (“No royal road to geometry” etc.)
Achieving facility with the concrete work is evidence that you have the insight. Evidence to yourself, the one person you need to prove it to.
To be avoided is gaining a mere feeling of understandishness. Anyone can learn to say “light travels along geodesics in curved space”, but if you can’t calculate the precession of Mercury, you don’t know general relativity.
Yes, concrete practice may be indispensable to the insight. But once you have the insight, do you ever need to calculate to help you with a practical problem? Almost never, I think.
When you know things, you discover uses for them. Knowing arithmetic, you can easily decide whether the supereconomy giant size really is a good deal. Knowing prob/stats/causality, you can dismiss a lot of reporting as junk, and be able to say exactly why. Quadratic equations are often used as an example of useless knowledge, and yet I find myself solving those from time to time, and not just at work (in the narrow sense of what people pay me to do).
Yes, arithmetic does come in useful, for example in those cases.
Can you give an example of when you have used actual arithmetical calculations to explain why some prob/stats/causality were junk, or where you solved a quadratic equation?
It’s not some minor trick, like how to fold a t-shirt, it’s useful everywhere.
It’s common enough that I don’t even notice it as a thing. But for example: a political survey shows a 2% advantage for one party. The sample size is given, and I know at once that the result is noise (sigma = sqrt(pqN)). Knowing how correlation and causality relate to each other disposes of a lot of bad reporting, and some bad research. Or I want to generate random numbers with a certain distribution; that easily leads to pages of algebra and trigonometry.
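A worked version of that survey check (the sample size and shares are my own illustrative numbers):

```python
import math

# Illustrative poll: N respondents, two parties near 50% each.
N, p = 1000, 0.5
q = 1 - p
sigma_count = math.sqrt(p * q * N)  # the comment's sigma = sqrt(pqN), in counts
se_share = sigma_count / N          # same quantity as a vote share: ~1.6 points
se_gap = 2 * se_share               # the gap between the two parties: ~3.2 points
print(se_share, se_gap)
# A reported 2-point lead is well under one standard error of the gap,
# so on these numbers it is indistinguishable from noise.
```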
For a more extensive illustration of how knowing all this stuff enables you to see the world, see gwern’s web site.
Certainly, it is useful everywhere to understand. But very few people actually run calculations (other than basic arithmetic). Gwern and you are very rare exceptions. I think the world could use more of that.
I am greatly flattered to be mentioned in the same breath as Gwern. The world could indeed use a lot more Gwerns.
But it’s like what lionhearted just posted about history: when you know this sort of thing, you see its use. And by seeing its use, you can do things that would not previously have come to your attention as possibilities.
If you get diagnosed with an illness and are given the sensitivity and specificity of the test, being able to calculate your risk is valuable, and many doctors get this wrong.
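A minimal sketch of that calculation (all the numbers are invented for illustration; plug in the real sensitivity, specificity, and base rate):

```python
# Bayes' rule for a diagnostic test, with illustrative numbers.
prevalence = 0.01    # P(disease): the base rate
sensitivity = 0.90   # P(positive | disease)
specificity = 0.91   # P(negative | no disease)

p_positive = (sensitivity * prevalence
              + (1 - specificity) * (1 - prevalence))
p_disease_given_positive = sensitivity * prevalence / p_positive
print(p_disease_given_positive)  # ~0.09: a positive result means ~9%, not 90%
```

The counterintuitive part, and the part reportedly gotten wrong so often, is how strongly the low base rate dominates the test’s accuracy.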
The trouble is that knowing the formula and being able to use it in daily life are two different things. On one of the LW censuses, a significant portion got a question intended to test knowledge of Bayes’ rule wrong.
So other than learning the formula, how do you suggest that we learn to apply it, other than actually applying it over and over until one stops getting problems wrong? It seems that it’s just a function of practice.
I don’t have a good answer to that question.
There are studies of mental biases suggesting that a lot of people who can manage to apply the formula to textbook problems fail to apply it when you give them a political scenario. Most people fail to think clearly, and fall into motivated reasoning, when the question becomes meaningful to them.
As far as I understand, CFAR tries to teach Bayes’ theorem in a way that people will actually use it. However, I neither know their exact curriculum nor the success rate of their approach.
How often have you received this info in practice? How often have you done the calculations?
Would anyone recommend the new book The Joy of x by Steven Strogatz? (I am not an advanced mathematician.)
http://www.amazon.com/The-Joy-Guided-Tour-Infinity/dp/0547517653
Calculating interest.
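For instance (a minimal sketch; the rate, term, and principal are invented for illustration), compounding is exactly the kind of arithmetic that pays to check rather than eyeball:

```python
# Illustrative only: 5% annual rate compounded monthly for 10 years.
principal, annual_rate, years, periods_per_year = 1000.0, 0.05, 10, 12
amount = principal * (1 + annual_rate / periods_per_year) ** (periods_per_year * years)
print(amount)  # ~1647, versus 1500 under simple interest
```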
Predator-prey relationships.
You might want to look into the maths related to various paradoxes: the unexpected hanging, the Raven paradox, the liar paradox, the two-envelopes problem, Zeno’s paradox, the Blue Eyes problem, and game theory like the Prisoner’s Dilemma. Wikipedia, Math Stackexchange, and Less Wrong have much discussion of these paradoxes.
If you want to improve your rationality, it’s not enough to just know the solution. You have to think very carefully about what reasoning made you believe the incorrect answer in the first place and how to adjust your intuitions so that they are correct.
Pretty much every field uses some math except maybe social science, political science, history, languages and literature. I think the most commonly (mis?)used math would belong in the realm of statistics.
Statistics was invented for social science, especially political and demographic studies; hence the name.
Oops. Thanks for catching my blunder in this safe environment :)
I’m not sure what it is about the internet that incentivises talking out of your ass.
And “languages” (I think you mean linguistics) now makes heavy use of applied statistics, especially since corpus linguistics became mainstream. The other issue is that “traditional” linguists usually somewhat lack a statistics background, and thus the methods are creeping in very slowly; there is often a tension between traditional and computational linguists.
Literature (if you mean literary theory) is much slower on the uptake, but even there some people admit that these new ideas about calculating word counts and their distributions can sometimes help.
By languages I had in mind university programs that produce language teachers, translators, and interpreters. I’m not sure whether such programs exist in the US, nor whether they can properly be called a “field.”
If I were to lead a machine to rationality, and if I were limited to one mathematical concept, it would be inequality. This is not that. There is not here. These are not those. A is not not-A. One is not many. All is not none. Some are not every. True is not not-true. Not-inequality is not inequality (i.e., equality). Inequality is perhaps THE mathematical concept.