When I’m reasoning about a practical chemistry problem at work, I’m usually thinking in terms of qualitative mechanisms:
Ethanolamine reacts with epoxides on magnetic beads so that other molecules we introduce later won’t react with the epoxides. This is called “blocking” the epoxides.
But at physiological pH, the secondary amine that ethanolamine introduces is protonated and therefore positively charged. This gives the entire bead surface a substantial positive charge (a quick Henderson-Hasselbalch check after this list makes the magnitude concrete).
The nucleic acids in our solution are negatively charged, so they will form ionic bonds with the secondary amine and may also hydrogen bond with its hydroxyl group.
Etc.
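Here is a minimal sketch of that protonation check, assuming the Henderson-Hasselbalch relation and a ballpark pKa of about 9.5 for an ethanolamine-derived secondary amine; the effective pKa at a crowded bead surface will differ, so treat the numbers as illustrative rather than exact:

```python
# A minimal sketch, assuming the Henderson-Hasselbalch relation and a
# ballpark pKa of ~9.5 for an ethanolamine-derived secondary amine.
# The effective pKa at a crowded bead surface will differ; the point is
# only that physiological pH sits ~2 units below the pKa, so the amine
# is overwhelmingly protonated.

def fraction_protonated(pka: float, ph: float) -> float:
    """Fraction of the amine in the protonated (positively charged) form."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))

print(f"{fraction_protonated(pka=9.5, ph=7.4):.1%}")  # ~99.2%
```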
The bottleneck for this style of inference is having a strong grasp of the system’s components and the most important chemical mechanisms at play. The point is to generate hypotheses you can test in an experiment in order to get better practical performance out of the system. The data you are working with is not homogeneous or cleaned. It’s a diverse set of highly contextual observations, and you’re trying to find a logical chain of mechanisms that could produce the observed results. Finding even one such chain can be a big success.
The examples given in Bayesian statistics or machine learning texts are often about making as much sense as you can out of big data: predicting chemical properties from SMILES strings, for example. These approaches won’t necessarily give you mechanistic insight on their own. Although the individual data points are what we’d typically think of as “observations,” to a human being it’s really the models’ outputs that are the “observations” from which we try to make biological sense.
This, I think, has been the root of one of my psychological struggles in the formal study of Bayesian stats. My textbook presents toy models, such as computing the likelihood of each possible number of white or blue marbles in a bag from draws made one at a time with replacement. For these problems, Bayesian approaches are the natural solution, so it’s natural to think of the individual marble draws as “observations” and the posterior distribution as the “meaningful hypothesis” we are trying to infer. But the problems we encounter daily are not so formalized and pre-cleaned, and we often don’t need Bayesian methods to solve them adequately.
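To make that toy model concrete, here is a minimal sketch of the marble problem. The setup is assumed (a bag of four marbles and a hypothetical draw sequence; the exact numbers are mine, not the textbook’s), but the shape is the one the book teaches: draws are the observations, bag compositions are the hypotheses.

```python
# A minimal sketch of the textbook-style marble problem. Assumed setup
# (mine, not the textbook's exact numbers): a bag of 4 marbles, each blue
# or white, and draws made one at a time with replacement. The hypotheses
# are the five possible bag compositions; the observations are the draws.
observed = ["blue", "white", "blue"]  # hypothetical draw sequence

compositions = range(5)   # hypotheses: 0-4 blue marbles in the bag
prior = [1 / 5] * 5       # flat prior over the five compositions

def likelihood(n_blue: int, draws: list[str]) -> float:
    """Probability of the draw sequence given n_blue blue marbles."""
    p_blue = n_blue / 4
    prob = 1.0
    for draw in draws:
        prob *= p_blue if draw == "blue" else 1.0 - p_blue
    return prob

unnormalized = [prior[n] * likelihood(n, observed) for n in compositions]
posterior = [u / sum(unnormalized) for u in unnormalized]
for n, p in zip(compositions, posterior):
    print(f"{n} blue marbles: posterior = {p:.3f}")
```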
My PhD will be in cancer epigenetics, where we’d like to look at the cancer microenvironment, determine what cell types are present, and trace the lineage of the cancer back to the healthy cells from which it sprang. We’d also like to use epigenetic data to understand the mechanism by which a healthy cell became cancerous. We have a wet-lab workflow that yields an enormous number of single-cell genomes, each annotated with which bases are methylated. Figuring out how to infer cell type and lineage from that information is the challenge. For this sort of problem (deeply unfamiliar, not at all intuitively real-world), the tools of Bayesian statistics are perfect.
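As a hypothetical, stripped-down illustration of the shape of that inference (not our actual pipeline): classify one cell by comparing its methylation calls at a few marker sites against assumed reference methylation probabilities for two made-up cell types, with a flat prior over types. Real single-cell methylation data is sparse and the real models are hierarchical; this only shows the Bayesian skeleton.

```python
# A hypothetical, stripped-down sketch (not our actual pipeline): classify
# one cell by comparing its methylation calls at a few marker sites against
# assumed reference methylation probabilities for two made-up cell types,
# with a flat prior over types.
import math

reference = {                      # assumed P(site methylated | cell type)
    "cell_type_A": [0.9, 0.8, 0.1],
    "cell_type_B": [0.2, 0.3, 0.9],
}
observed_calls = [1, 1, 0]         # 1 = methylated, 0 = not, per marker site

def log_likelihood(site_probs: list[float], calls: list[int]) -> float:
    """Log-probability of the observed calls given a cell type's profile."""
    return sum(math.log(p if c else 1.0 - p)
               for p, c in zip(site_probs, calls))

log_liks = {t: log_likelihood(p, observed_calls) for t, p in reference.items()}
evidence = sum(math.exp(v) for v in log_liks.values())  # flat prior cancels
for cell_type, v in log_liks.items():
    print(f"{cell_type}: posterior = {math.exp(v) / evidence:.3f}")
```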
And so I think that part of making Bayesian statistics intuitive is making the kinds of problems we are trying to solve with Bayesian statistics intuitive. Get in the mental habit of inhabiting a world where you do have access to big, meaningful datasets and can generate potentially groundbreaking insights from them if you have the right statistical tools and know how to use them.
Thinking in terms of qualitative mechanisms is causal reasoning, which humans tend to prefer to statistical reasoning.
Causal knowledge is obtained by making statistical observations (and not necessarily big datasets of them).