The Median Researcher Problem
Claim: memeticity in a scientific field is mostly determined, not by the most competent researchers in the field, but instead by roughly-median researchers. We’ll call this the “median researcher problem”.
Prototypical example: imagine a scientific field in which the large majority of practitioners have a very poor understanding of statistics, p-hacking, etc. Then lots of work in that field will be highly memetic despite trash statistics, blatant p-hacking, etc. Sure, the most competent people in the field may recognize the problems, but the median researchers don’t, and in aggregate it’s mostly the median researchers who spread the memes.
(Defending that claim isn’t really the main focus of this post, but a couple pieces of legible evidence which are weakly in favor:
People did in fact try to sound the alarm about poor statistical practices well before the replication crisis, and yet practices did not change, so clearly at least some people did in fact see the problem and were in fact not memetically successful at the time. The claim is more general than just statistics-competence and replication, but at least in the case of the replication crisis it seems like the model must be at least somewhat true.
Again using the replication crisis as an example, you may have noticed the very wide (like, 1 sd or more) average IQ gap between students in most fields which turned out to have terrible replication rates and most fields which turned out to have fine replication rates.
… mostly, though, the reason I believe the claim is from seeing how people in fact interact with research and decide to spread it.)
Two interesting implications of the median researcher problem:
A small research community of unusually smart/competent/well-informed people can relatively-easily outperform a whole field, by having better internal memetic selection pressures.
… and even when that does happen, the broader field will mostly not recognize it; the higher-quality memes within the small community are still not very fit in the broader field.
In particular, LessWrong sure seems like such a community. We have a user base with probably-unusually-high intelligence, community norms which require basically everyone to be familiar with statistics and economics, we have fuzzier community norms explicitly intended to avoid various forms of predictable stupidity, and we definitely have our own internal meme population. It’s exactly the sort of community which can potentially outperform whole large fields, because of the median researcher problem. On the other hand, that does not mean that those fields are going to recognize LessWrong as a thought-leader or whatever.
In academic biomedicine, at least, which is where I work, it’s all about tech dev. Most of the development is based on obvious signals and conceptual clarity. Yes, we do study biological systems, but that comes after years, even decades, of building the right tools to get a crushingly obvious signal out of the system of interest. Until that point all the data is kind of a hint of what we will one day have clarity on rather than a truly useful stepping stone towards it. Have as much statistical rigor as you like, but if your methods aren’t good enough to deliver the data you need, it just doesn’t matter. Which is why people read titles, not figure footnotes: it’s the big ideas that really matter, and the labor going on in the labs themselves. Papers are in a way just evidence of work being done.
That’s why I sometimes worry about LessWrong. Participants who aren’t professionally doing research and spend a lot of time critiquing papers over niche methodological issues may be misallocating their attention, or searching under the spotlight. The interesting thing is growth in our ability to measure and manipulate phenomena, not the exact analysis method in one paper or another. What’s true will eventually become crushingly obvious and you won’t need fancy statistics at that point, and before then the data will be crap so the fancy statistics won’t be much use. Obviously there’s a middle ground, but I think the vast majority of time is spent in the “too early to tell” or “everybody knows that” phase. If you can’t participate in that technology development in some way, I am not sure it’s right to say you are “outperforming” anything.
I don’t see how this is any evidence against John’s point.
Presumably the reason you need such crushingly obvious results which can be seen regardless of the validity of your statistical tool before the field can move on is because you need to convince the median researchers.
The sharp researchers have predictions about where the field is going based on statistical evidence and mathematical reasoning, and presumably can be convinced of the ultimate state far before the median, and work toward proving or disproving their hypotheses, and then once it’s clear to them, making the case stupidly obvious for the lowest common denominator in the room. And I expect this is where most of the real conceptual progress lies.
Even in the world where, as you claim, this is a marginal effect, if we could speed up any given advance in academic biomedicine by a year, that is an incredible achievement! Many people may die in that year who could’ve been saved had the median not wasted time (assuming the year saved carries over to clinical medicine).
It’s not evidence, it’s just an opinion!
But I don’t agree with your presumption. Let me put it another way. Science matters most when it delivers information that is accurate and precise enough to be decision-relevant. Typically, we’re in one of a few states:
The technology is so early that no level of statistical sophistication will yield decision-relevant results. Example: most single-cell omics in 2024 that I’m aware of, with respect to devising new biomedical treatments (this is my field).
The technology is so mature that any statistics required to parse it are baked into the analysis software, so that they get used by default by researchers of any level of proficiency. Example: Short read sequencing, where the extremely complex analysis that goes into obtaining and aligning reads has been so thoroughly established that undergraduates can use it mindlessly.
The technology’s in a sweet spot where a custom statistical analysis needs to be developed, but it’s also so important that the best minds will do that analysis and a community norm exists that we defer to them. Example: clinical trial results.
I think what John calls “memetic” research is just areas where the topics or themes are so relevant to social life that people reach for early findings in immature research fields to justify their positions and win arguments. Or where a big part of the money in the field comes from corporate consulting gigs, where the story you tell determines the paycheck you get. But that’s not the fault of the “median researcher,” it’s a mixture of conflicts of interest and the influence of politics on scientific research communication.
The argument seems to be about this stage, and from what I’ve heard clinical trials indeed take so much more time than is necessary. But maybe I’ve only heard about medical clinical trials, and actually academic biomedical clinical trials are incredibly efficient by comparison.
It also sounds like “community norm exists that we defer to [the best minds]” requires the community to identify who the best minds are, which presumably involves critiquing the research outputs of those best minds according to the standards of the median researcher, which often (though I don’t know about biomedicine) ends up being something crazy like h-index or number of citations or number of papers or derivatives of such things.
It’s not necessary for each person to personally identify the best minds on all topics and exclusively defer to them. It’s more a heuristic of deferring to the people those you trust most defer to on specific topics, and calibrating your confidence according to your own level of ability to parse who to trust and who not to.
But really these are two separate issues: how to exercise judgment in deciding who to trust, and the causes of research being “memetic.” I still say research is memetic not because mediocre researchers are blithely kicking around nonsense ideas that take on an exaggerated life of their own, but mainly because of politics and business ramifications of the research.
The idea that wine is good for you is memetic both because of its way of poking at “established wisdom” and because the alcohol industry sponsors research in that direction.
Similar for implicit bias tests, which are a whole little industry of their own.
Clinical trials represent decades of investment in a therapeutic strategy. Even if an informed person would be skeptical that current Alzheimer’s approaches are the way to go, businesses that have invested in it are best served by gambling on another try and hoping to turn a profit. So they’re incentivized to keep plugging the idea that their strategy really is striking at the root of the disease.
I really feel like we’re talking past each other here, because I have no idea how any of what you said relates to what I said, except the first paragraph.
As for that, what you describe sounds worse than a median researcher problem; it sounds like a situation ripe for groupthink!
Yes, I agree it’s worse. If ONLY a better understanding of statistics by PhD students and research faculty was at the root of our cultural confusion around science.
I’m not sure the median researcher is particularly important here, relative to, say, the median lab leader.
The median voter theorem works explicitly because everyone’s votes are equal, but if you have a lab or research group leader who disincentivizes bad research practices, then you should theoretically get a lab with good research practices.
In practice, lab leaders are often people who Goodhart incentives, which results in the current situation.
LessWrong has a chance to be better exactly because it is outside the current system of perverse incentives. Although, it has its own bad incentives.
I had the thought while reading the original post that I recall speaking to at least one researcher who, pre-replication crisis, was like “my work is built on a pretty shaky foundation as is most of the research in this field, but what can you do, this is the way the game is played”. So that suggested to me that plenty of median researchers might have recognized the issue but not been incentivized to change it.
Lab leaders aren’t necessarily in a much better position. If they feel responsibility toward their staff, they might feel even more pressured to keep gaming the metrics so that the lab can keep getting grants and its researchers good CVs.
I agree that lab leaders are not in a much better position; I just think that lab leaders causally screen off the influence of subordinates, while incentives in the system causally screen off lab leaders.
I am not sure about the median researcher. Many fields have a few “big names” that everybody knows and whose opinions have disproportionate weight.
I think this is too strong. There are quite a few posts that don’t require knowledge of either one to write, read, or comment on. I’m certain that one could easily accumulate lots of karma and become a well-respected poster without knowing either.
Yeah, I didn’t read this post and come away with “and this is why LessWrong works great”, I came away with a crisper model of “here are some reasons LW performs well sometimes”, but more importantly “here is an important gear for what LW needs to work great.”
Our broader society has community norms which require basically everyone to be literate. Nonetheless, there are jobs in which one can get away without reading, and the inability to read does not make it that much harder to make plenty of money and become well-respected. These statements are not incompatible.
Hmm… let me rephrase: it doesn’t seem to me like we would actually have a clear community norm for this, at least not one strong enough to ensure that the median community member would actually be familiar with stats and econ.
There’s no norm saying you can’t be ignorant of stats and read, or even post about things not requiring an understanding of stats, but there’s still a critical mass of people who do understand the topic well enough to enforce norms against actively contributing with that illiteracy. (E.g. how do you expect it to go over if someone makes a post claiming that p=0.05 means that there’s a 95% chance that the hypothesis is true?)
Taking it a step further, I’d say my household “has norms which basically require everyone to speak English”, but that doesn’t mean the little one is quite there yet or that we’re gonna boot her for not already meeting the bar. It just means that she has to work hard to learn how to talk if she wants to be part of what’s going on.
Lesswrong feels like that to me in that I would feel comfortable posting about things which require statistical literacy to understand, knowing that engagement which fails to meet that bar will be downvoted rather than getting downvoted for expecting to find a statistically literate audience here.
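To make the p=0.05 example above concrete, here is a minimal sketch of why "significant at 0.05" does not translate into "95% chance the hypothesis is true". The prior, power, and threshold below are illustrative assumptions, not figures from this thread, and the function name is hypothetical. The sketch treats "the test came out significant" as the observed event and applies Bayes' rule.

```python
def prob_true_given_significant(prior, power, alpha):
    """P(hypothesis is true | test was significant), by Bayes' rule.

    prior: assumed fraction of tested hypotheses that are actually true
    power: assumed P(significant | hypothesis true)
    alpha: significance threshold, i.e. P(significant | hypothesis false)
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)


# Illustrative assumptions only: 10% of tested hypotheses are true,
# studies have 80% power, and the threshold is 0.05.
ppv = prob_true_given_significant(prior=0.10, power=0.80, alpha=0.05)
print(f"P(true | significant) = {ppv:.2f}")
```

Under these assumed numbers the answer is 0.64, not 0.95, and it moves a lot if the prior or the power changes. The point is only that the posterior depends on quantities the p-value does not contain.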
Show me a field where replication crises tear through, exposing fraud and rot and an emperor that never had any clothes, a field where replications fail so badly that they result in firings and polemics in the New York Times and destroyed careers- and then I will show you a field that is a little confused but has the spirit and will get there sooner or later.
What you really need to look out for are fields that could never, on a conceptual level, have a devastating replication crisis. Lesswrong sometimes strays a little close to this camp.
So… parapsychology? How’d that work out? Did they have the (ahem) spirit and get there sooner or later?
Personally I am quite pleased with the field of parapsychology. For example, they took a human intuition and experience (“Wow, last night when I went to sleep I floated out of my body. That was real!”) and operationalized it into a testable hypothesis (“When a subject capable of out of body experiences floats out of their body, they will be able to read random numbers written on a card otherwise hidden to them.”) They went and actually performed this experiment, with a decent deal of rigor, writing the results down accurately, and got an impossible result: one subject could read the card. (Tart, 1968.) A great deal of effort quickly went into further exploration (including military attention with the men who stare at goats etc) and it turned out that the experiment didn’t replicate, even though everyone involved seemed to genuinely expect it to. In the end, no, you can’t use an out of body experience to remotely view, but I’m really glad someone did the obvious experiments instead of armchair philosophizing.
https://digital.library.unt.edu/ark:/67531/metadc799368/m2/1/high_res_d/vol17-no2-73.pdf is a great read from someone who obviously believes in the metaphysical, and then does a great job designing and running experiments and accurately reporting their observations, and so it’s really only a small ding against them that the author draws the wrong larger conclusions in the end.
A field can be absolutely packed with dreadful research and still see virtually no one getting fired. Take, for instance, the moment a prominent psychologist dubbed peers who questioned methodological standards as “methodological terrorists.” It’s the kind of rhetoric that sends a clear message: questioning sloppy methods isn’t just unwelcome; it’s practically heretical.
This assumes the median researchers can’t recognize who the competent researchers are, or otherwise don’t look to them as thought leaders.
I’m not arguing that this isn’t often the case, just that it isn’t always the case. In engineering, if you’re more competent than everyone else, you can make cooler shit. If you’re a median engineer trying to figure out which memes to take on and spread, you’re going to be drawn to the work of the more competent engineers because it is visibly and obviously better.
In fields where distinguishing between bad research and good research has to be done by knowing how to do good research, rather than “does it fly or does it crash”, then the problem you describe is much more difficult to avoid. I argue that the difference between the fields which replicate and those which don’t is as much about the legibility of the end product as it is about the quality of the median researcher.
Eh, feels wrong to me. Specifically, this argument feels over-complicated.
As best I can tell, the predominant mode of science in replication-crisis affected fields is that they do causal inference by sampling from noisy posteriors.
The predominant mode of science in non-replication-crisis affected fields is that they don’t do this or do this less.
Most of the time it seems like science is conducted like that in those fields because they have to. Can you come up with a better way of doing Psychology research? “Science in hard fields is hard” is definitely a less sexy hypothesis, but it seems obviously true?
They’re measuring a noisy phenomenon, yes, but that’s only half the problem. The other half of the problem is that society demands answers. New psychology results are a matter of considerable public interest and you can become rich and famous from them. In the gap between the difficulty of supply and the massive demand grows a culture of fakery. The same is true of nutrition— everyone wants to know what the healthy thing to eat is, and the fact that our current methods are incapable of discerning this is no obstacle to people who claim to know.
For a counterexample, look at the field of planetary science. Scanty evidence dribbles in from occasional spacecraft missions and telescopic observations, but the field is intellectually sound because public attention doesn’t rest on the outcome.
Yes. More emphasis on concrete useful results, less emphasis on trying to find simple correlations in complex situations.
For example, “Do power poses work?”. They did studies like this one where they tell people to hold a pose for five minutes while preparing for a fake job interview, and then found that the pretend employers pretended to hire them more often in the “power pose” condition. Even assuming there’s a real effect where those students from that university actually impress those judges more when they pose powerfully ahead of time… does that really imply that power posing will help other people get real jobs and keep them past the first day?
That’s like studying “Are car brakes really necessary?” by setting up a short track and seeing if the people who run the “red light” progress towards their destination quicker. Contrast that with studying the cars and driving behaviors that win races, coming up with your theories, and testing them by trying to actually win races. You’ll find out very quickly if your “brakes aren’t needed” hypothesis is a scientific breakthrough or foolishly naive.
Instead of studying “Does CBT work?”, study the results of individual therapists, see if you can figure out what the more successful ones are doing differently than the less successful ones, and see if you can use what you learn to increase the effectiveness of your own therapy or the therapy of your students. If the answer turns out to be “The successful therapists all power pose pre-session, then perform textbook CBT” and that allows you to make better therapists, great. If it’s something else, then you get to focus on the things that actually show up in the data.
The results should speak for themselves. If they don’t, and you aren’t keeping in very close contact with real world results, then it’s super easy to go astray with internal feedback loops because the loop that matters isn’t closed.
The reason I trust research in physics in general is that it doesn’t end with publishing a paper. It often ends with building machines that depend on that research being right.
We don’t just “trust the science” that light is a wave; we use microwave ovens at home. We don’t just “trust the science” that relativity is right; we use the relativistic equations to adjust GPS measurements. Therefore it would be quite surprising to find out that any of these underlying theories is wrong. (I mean, it could be wrong, but it would have to be wrong in the right way that still keeps the GPS and the microwave ovens working. That limits the possibilities of what the alternative theory could be.)
Therefore, in a world where we all do power poses all the time, and if you forget to do them, you will predictably fail the exam...
...well, actually that could just be a placebo effect. (Something like seeing a black cat on your way to exam, freaking out about it, and failing to pay full attention to the exam.) Damn!
Well said. I’m gonna have to steal that.
Yeah, “Can I fail my exam” is a bad test, because when the test is “can I fail” then it’s easy for the theory to be “wrong in the right way”. GPS is a good test of GR because you just can’t do it without a better understanding of spacetime so it has to at least get something right even if it’s not the full picture. When you actually use the resulting technology in your day to day life and get results you couldn’t have gotten before, then it almost doesn’t matter what the scientific literature says, because “I would feel sorry for the good Lord. The theory is correct.”
There are psychological equivalents of this, which rest on doing things that are simply beyond the abilities of people who lack this understanding. The “NLP fast phobia cure” is a perfect example of this, and I can provide citations if anyone is interested. I really get a kick out of the predictable arguments between those who “trust the science” but don’t understand it, and those who actually do it on a regular basis.
This reminds me of an amusing anecdote.
I had a weird experience once where I got my ankle sprained pretty bad and found myself simultaneously indignantly deciding that my ankle wasn’t going to swell and also thinking I was crazy for feeling like swelling was a thing I could control—and it didn’t swell. I told my friend about this experience, and while she was skeptical and thought it sounded crazy, she tried it anyway and her next several injuries didn’t swell.
Eventually she casually mentioned to someone “Nah, my broken thumb isn’t going to swell because I decided not to”, and the person she was talking to responded as if she had said something else because his brain just couldn’t register what she actually said as a real possibility. She then got all self conscious about it and was kinda unintentionally gaslighted into feeling like she was crazy for thinking she could do that, and her thumb swelled up.
I had to call her and remind her “No, you don’t give up and expect it to swell because it ‘sounds crazy’, you intend for it to not swell anyway and find out whether it is something you can control or not”. The swelling went back down most of the way after that, though not to the same degree as in the previous cases where the injury never swelled in the first place.
The problem with this model is that the “bad” models/theories in replication-crisis-prone fields don’t look like random samples from a wide posterior. They have systematic, noticeable, and wrong (therefore not just coming from the data) patterns to them—especially patterns which make them more memetically fit, like e.g. fitting a popular political narrative. A model which just says that such fields are sampling from a noisy posterior fails to account for the predictable “direction” of the error which we see in practice.
I made an omission mistake in just saying “sampling from noisy posteriors”; note I didn’t say they were performing unbiased sampling.
To extend the Psychology example: a study could be considered a sampling technique of the noisy posterior. You appear to be arguing that the extent to which this is a biased sample is a “skill issue.”
I’m arguing that it is often very difficult to perform unbiased sampling in some fields; the issue might be a property of the posterior and not that the researcher has a weak prefrontal cortex. In this framing it would totally make sense if two researchers studying the same/correlated posterior(s) are biased in the same direction–it’s the same posterior!
My default assumption for any story that ends with “And this is why our ingroup is smarter than everyone else and people outside won’t recognize our genius” is that the story is self-serving nonsense, and this article isn’t giving me any reason to think otherwise.
A “userbase with probably-high intelligence, community norms about statistics and avoiding predictable stupidity” describes a lot of academia. And academia has a higher barrier to entry than “taking the time to write some blog articles”. The average lesswrong user doesn’t need to run an RCT before airing their latest pet theory for the community, so why would it be so much better at selectively spreading true memes than academia is?
I would need a much more thorough gears-level model of memetic spread of ideas, one with actual falsifiable claims (you know, like when people do science) before I could accept the idea of LessWrong as some kind of genius incubator.
Curated. I think this model is pretty useful and well-compressed, and I’m glad to be able to concisely link to it.
Insofar as it is accurate, the policy implications are still much open to debate, for here on LessWrong and for other ecosystems in the world.
This rings painfully true. As early as the late 1950s, at least one person was already raising a red flag about the risks that psychology[1] might veer into publishing a sea of false claims:
Sterling isn’t explicitly talking about psychology, but rather any field where significance tests are used.
Could you please elaborate on what you mean by “highly memetic” and “internal memetic selection pressures”? I’m probably not the right audience for this piece, but that particular word (memetic) is making it difficult for me to get to grips with the post as a whole. I’m confused if you mean there is a high degree of uncritical mimicry, or if you’re making some analogy to ‘genetic’ (and what that analogy is...)
It is indeed an analogy to ‘genetic’. Ideas “reproduce” via people sharing them. Some ideas are shared more often, by more people, than others. So, much like biologists think about the relative rate at which genes reproduce as “genetic fitness”, we can think of the relative rate at which ideas reproduce as “memetic fitness”. (The term comes from Dawkins back in the 70s; this is where the word “meme” originally came from, as in “internet memes”.)
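As a rough illustration of "relative rate at which ideas reproduce", here is a toy sketch using discrete replicator dynamics: each idea's share of attention is multiplied by its fitness and then renormalized. The two idea names and their fitness values are made up for illustration, and this is a generic textbook-style model, not a model proposed in the post.

```python
def step(shares, fitness):
    """One generation: weight each idea's share by its fitness, then renormalize."""
    weighted = {idea: shares[idea] * fitness[idea] for idea in shares}
    total = sum(weighted.values())
    return {idea: weight / total for idea, weight in weighted.items()}


def simulate(shares, fitness, generations):
    """Apply `step` repeatedly and return the final shares."""
    for _ in range(generations):
        shares = step(shares, fitness)
    return shares


# Hypothetical ideas with hypothetical relative sharing rates.
initial = {"careful_result": 0.5, "catchy_result": 0.5}
fitness = {"careful_result": 1.0, "catchy_result": 1.2}

final = simulate(initial, fitness, generations=20)
for idea, share in final.items():
    print(f"{idea}: {share:.3f}")
```

Even a modest per-generation advantage in how often an idea gets shared compounds, so the higher-fitness idea ends up with most of the share regardless of which one is more accurate.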
I think you’re using “memetic” to mean “of high memetic fitness”, and I wish you wouldn’t. No one uses “genetic” in that way.
An idea that gets itself copied a lot (either because of “actually good” qualities like internal consistency, doing well at explaining observations, etc., or because of “bad” (or at least irrelevant) ones like memorability, grabbing the emotions, etc.) has high memetic fitness. Similarly, a genetically transmissible trait that tends to lead to its bearers having more surviving offspring with the same trait has high genetic fitness. On the other hand, calling a trait genetic means that it propagates through the genes rather than being taught, formed by the environment, etc., and one could similarly call an idea or practice memetic if it comes about by people learning it from one another rather than (e.g.) being instinctive or a thing that everyone in a particular environment invents out of necessity.
When you say, e.g., “lots of work in that field will be highly memetic despite trash statistics, blatant p-hacking, etc.” I am pretty certain you mean “of high memetic fitness” rather than “people aware of it are aware of it because they learned of it from others rather than because it came to them instinctively or they reinvented it spontaneously because it was obvious from what was around them”.
(It would be possible, though I’d dislike it, to use “memetic” to mean something like “of high memetic fitness for ‘bad’ reasons”—i.e., liable to be popular for the sort of reason that we might not appreciate without the notion of memes. But I don’t think that can be your meaning in the words I quoted, which seem to presuppose that the “default” way for a piece of work to be “memetic” is for it to be of high quality.)
I have split feelings on this one. On the one hand, you are clearly correct that it’s useful to distinguish those two things and that my usage here disagrees with the analogous usage in genetics. On the other hand, I have the vague impression that my usage here is already somewhat standard, so changing to match genetics would potentially be confusing in its own right.
It would be useful to hear from others whether they think my usage in this post is already standard (beyond just me), or they had to infer it from the context of the post. If it’s mostly the latter, then I’m pretty sold on changing my usage to match genetics.
Your use of “memetic” here did strike me as somewhat idiosyncratic; I had to infer it. I would have used “memetically viral” and derivatives in its place. (E. g., in place of “lots of work in that field will be highly memetic despite trash statistics”, I would’ve said “lots of ideas in that field will be highly viral despite originating from research with trash statistics” or something.)
To me memetic normally reads something like “has a high propensity to become a meme” or “is meme-like”. I had no trouble interpreting the post from this basis.
I push back against trying to hew closely to usages from the field of genetics. Fundamentally I feel like that is not what talking about memes is for; it was an analogy from the start, not meant for the same level of rigor. Further, memes and how meme-like things are is much more widely talked about than genetics, so insofar as we privilege usage considerations I claim switching to one matching genetics would require more inferential work from readers overall because the population of readers conversant with genetics is smaller.
I also feel like the value of speaking in terms of memes in the post is that the replication crisis is largely the fault of non-rigorous treatment; that is to say in many fields the statistical analysis parts really were/are more of a meme inside the field rather than a rigorous practice. People just read other people’s published papers’ analysis sections, and write something shaped like that, replicability be damned.
Yep, it seems like pretty standard usage to me (and IMO seems conceptually fine, despite the fact that “genetic” means something different, since for some reason using “memetic” in the same way feels very weird or confused to me, like I would almost never say “this has memetic origin”)
… though now that it’s been pointed out, I do feel like I want a short handle for “this idea is mostly passed from person-to-person, as opposed to e.g. being rederived or learned firsthand”.
I also kinda now wish “highly genetic” meant that a gene has high fitness; that usage feels like it would be more natural.
I think in principle it makes sense in the same sense “highly genetic” makes sense. If a trait is highly genetic, then there’s a strong chance for it to be passed on given a reproductive event. If a meme is highly memetic, then there’s a strong chance for it to be passed on via an information transmission.
In genetic evolution it makes sense to distinguish this from fitness, because in genetic evolution the dominant feedback signal is whether you found a mate, not the probability a given trait is passed to the next generation.
In memetic evolution, the dominant feedback signal is the probability a meme gets passed on given a conversation, because there is a strong correlation between the probability someone passes on the information you told them, and getting more people to listen to you. So a highly memetic meme is also incredibly likely to be highly memetically fit.
I definitely had no trouble understanding the post, and the usage seems very standard among blogs I read and people I talk to.
This is rather weak evidence for your claim (“memeticity in a scientific field is mostly determined, not by the most competent researchers in the field, but instead by roughly-median researchers”), unless you additionally posit another mechanism like “fields with terrible replication rates have a higher standard deviation than fields without them” (why?).
Why would that be relevant?
If the means/medians are higher, the tails are usually higher as well.
Norm(μ=115, σ=15) distribution will have a much lower proportion of data points above 150 than Norm(μ=130, σ=15). Same argument for other realistic distributions. So if all I know about fields A and B is that B has a much lower mean than A, by default I’d also assume B has a much lower 99th percentile than A, and much lower percentage of people above some “genius” cutoff.
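The tail comparison above can be checked numerically. This is a minimal sketch using the two distributions and the cutoff of 150 named in the comment; the helper name `share_above` is made up for illustration, and real populations are only approximately normal.

```python
from statistics import NormalDist


def share_above(mean, sd, cutoff):
    """Fraction of a normally distributed population scoring above `cutoff`."""
    return 1.0 - NormalDist(mu=mean, sigma=sd).cdf(cutoff)


CUTOFF = 150  # the threshold used in the comment above

# Same standard deviation, means 15 points apart.
low_mean_share = share_above(115, 15, CUTOFF)
high_mean_share = share_above(130, 15, CUTOFF)

print(f"mean 115: {low_mean_share:.4f} of the population above {CUTOFF}")
print(f"mean 130: {high_mean_share:.4f} of the population above {CUTOFF}")
print(f"ratio:    {high_mean_share / low_mean_share:.1f}x")
```

Under these assumptions the share above the cutoff comes out at roughly 1% for the lower-mean field versus roughly 9% for the higher-mean field, so a one-standard-deviation shift in the mean changes the size of the far tail by close to an order of magnitude.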
Oh I see, you mean that the observation is weak evidence for the median model relative to a model in which the most competent researchers mostly determine memeticity, because higher median usually means higher tails. I think you’re right, good catch.
I disagree. At best, community norms require everyone to in principle be able to follow along with some statistical/economic argument.
That is a better fit with my experience of LW discussions. And I am not, in fact, familiar with statistics or economics to the extent I am with e.g. classical mechanics or pre-DL machine learning. (This is funny for many reasons, especially because statistical mechanics is one of my favourite subjects in physics.) But it remains the case that what I know of economics could fill perhaps a single chapter in a textbook. I could do somewhat better with statistics, but asking me to calculate ANOVA scores or check if a test in a paper is appropriate for the theories at hand is a fool’s errand.
I think the central claim is true in some fields and not particularly true in others. But one variation of the claim that interests me is, if we consider memeticity relative to public opinions/awareness, then I agree with this more strongly, where the opinions of median researchers are disproportionately more popular than more correct opinions. In some sense, I think the “most correct opinions” are basically never the most popular to the public, though perhaps I wouldn’t invoke the median researcher problem to explain this edge case, but a more specific hypothesis instead.
I mostly disagree with the final paragraph about LW. To me it seems that 1) opinions on LW are quite diverse 2) LW opinions are quite plausibly worse than average in some fields, e.g. mental health, and 3) there’s very little evidence to suggest that high intelligence leads to correct/rational opinions in the context of complicated beliefs about the world; people with high IQ are just as susceptible to converging on “easily disprovable” opinions, often they’re just better at deceiving themselves and others using rationalization.
I happened to read a Quanta article about equivalence earlier, and one of the threads is the difficulty of a field applying a big new concept without the expository and distillation work of putting stuff into textbooks/lectures/etc.
That problem pattern-matches with the replication example, but well-motivated at the front end instead of badly-motivated at the back end. It still feels like exposition and distillation are key tasks that govern the memes-in-the-field passed among median researchers.
I strongly suspect the crux of the replication crisis example is that while there are piles of exposition and distillation for probability/statistics, they are from outside whatever field experiences the problem, and each field needs its own internal completion of these steps for them to stick.
I think this should be a third specialization for every scientific field. We have theorists, we have experimentalists, and to this we should add analysts. Their work would specialize in establishing the statistical properties of experimental designs and equipment in the field on the one side, and the statistical requirements to advance various theories on the other side.
Could this be a more general problem that people in academia are very unlikely to recognize anything that originates outside of academia?
I mean, first you get your degree by listening to people in academia, and reading textbooks written by people in academia. Then you learn to solve problems, defined by people in academia, under the leadership of people in academia, mostly by reading a lot of papers written and reviewed by people in academia. Etc. I think it is very easy to forget that the universe outside of academia exists at all.
And even if you remember that the universe outside of academia exists, interacting with it is potentially dangerous for your reputation. If you cite something from an academically respected source, then… even if it later turns out to be completely wrong, it kinda wasn’t your mistake, was it? Compare to the situation where you cite something from a smart outsider… who also turns out to be completely wrong. Now you look like an idiot… and you can’t shift the blame to him (because he is outside the system) nor to his reviewers (because either he had none, or they were also outside the system).
Similarly, suppose that two respected scientists have expressed a similar idea, and you prefer one of the formulations over the other. No problem; just cite the one you prefer. Now suppose that a respected scientist wrote something about a topic, but a smart outsider wrote about something slightly more relevant for your specific case and/or he expressed it better. Citing the outsider instead of the respected scientist… will make you seem like you are not familiar with the existing research in your field. (Chances are, the respected scientist or his colleague will review your paper.)
So it’s part familiarity, part incentives.
Epistemic status: just a guess; I am not working in academia, so I only imagine what the incentives are.
Now assuming that this is true, or at least pointing in the right direction, the problem seems solvable, but it would require some work. For example, you can’t expect people in academia to go outside and read your research on their own initiative. But maybe if you publish it as a book that is easy to read, and donate them the book? Or maybe you could academia-wash your research by making someone inside the system a co-author.
Perhaps you could do both of these at the same time, like publish a popular science book that would contain articles both from LW researchers and academic researchers. And then the book would kinda be okay to cite.
If an outsider’s objective is to be taken seriously, they should write papers and submit them to peer review (e.g. conferences and journals).
Yann LeCun has gone so far as to say that independent work only counts as “science” if submitted to peer review:
“Without peer review and reproducibility, chances are your methodology was flawed and you fooled yourself into thinking you did something great.”—https://x.com/ylecun/status/1795589846771147018?s=19.
From my experience, professors are very open to discussing ideas and their work with anyone who seems serious, interested, and knowledgeable. Even someone inside academia will face skepticism if their work uses completely different methods. They will have to very convincingly prove the methods are valid.
I agree, but there are two different perspectives:
whether the outsider wants to be taken seriously by academia
whether the people in academia want to collect knowledge efficiently
From the first perspective, of course, if you want to be taken seriously, you need to play by their rules. And if you don’t, then… those are your revealed preferences, I guess.
It is the second perspective I was concerned about. I agree that the outsiders are often wrong. But, consider the tweet you linked:
It seems to me that from the perspective of a researcher, taking ideas of the outsiders who have already developed successful products based on them, and examining them scientifically (and maybe rejecting them afterwards), should be low-hanging fruit.
I am not suggesting to treat the ideas of the outsiders as scientific. I am suggesting to treat them as “hypotheses worth examining”.
Refusing to even look at a hypothesis because it is not scientifically proven yet, that’s putting the cart before the horse. Hypotheses are considered first, scientifically proved later; not the other way round. All scientific theories were non-scientific hypotheses first, at the moment they were conceived.
Choosing the right hypothesis to examine is an art. Not a science yet; that is what it becomes after we examine it. In theory, any (falsifiable) hypothesis could be examined scientifically, and afterwards confirmed or rejected. In practice, testing completely random hypotheses would be a waste of time; they are 99.9999% likely to be wrong, and if you don’t find at least one that is right, your scientific career is over. (You won’t become famous by e.g. testing a million random objects and scientifically confirming that none of them defies gravity. Well, you probably would become famous actually, but in the bad way.)
From the Bayesian perspective, what you need to do is test hypotheses that have a non-negligible prior probability of being correct. From the perspective of the truth-seeker, that’s because both the success and the (more likely) failure contribute non-negligibly to our understanding of the world. From the perspective of a scientific career-seeker, because finding the correct one is the thing that is rewarded. The incentives are almost aligned here.
I think that the opinions of smart outsiders have maybe 10% probability of being right, which makes them hypotheses worth examining scientifically. (The exact number would depend on what kind of smart outsiders we are talking about here.) Even if 10% right is still 90% wrong. Why do I claim that 10% is a good deal? Because when you look at the published results (the actual “Science according to the checklist”) that passed the p=0.05 threshold… and later half of them failed to replicate… then the math says that their prior probability was less than 10%.
(Technically, with prior probability 10%, and 95% chance of a wrong hypothesis being rejected, out of 1000 original hypotheses, 100 would be correct and published, 900 would be incorrect and 45 of them published. Which means, out of 145 published scientific findings, only about a third would fail to replicate.)
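The arithmetic in the parenthetical above can be sketched in a few lines. The function names are hypothetical, and the model is the simplified one the comment uses: every true hypothesis is detected and published (power of 1.0), every false one passes with probability 0.05, and a published false result is what later fails to replicate.

```python
def failed_replication_share(prior, alpha=0.05, power=1.0):
    """Share of published findings that are false positives.

    prior: probability that a tested hypothesis is true
    alpha: chance that a false hypothesis still passes the test
    power: chance that a true hypothesis passes (1.0 matches the comment's
           assumption that all 100 correct hypotheses get published)
    """
    true_published = prior * power
    false_published = (1.0 - prior) * alpha
    return false_published / (true_published + false_published)


def prior_for_failure_share(share, alpha=0.05, power=1.0):
    """Prior at which the given share of published findings is false.

    Obtained by solving share = (1 - p) * alpha / (p * power + (1 - p) * alpha)
    for p.
    """
    return alpha * (1.0 - share) / (share * power + alpha * (1.0 - share))


# Prior of 10%: 45 false positives out of 145 published findings.
print(f"{failed_replication_share(0.10):.3f}")

# Prior needed for half of the published findings to be false.
print(f"{prior_for_failure_share(0.5):.3f}")
```

In this toy model a 10% prior gives a failure share of 45/145, about 31%, and half of published findings fail only when the prior is near 5%. Lower real-world power would raise the failure share at any given prior, so the exact threshold depends on that assumption.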
So we have a kind of motte-and-bailey situation here. The motte is that opinions of smart outsiders, no matter how popular, no matter how commercially successful, should not be treated as scientific. The bailey is that the serious researchers should not even consider them seriously as hypotheses; in other words that their prior probability is significantly lower than 10% (because hypotheses with prior probability about 10% are actually examined by serious researchers all the time).
And what I suggest here is that maybe the actual problem is not that the hypotheses of smart and successful outsiders are too unlikely, but rather that exploring hypotheses with 10% prior probability is a career-advancing move if those hypotheses originate within academia, but a career-destroying move if they originate outside of it. With the former, you get a 10% chance of successfully publishing a true result (plus a 5% chance of successfully publishing a false result), and 85% chance of being seen as a good scientist who just wasn’t very successful so far. With the latter, you get a 90% chance of being seen as a crackpot.
Returning to Yann LeCun’s tweet… if you invent some smart ideas outside of academia, and you build a successful product out of them, but academia refuses to even look at them because the ideas are now coded as “non-scientific” and anyone who treats them seriously would lose their academic status… and therefore we will never have those ideas scientifically confirmed or rejected… that’s not just a loss for you, but also for science.
In entrepreneurship, there is the phrase “ideas are worthless”. This is because everyone already has lots of ideas they believe are promising. Hence, a pre-business idea is unlikely to be stolen.
Similarly, every LLM researcher already has a backlog of intriguing hypotheses paired with evidence. So an outside idea would have to seem more promising than the backlog. Likely this will require the proposer to prove something beyond evidence.
For example, Krizhevsky/Sutskever/Hinton had the idea of applying then-antiquated neural nets to recognize images. Only when they validated this in the ImageNet competition did their idea attract more researchers.
This is why ideas/hypotheses—even with a bit of evidence—are not considered very useful. What would be useful is to conclusively prove an idea true. This would attract lots of researchers … but it turns out to be incredibly difficult to do, and in some cases requires sophisticated techniques. (The same applies to entrepreneurship. Few people will join/invest until you validate the idea and lower the risk.)
Sidenote: There are countless stories of academics putting forth original, non-mainstream ideas, only to be initially rejected by their peers (e.g. Cantor’s infinity). I believe this not to be an issue just with outsiders, but merely that extraordinary claims require extraordinary proof. ML is an interesting example, because lots of so-called outsiders without a PhD now present at conferences!
What would this say about subculture gatekeeping? About immigration policy?
I think one thing that’s missing here is that you’re making a first-order linear approximation of “research” as just continually improving in some direction. I would instead propose a quadratic model where there is some maximal mode of activity in the world, but this mode can face certain obstacles that people can remove. Research progress is what happens when there’s an interface for removing obstacles that people are gradually developing familiarity with (for instance because it’s a newly developed interface).
Different people reach the equilibrium at different speeds, but generally those who have an advantage would also exhibit an explosion of skills and production as they use their superior understanding of the interface.
Complicated analysis (like going far beyond p-values) is easy for anyone to see, and it is evidence of effort. Complex analysis usually co-occurs with thoroughness, so fewer mistakes. Complicated analysis co-occurs with many concurrent tests, so there is less need to produce positive results and so less p-hacking. Consequently, there is a fairly simple solution to researchers with mediocre statistical skills gaining too much trust: more plots! Anyway, I find correlation graphs and multiple comparison impressive. Also, I am usually more skilled in data analysis than in the subject of a paper, so I can more easily verify that.
Is your concern simply the ‘median’ researchers’ unfamiliarity with basic statistics? Or the other variables that typically accompany a researcher without basic statistics knowledge?
On a different note,
Due to your influence/status in the field, I think it would be awesome for you to set a clear-cut resource detailing what you would like to see the ‘median’ researchers do that they are not (other than the obvious frustration regarding statistics incompetence stated above).
I don’t think statistics incompetence is the One Main Thing, it’s just an example which I expect to be relatively obvious and legible to readers here.
It’s not obvious to me that this is true, except insofar as a small research community can be so unusually smart/competent/etc that their median researcher is better than a whole field’s median researcher so they get better selection pressure “for free”. But if an idea’s popularity in a wide field is determined mainly by its appeal to the median researcher, I would naturally expect its popularity in a small community to be determined mainly by its appeal to the median community member.
This claim looks like it’s implying that research communities can build better-than-median selection pressures but, can they? And if so why have we hypothesized that scientific fields don’t?
I’m a bit surprised this is the crux for you. Smaller communities have a lot more control over their gatekeeping because, like, they control it themselves, whereas the larger field’s gatekeeping is determined via open-ended incentives in the broader world that thousands (maybe millions?) of people have influence over. (There are also things you could do in addition to gatekeeping. See Selective, Corrective, Structural: Three Ways of Making Social Systems Work)
(This doesn’t mean smaller research communities automatically have good gatekeeping or other mechanisms, but it doesn’t feel like a very confusing or mysterious problem on how to do better)
Does the field of social psychology not control the gatekeeping of social psychology? I guess you could argue that it’s controlled by whatever legislative body passes the funding bills, but most of the social psychology incentives seem to be set by social psychologists, so both small and large communities control their gatekeeping themselves and it’s not obvious to me why smaller ones would do better.
At some level of smallness your gatekeeping can be literally one guy who decides whether an entrant is good enough to pass the gate, and I acknowledge that that seems like it could produce better than median selection pressure. But by the time you get big enough that you’re talking about communities collectively controlling the gatekeeping… aren’t we just describing the same system at a population of one thousand vs one hundred thousand?
I could imagine an argument that yes actually, differences of scale matter because larger communities have intrinsically worse dynamics for some reason, but if that’s the angle I would expect to at least hear what the reason is rather than have it be left as self-evident.
An individual Social Psychology lab (or loose collection of labs) can choose who to let in.
Frontier Lab AI companies can decide who to hire, and what sort of standards they want internally (and maybe, in a loose alliance with other Frontier Lab companies).
The Immoral Mazes sequence outlines some reasons that you might think large institutions are dramatically worse than smaller ones (see: Recursive Middle Manager Hell for a shorter intro, although I don’t spell out the part of the argument about how mazes are sort of “contagious” between large institutions)
But the simpler argument is “the fewer people you have, the easier it is for a few leaders to basically make personal choices based on their goals and values,” whereas selection effects result in the largest institutions being better modeled as “following incentives” rather than “pursuing goals on purpose.” (If an organization didn’t follow the incentives, it’d be outcompeted by one that does)
How do you think competent people can solve this problem within their own fields of expertise?
For example, the EA community is a small & effective community like you’ve referenced for commonplace charity/altruism practices.
How could we solve the median researcher problem & improve the efficacy & reputation of altruism as a whole?
Personally, I suggest taking a marketing approach. If we endeavor to understand important similarities between “median researchers”, so that we can talk to them in the language they want to hear, we may be able to attract attention from the broader altruism community which can eventually be leveraged to place EA in a position of authority or expertise.
What do you think?
Spurious correlation here, big time, imho.
Give me the natural content of the field and I bet I can easily predict whether it may or may not have a replication crisis, w/o knowing the exact type of students it attracts.
I think it’s mostly that the fields where bad science may be sexy and less-trivial/unambiguous to check, or, those where you can make up/sell sexy results independently of their grounding, may, for whichever reason, also be those that attract the non-logical students.
Agree though with the mob overwhelming the smart outliers, but I just think how much that mob creates a replication crisis is at least in large part dependent on the intrinsic nature of the field rather than on the exact IQs.
We argue that memeticity—the survival and spread of ideas—is far more complex than the influence of average researchers or the appeal of articulate theories. Instead, the persistence of ideas in any field depends on a nuanced interplay of feedback mechanisms, boundary constraints, and the conditions within the field itself.
In fields like engineering, where feedback is often immediate and tied directly to empirical results, ideas face constant scrutiny through testing and validation. Failed practices here are swiftly corrected, creating a natural selection process that fosters robustness. Theories in engineering and similar disciplines rely on mathematical modeling to bridge concepts with real-world outcomes. This alignment between model and outcome isn’t instantaneous, but the structural setup of the field encourages what we might call “antifragility”: ideas that survive these feedback loops emerge stronger and more reliable, not solely because of the competence of individual researchers but because of the field’s built-in corrective pressures.
In contrast, fields like social sciences or linguistics often lack such direct empirical anchors. Theories in these areas can persist on the basis of articulation, cultural resonance, or ideological alignment, sometimes for decades. The classic linguistic theories of the 1970s, for instance, endured largely because they fit well within the intellectual climate of the time, with little empirical scrutiny available to challenge their assumptions. Without rigorous feedback, such theories may linger, shaping academic thought without the resilience-testing that empirical pressure imposes.
The emergence of large language models (LLMs) introduces a new dimension of feedback in these traditionally insulated fields. LLMs can analyze extensive linguistic and behavioral data, revealing patterns that either align with or contradict established theories. This new capacity acts as an initial “stress test” for long-standing ideas, challenging assumptions that may have previously gone unexamined. However, while LLMs provide valuable insights, they are not infallible arbiters of truth. Their analysis depends on training data that can inherit biases from past frameworks, so they function as a starting point rather than a comprehensive solution. The true rigor of empirical validation—akin to engineering’s feedback loops—remains essential for developing resilient theories.
In summary, the memetic success of ideas depends not just on the competency or articulacy of individual researchers but on how effectively feedback mechanisms, field boundaries, and empirical standards shape those ideas. Fields with strong, built-in corrective feedback—often mathematically modeled—are inherently more resilient to the persistence of weak ideas. Fields without such constraints are vulnerable to influence by articulation alone, creating environments where ideas can thrive without robust validation. The introduction of LLMs offers a valuable corrective force, but one that must be used with awareness of its limitations. By integrating empirical rigor and maintaining reflective practices, disciplines across the spectrum can ensure that memeticity aligns more closely with resilience, rather than rhetorical appeal alone.
Well, I hope that the self-importance shown in this post is not a true reflection of the community; although unfortunately I think it might well be.
Consistency in research fields and protection against elementary methodological malpractices such as p-hacking and the like should be enforced through automation. IQ over median does not correlate with creativity over median, as indicated by recent research, so I wouldn’t worry too much about this side of your argument. I think future research in general has to contemplate the best way to harvest human creativity while ensuring that consistency, novelty and methodological robustness are enforced through automation.
That’s not what that paper says. It says that IQ over 110 or so (quite above median) correlates less strongly (but still positively) with creativity. In Chinese children, age 11-13.
Correlation value over IQ at 100 seems to be already well under the variance so not really meaningful, and if you look at what the researchers call Originality, the correlation is actually negative over IQ 110.
Just as a correction to your comment, I am not stating this as an adamant fact, but as an “indication” not a “demonstration”, I said: “indicated by recent research”
I understand the reference I pointed out has a limited scope (Chinese children, age 11-13), as any research of this kind, but beyond the rigorous scientific demonstration of this concept, I am expressing the fact that IQ tests are very incomplete, which is not novel.
Thank you for your response.