Why Are The Human Sciences Hard? Two New Hypotheses

“Reasoning about the relative hardness of sciences is itself hard.”
—the B.A.D. philosophers


Epistemic status: Conjecture. Under a suitable specification of the problem, we have credence ~50% on the disjunction of our hypotheses explaining >1% of the variance (e.g., in $R^2$ values) between disciplines, and 1% on our hypotheses explaining >50% of such variance.

The Puzzle: A Tale of Two Predictions

Imagine two scientific predictions:

Prediction A: Astronomers calculate the trajectory of Comet NEOWISE as it approaches Earth, predicting its exact position in the night sky months in advance. When the date arrives, there it is—precisely where they said it would be.

Prediction B: Political scientists forecast the outcome of an election, using sophisticated models built on polling data, demographic trends, and historical patterns. When the votes are counted, the results diverge wildly from many predictions.

Why does science seem to struggle so much more with predicting human behavior than with predicting physical phenomena? This gap in predictive performance between the human sciences and natural sciences is widely acknowledged—yet explanations for it often feel unsatisfying.

“Because they deal with systems that are highly complex, adaptive and not rigorously rule-bound, the human sciences are among the most difficult of disciplines, both methodologically and intellectually...”

—Drezner (2012) “A Different Agenda,” Nature, p. 271.

The usual explanation is that human behavior is simply more complex than physical systems. But is that the whole story? We don’t think so.

In this post, we present two novel hypotheses for why the human sciences appear harder than other sciences—hypotheses that focus not on the inherent complexity of the subject matter, but on structural features of scientific inquiry itself. We don’t claim these hypotheses explain the entire difficulty gap, only that they likely explain some fraction of the difference between disciplines:

  1. Rigid Demands Hypothesis (RD): In the human sciences, we are pre-committed to specific prediction tasks to a greater degree than in the physical sciences, limiting a powerful strategy for scientific progress—changing the question to something more tractable.

  2. Fruit in the Hand Hypothesis (FTH): Due to evolutionary pressures, humans already have relatively high baseline performance for many “low-hanging fruit” prediction tasks concerning human behavior—more so than for many physical domains—making progress beyond this baseline comparatively more challenging.

This post is based on a paper by Daniel Herrmann, Aydin Mohseni, and Gabe Orona Avakian. All proofs can be found in the mathematical appendix of that paper.

The Typical Explanations: Complexity, Methods, and Incentives

Before introducing our new hypotheses, let’s briefly survey the existing explanations for why the human sciences are hard. Most fall into three categories:

Problems in Subject Matter

The most common explanation is that human behavior is inherently more complex than physical systems:

  • Social phenomena have more variables and feedback loops

  • Human systems are context-dependent and culturally influenced

  • People respond to predictions about them, creating self-fulfilling or self-defeating prophecies

  • Individual differences and variation make generalization difficult[1]

Problems in Methods

Others point to methodological limitations:

  • Lack of unified theoretical frameworks

  • Difficulty conducting controlled experiments

  • Statistical challenges and questionable research practices

  • Vulnerability to bias on politically charged topics[2]

Problems in Incentives

Still others look to institutional factors:

  • Insufficient funding compared to natural sciences

  • Publication bias toward surprising or counterintuitive findings

  • Perverse incentives in the absence of easy verification[3]

These explanations tend to focus on specific, local features of the human sciences. Our approach differs by highlighting general, structural features of scientific inquiry itself.[4]

Our Approach: Prediction Tasks as the Unit of Analysis

Before we can analyze why some sciences might be harder than others, we need to clarify what “hardness” even means in this context. This is not a trivial task—it requires making substantive choices about how to operationalize and measure scientific difficulty.

When we say “the human sciences are hard,” what exactly do we mean? We’re not claiming they require more intellect or effort. Rather, we’re suggesting something like:

In the human sciences, we tend to exhibit worse performance on the prediction tasks of interest, relative to performance on the prediction tasks of other sciences.

We have made several moves here:

  1. We have made explicit the comparative nature of the hardness judgement.

  2. We have cashed out hardness in terms of some sort of perceived performance.

  3. We have chosen to operationalize performance in terms of prediction tasks.

In particular, we focus on prediction tasks because:

  1. They provide a clear metric for success

  2. Even explanation and understanding can be framed as forms of prediction

  3. Prediction success or failure is relatively uncontroversial to measure

Alternative Choices: Unpacking Hardness

Clearly, science is not only about performance on prediction tasks: explanation, understanding, and guiding interventions are just some of the ways that a science can deliver value. (Even though we can think of prediction as subsuming some of these.)[5]

Other plausible metrics for the (perceived) hardness of a science might include reproducibility rates, theoretical unification, methodological sophistication, and so on. That said, we think focusing on prediction tasks as the unit of performance provides valuable insights while remaining (relatively) straightforward to assess. And so, it’s where we start.

A Model for Understanding Disciplinary Difficulty

Now, let’s formalize our hypotheses with a simple model. This requires making explicit choices about how to represent scientific disciplines and their difficulty—choices that themselves illustrate how challenging it is to reason rigorously about disciplinary hardness:

  • Let $\mathcal{T}$ be the set of all possible prediction tasks (e.g., predicting comet trajectories, drug effects, election outcomes)

  • A hardness function $h : \mathcal{T} \to \mathbb{R}$ assigns a hardness value to each task (lower values mean “better performance”, i.e., being “less hard”)[6]

  • A discipline $D$ is a subset of $\mathcal{T}$ (e.g., physics, biology, psychology)

  • We judge disciplines by their most successful performances: $H(D) = \min_{t \in D} h(t)$

This last point is crucial: we tend to evaluate fields by their greatest successes, not their average performance or their failures. Physics seems impressive because of its stunning bullseye predictions, not because it can solve every problem.
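
To make these modeling choices concrete, here is a minimal simulation sketch of the setup (our own illustration, not taken from the paper; the lognormal hardness distribution and the function names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_discipline(n_tasks: int) -> np.ndarray:
    """Draw hardness values h(t) for a discipline's tasks.

    The lognormal distribution is an arbitrary modeling choice;
    lower values mean better (less hard) performance.
    """
    return rng.lognormal(mean=0.0, sigma=1.0, size=n_tasks)

def perceived_hardness(discipline: np.ndarray) -> float:
    """H(D): judge a discipline by its single most successful
    performance, i.e., the minimum hardness among its tasks."""
    return float(discipline.min())

physics_like = sample_discipline(n_tasks=1_000)
psychology_like = sample_discipline(n_tasks=1_000)

print(perceived_hardness(physics_like), perceived_hardness(psychology_like))
```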

Alternative Choices: Unpacking Perceived Performance

Our model measures a discipline’s perceived performance by its most impressive achievement—essentially capturing the “best case” scenario. And again, this represents just one of several possible approaches to quantifying scientific success.

We could instead consider for each field:

  • Mean or median performance across all prediction tasks

  • The distribution of successes (e.g., variance or skewness)

  • Weighted averages favoring more socially important predictions

  • Average performance of the top $k$ most successful predictions

  • Average performance of the top $x\%$ of predictions

  • etc.

Our focus on the most impressive achievements aligns with how people often judge disciplines psychologically—through exemplars and standout discoveries rather than modal performance. We would guess that our analysis is robust under choices of hardness metrics that take a generalized mean of a suitably small subset of the most successful performances in a field (e.g., considering the average of the top 5 predictions rather than just the single best one), but other choices of metrics might yield different conclusions about relative discipline difficulty.
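
To illustrate how much the verdict can depend on this choice, here is a small sketch computing a few of the alternative metrics above on one simulated set of task hardnesses (our own illustration; the distribution and variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
hardness = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # simulated h(t) values for one field

sorted_h = np.sort(hardness)  # most successful (least hard) tasks first
metrics = {
    "best single performance (min h)": sorted_h[0],
    "mean hardness":                   hardness.mean(),
    "median hardness":                 np.median(hardness),
    "mean of top-5 successes":         sorted_h[:5].mean(),
    "mean of top 1% of successes":     sorted_h[: max(1, len(hardness) // 100)].mean(),
}

for name, value in metrics.items():
    print(f"{name:33s} {value:.3f}")
```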

Hypothesis 1: Rigid Demands

Core Intuition

Scientists don’t just discover facts—they choose which questions to ask. This choice profoundly affects how successful a discipline appears.

Vignette: The Freedom to Choose More Tractable Problems

Imagine two researchers:

Physicist Alice wants to understand gas behavior. She realizes predicting the motion of individual gas particles is nearly impossible. Instead, she shifts to studying the relationship between temperature, pressure, and volume—variables that emerge at scale and follow elegant mathematical laws. Her predictions are precise and widely applicable.

Education Researcher Bob wants to improve student outcomes. Policymakers demand: “Will this specific intervention increase future earnings?” Bob can’t redefine the question to something more tractable—the original question is exactly what matters for policy. He’s stuck with a complex prediction task whether he likes it or not.

In many natural sciences, researchers have substantial freedom to redefine their questions to make them more tractable. In the human sciences, particularly those with policy implications, researchers often face rigid demands to answer specific questions regardless of their tractability.

Formal Expression

We can express this hypothesis by considering the cardinality (size) of disciplines:

Proposition 1: Let the set of all task difficulties be IID and the distributions non-degenerate, and let $|D_1| > |D_2|$. Then $\mathbb{E}[H(D_1)] < \mathbb{E}[H(D_2)]$.

In plain language: Even if two disciplines have the same distribution of task difficulties, if one discipline has freedom to explore more tasks, it will likely achieve more impressive successes than the discipline with less freedom.[7]
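
A quick Monte Carlo check of this claim under one convenient assumption (IID lognormal task hardnesses; the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

def expected_best(n_tasks: int, n_trials: int = 10_000) -> float:
    """Estimate E[H(D)] = E[min h(t)] for a discipline that gets to
    explore n_tasks IID tasks."""
    draws = rng.lognormal(mean=0.0, sigma=1.0, size=(n_trials, n_tasks))
    return float(draws.min(axis=1).mean())

# A discipline facing rigid demands explores few tasks; a freer
# discipline explores many, and its best success looks more impressive.
print(expected_best(n_tasks=10))     # constrained discipline
print(expected_best(n_tasks=1_000))  # freer discipline: noticeably lower expected hardness
```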

Real-World Example: Economic Pre-Commitments

Economics must predict inflation, recessions, and employment—these questions are non-negotiable because policymakers need answers to them. Economists can’t say, “Actually, we’ll study something more tractable instead.”

When entire institutions like central banks, governments, and financial markets depend on a yes/no answer to a specific question about interest rates or inflation, the discipline can’t pivot to easier or more general tasks. The stakes and external demands lock economists into particular prediction challenges, regardless of their tractability.

In contrast, physicists were not committed to discovering the periodic table, fields, or quantum wave functions. Many of the great successes of physics are answers to questions no one would have thought to ask just decades before they were discovered. The hard sciences were formed when frontiers of highly tractable and promising theorizing opened up.[8]

The rigid demands placed on many human sciences limit a powerful strategy for scientific progress: changing the question to something more tractable.

Hypothesis 2: Fruit in the Hand

Core Intuition

Some prediction tasks are already solved for us by evolution and enculturation. We carry these solutions around without recognizing them as scientific achievements.

Vignette: Unimpressive Predictions

Imagine the following prediction task:

“If someone in a room full of quietly working philosophers suddenly screams and throws a glass of water at the wall, shattering it, what will happen?”

You can predict with extremely high accuracy that people will stop writing, look up startled, and show surprise or alarm. There will be a pause while they figure out what is going on. If they think that no real threat is present, they will return to something like their previous activities. This prediction is more accurate than many “scientific” predictions—yet we don’t consider it a scientific achievement because it seems obvious.

Why? Because evolution has already given us sophisticated cognitive machinery for predicting basic social reactions. The “low-hanging fruit” of human behavior prediction has already been picked by natural selection.

In the natural sciences, we have no evolutionary advantage in predicting quantum behavior or chemical reactions. The “easy wins” in these domains remain available for science to claim as achievements.

Formal Expression

We can express this hypothesis in two ways:

1. Removing Already-Solved Tasks

If we imagine removing the easiest prediction tasks (which evolution has already solved) from the human sciences:

Proposition 2: Let $D$ be a discipline with IID tasks, let $t^*$ be the easiest task in $D$, and let $t$ be a random task in $D$. Then $H(D \setminus \{t^*\}) \geq H(D)$ and $\mathbb{E}[H(D \setminus \{t\})] \geq \mathbb{E}[H(D)]$.

In plain language: Removing the easiest tasks from a discipline makes it appear harder.
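
The same point in a minimal simulation (again assuming IID lognormal hardnesses):

```python
import numpy as np

rng = np.random.default_rng(3)

n_trials, n_tasks = 10_000, 50
draws = rng.lognormal(mean=0.0, sigma=1.0, size=(n_trials, n_tasks))
sorted_draws = np.sort(draws, axis=1)

best_full = sorted_draws[:, 0]             # H(D): hardness of the discipline's easiest task
best_without_easiest = sorted_draws[:, 1]  # H(D minus t*): best task once the easiest is removed

# Stripping out the "low-hanging fruit" that evolution already picked
# leaves a discipline that looks harder on average.
print(best_full.mean(), best_without_easiest.mean())
```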

2. Accounting for Impressiveness

Alternatively, we can define a function $w$ that measures how impressed we are when a prediction task is solved.

If we solve an “easy” social prediction task, we’re not impressed (low $w$ value) because our evolutionary intuitions already solved it. This makes the human sciences seem less impressive even when they achieve certain accurate predictions.[9]
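
One simple way to cash this out (our own illustrative gloss, not necessarily the form used in the paper) is to let impressiveness reward only the improvement over the folk baseline, where $b(t)$ denotes the hardness level our evolved intuitions already achieve on task $t$:

$$w(t) = \max\{\, b(t) - h(t),\; 0 \,\}$$

On this gloss, solving a task our folk psychology already handles ($b(t) \approx h(t)$) earns little credit, while matching that same accuracy in a domain with no evolved baseline can be very impressive.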

Illustration: Celestial Orbits vs Facial Expressions

As philosopher Jerry Fodor noted, psychology as a science must surpass “folk psychology” to be impressive. We already have intuitive theories about how minds work—we make predictions about others’ beliefs, desires, and behaviors constantly.

In physics, deriving a neat formula for planetary motion from first principles is mind-blowing because we had zero built-in intuition for elliptical orbits. Meanwhile, in daily life, we routinely predict sophisticated and nuanced human emotions, intentions, and behavior with remarkable precision—folk psychology handles that, so a formal study confirming the same generates little excitement. The “wow factor” is vastly different because one domain leverages evolved intuitions while the other operates entirely outside them.

To begin to develop a quantitative sense of this, consider that while the “impressive” task of predicting the trajectories of celestial bodies is characteristically addressed, quite satisfactorily, by a relatively small set of 2nd-order ODEs with $7N$ parameters for $N$ bodies[10], the comparatively “unimpressive” task of facial recognition has only recently been solved by neural networks requiring many millions of parameters.[11]
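
For reference, the governing equations here are just Newtonian gravitation: for $N$ bodies with positions $\mathbf{r}_i$ and masses $m_i$,

$$\ddot{\mathbf{r}}_i = \sum_{j \neq i} \frac{G m_j (\mathbf{r}_j - \mathbf{r}_i)}{\lVert \mathbf{r}_j - \mathbf{r}_i \rVert^3}, \qquad i = 1, \dots, N,$$

a system fully specified by the $N$ masses plus the $6N$ initial positions and velocities.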

Bringing It Together: Why These Hypotheses Matter

The big takeaway: You can’t simply observe that physics makes more accurate predictions than social science and conclude that social phenomena are inherently harder to predict.

Our model shows that:

  1. If a discipline has relatively less flexibility in changing its research questions to more tractable ones (Rigid Demands), it will appear less successful even if its domain isn’t inherently more complex.

  2. If the easy prediction tasks in a domain have already been solved by our evolved cognition (Fruit in the Hand), we’ll only notice the difficult remaining tasks, making the domain seem harder still.

Limitations and Extensions of Our Hypotheses

Our hypotheses open new perspectives on scientific difficulty, but it’s important to clarify their scope and implications.

Complementing Rather Than Contradicting Traditional Explanations

We recognize that human behavior may indeed be more complex than physical systems. Our hypotheses don’t contradict these complexity-based explanations. Rather, we suggest that even if the underlying tasks in human sciences were equally difficult (which they may not be), the structure of scientific inquiry would still make them appear harder due to the mechanisms we’ve described.

Scope and Magnitude of Our Explanation

To be clear about the explanatory power we’re claiming: these hypotheses explain only part of the judgment that human sciences are difficult. We expect that the distributions of task difficulties do genuinely vary across disciplines, likely in ways that others have proposed.

To put some tentative numbers on this: under a suitable precise formulation, we might place ~50% credence on our hypotheses explaining more than 1% of the variance (e.g., in $R^2$ values) between disciplines, but substantially less than 1% credence on them explaining more than 50% of such variance. Our primary contribution is demonstrating how structural features of inquiry can create or amplify apparent differences in difficulty, not claiming they explain all or even most of the difference.

Clarifying Folk Psychology’s Role

The existence of sophisticated folk psychology doesn’t mean that predicting human behavior is inherently easier. Quite the opposite: it suggests that the prediction tasks we assign to formal social science are precisely those where our evolved intuitions fail. The easy tasks have already been “picked” by evolution and excluded from what we consider science, leaving mainly the difficult ones.

Testing These Hypotheses Empirically

These hypotheses generate testable predictions. We could compare prediction tasks across disciplines, measuring both their intrinsic difficulty and how constrained researchers are in defining them. Surveys of scientists and laypeople could examine whether the best predictions in newer domains (where we lack evolutionary intuitions) appear more impressive relative to their actual difficulty. Such investigations would help determine the extent to which our proposed mechanisms contribute to the perceived difficulty gap between sciences.

Takeaways

  • Reasoning about the relative hardness of sciences is itself hard, and making this reasoning formal reveals substantive choices that have to be made in order to specify the content of claims regarding “hardness.”

  • The social sciences may appear harder than natural sciences partly due to structural features of inquiry, not just inherent complexity.

  • The Rigid Demands Hypothesis suggests that the social sciences may, on average, have less freedom to pursue more tractable questions than natural sciences, and this should impact our expectation of their success.

  • The Fruit in the Hand Hypothesis suggests that evolution has already given us solutions to many easy human behavior prediction tasks, leaving more difficult ones, on average, for the social sciences.

  • Our formal model shows how these factors can create the appearance of different levels of difficulty across disciplines, even if (hypothetically) the underlying distribution of task difficulties were precisely the same.


  1. ^
  2. ^
  3. ^
  4. ^

    Our favorite entry in this genre is undoubtedly Meehl’s (1978) “Theoretical Risks and Tabular Asterisks,” in which he wryly remarks (p. 807): “since (in 10 minutes of superficial thought) I easily came up with 20 features that make human psychology hard to scientize, I invite you to pick your own favorites.”

  5. ^
  6. ^

    One could use the coefficient of determination, $R^2$, as an operationalization of our hardness metric, $h$. With it, we can capture the typical difference in $R^2$ values across scientific domains. Ozili et al. (2022, p. 2) succinctly illustrate this variation:

    Typically, statisticians and scientists in the pure sciences will dismiss a model as “weak”, “unreliable” and “lacking a predictive power” if the reported R-square of the model is below 0.6.

    By contrast, they note that in the social sciences:

    …a low R-square of at least 0.1 is acceptable on the condition that some or most of the predictors or explanatory variables are statistically significant.

    An $R^2$ of .5 is a physicist’s embarrassment—but a social scientist’s triumph.
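
    For reference, the coefficient of determination is $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, the fraction of observed variance captured by a model’s predictions; since lower $h$ means better performance in our setup, the natural operationalization would be something like $h = 1 - R^2$.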

  7. ^

    Actually, freedom to redefine one’s problem is just one way to increase the cardinality of the set of tasks. Other ways to realize this include having a larger domain of possible tasks or having spent longer exploring tasks. Increasing the cardinality of the set of tasks undertaken increases the expected value of one’s greatest successes.

  8. ^

    To wit, a colleague once memorably quipped, “Physics is what we would have called whichever theory had been most successful.”

  9. ^

    Contra all the hate it gets, game theory represents a significant achievement precisely because it can outperform folk psychology in predicting strategic behavior. Its empirically validated predictions—like the chain‑store paradox, the winner’s curse, the traveler’s dilemma, or the volunteer’s dilemma—sometimes go beyond our intuitive notions of rational behavior.

  10. ^

    Where $6N$ comes from the initial conditions (positions + velocities) and $N$ from the masses.

  11. ^

    In particular, VGGNet and FaceNet—the first CNNs that became known for accuracy in facial recognition tasks—had just over 100 million parameters.