The history of science has tons of examples of the same thing being discovered multiple times independently; Wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity’s trajectory, then it makes sense to focus on such examples.
But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: “singular” discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly after anyways, then the discovery probably wasn’t very counterfactually impactful.
Alas, nobody seems to have made a list of highly counterfactual scientific discoveries, to complement Wikipedia’s list of multiple discoveries.
To that end: what are some examples of discoveries which nobody else was anywhere close to figuring out?
A few tentative examples to kick things off:
Shannon’s information theory. The closest work I know of (notably Nyquist) was 20 years earlier, and had none of the core ideas of the theorems on fungibility of transmission. In the intervening 20 years, it seems nobody else got importantly closer to the core ideas of information theory.
Einstein’s special relativity. Poincaré and Lorentz had the math 20 years earlier IIRC, but nobody understood what the heck that math meant. Einstein brought the interpretation, and it seems nobody else got importantly closer to that interpretation in the intervening two decades.
Penicillin. Gemini tells me that the antibiotic effects of mold had been noted 30 years earlier, but nobody investigated it as a medicine in all that time.
Pasteur’s work on the germ theory of disease. There had been both speculative theories and scattered empirical results as precedent decades earlier, but Pasteur was the first to bring together the microscope observations, theory, highly compelling empirical results, and successful applications. I don’t know of anyone else who was close to putting all the pieces together, despite the obvious prerequisite technology (the microscope) having been available for two centuries by then.
(Feel free to debate any of these, as well as others’ examples.)
Lucretius in De Rerum Natura in 50 BCE seemed to have a few that were just a bit ahead of everyone else.
Survival of the fittest (book 5):
“In the beginning, there were many freaks. Earth undertook Experiments—bizarrely put together, weird of look Hermaphrodites, partaking of both sexes, but neither; some Bereft of feet, or orphaned of their hands, and others dumb, Being devoid of mouth; and others yet, with no eyes, blind. Some had their limbs stuck to the body, tightly in a bind, And couldn’t do anything, or move, and so could not evade Harm, or forage for bare necessities. And the Earth made Other kinds of monsters too, but in vain, since with each, Nature frowned upon their growth; they were not able to reach The flowering of adulthood, nor find food on which to feed, Nor be joined in the act of Venus.
For all creatures need Many different things, we realize, to multiply And to forge out the links of generations: a supply Of food, first, and a means for the engendering seed to flow Throughout the body and out of the lax limbs; and also so The female and the male can mate, a means they can employ In order to impart and to receive their mutual joy.
Then, many kinds of creatures must have vanished with no trace Because they could not reproduce or hammer out their race. For any beast you look upon that drinks life-giving air, Has either wits, or bravery, or fleetness of foot to spare, Ensuring its survival from its genesis to now.”
Trait inheritance from both parents that could skip generations (book 4):
“Sometimes children take after their grandparents instead, Or great-grandparents, bringing back the features of the dead. This is since parents carry elemental seeds inside – Many and various, mingled many ways – their bodies hide Seeds that are handed, parent to child, all down the family tree. Venus draws features from these out of her shifting lottery – Bringing back an ancestor’s look or voice or hair. Indeed These characteristics are just as much the result of certain seed As are our faces, limbs and bodies. Females can arise From the paternal seed, just as the male offspring, likewise, Can be created from the mother’s flesh. For to comprise A child requires a doubled seed – from father and from mother. And if the child resembles one more closely than the other, That parent gave the greater share – which you can plainly see Whichever gender – male or female – that the child may be.”
Objects of different weights will fall at the same rate in a vacuum (book 2):
“Whatever falls through water or thin air, the rate Of speed at which it falls must be related to its weight, Because the substance of water and the nature of thin air Do not resist all objects equally, but give way faster To heavier objects, overcome, while on the other hand Empty void cannot at any part or time withstand Any object, but it must continually heed Its nature and give way, so all things fall at equal speed, Even though of differing weights, through the still void.”
Often I see people dismiss the things the Epicureans got right with an appeal to their lack of the scientific method, which has always seemed a bit backwards to me. In hindsight, they nailed so many huge topics that didn’t end up emerging again for millennia that it was surely not mere chance, and the fact that they successfully hit so many nails on the head without the hammer we use today indicates (at least to me) that there’s value to looking closer at their methodology.
Which was also super simple:
Step 1: Entertain all possible explanations for things, not prematurely discounting false negatives or embracing false positives.
Step 2: Look for where single explanations can explain multiple phenomena.
While we have a great methodology for testable hypotheses, the scientific method isn’t very useful for untestable fields or topics. And in those cases, I suspect better understanding and appreciation for the Epicurean methodology might yield quite successful ‘counterfactual’ results (it’s served me very well throughout the years, especially coupled with the identification of emerging research trends in things that can be evaluated with the scientific method).
A precursor to Lucretius’s thoughts on natural selection is Empedocles. We have far fewer surviving writings from him, but his view is clearly a precursor to Lucretius’s position. Lucretius himself cites & praises Empedocles on this subject.
Do you have a specific verse where you feel like Lucretius praised him on this subject? I only see that he praises him relative to other elementalists before tearing him and the rest apart for what he sees as erroneous thinking regarding their prior assertions around the nature of matter, saying:
“Yet when it comes to fundamentals, there they meet their doom. These men were giants; when they stumble, they have far to fall:”
(Book 1, lines 740-741)
I agree that he likely was a precursor to the later thinking in suggesting a compository model of life starting from pieces which combined to forms later on, but the lack of the source material makes it hard to truly assign credit.
It’s kind of like how the Greeks claimed atomism originated with the much earlier Mochus of Sidon, but we credit Democritus because we don’t have proof of Mochus at all but we do have the former’s writings. We don’t even so much credit Leucippus, Democritus’s teacher, as much as his student for the same reasons, similar to how we refer to “Plato’s theory of forms” and not “Socrates’ theory of forms.”
In any case, Lucretius oozes praise for Epicurus, comparing him to a god among men, and while he does say Empedocles was far above his contemporaries saying the same things he was, he doesn’t seem overly deferential to his positions as much as criticizing the shortcomings in the nuances of their theories with a special focus on theories of matter. I don’t think there’s much direct influence on Lucretius’s thinking around proto-evolution, even if there’s arguably plausible influence on Epicurus’s which in turn informed Lucretius.
[edit: nevermind I see you already know about the following quotes. There’s other evidence of the influence in Sedley’s book I link below]
In De Rerum Natura around line 716:
Or for a more modern translation from Sedley’s Lucretius and the Transformation of Greek Wisdom
Very cool! I used to think Hume was the most ahead of his time, but this seems like the same feat if not better.
Democritus also has a decent claim to that for being the first to imagine atoms and materialism altogether.
Though the Greeks actually credited the idea to an even earlier Phoenician, Mochus of Sidon.
Though when it comes to antiquity, credit isn’t really “first to publish” as much as “first of the last to pass the survivorship filter.”
Have you read Michel Serres’s The Birth of Physics? He suggests that the Epicureans, and Lucretius in particular, worked out a serious theory of physics that’s closer to thermodynamics and fluid mechanics than to Newtonian physics.
The most important thing, I think, is not even hitting the nail on the head, but knowing (i.e. really acknowledging) that a nail can be hit in multiple places. If you know that, the rest is just a matter of testing.
~Don’t aim for the correct solution, (first) aim for understanding the space of possible solutions
A singleton is hard to verify unless there was a long period of time after its discovery during which it was neglected, as in the case of Mendel.
Yet if your discovery is neglected in this way, the context in which it is eventually rediscovered matters as well. In Mendel’s case, his laws were rediscovered by several other scientists decades later. Mendel got priority, but it still doesn’t seem like his accomplishment had much of a counterfactual impact.
In the case of Shannon, Einstein, etc, it’s possible their fields were “ripe and ready” for what they accomplished—as perhaps evidenced by the fact that their discoveries were accepted—and that they were simply plugged in enough to their research communities during a period of faster global dissemination of knowledge that any hot-on-heels competitors never quite got a chance to publish. But I don’t know enough about these cases to be confident.
I can think of a couple cases in which I might be convinced of this sort of counterfactual impact from a scientific singleton:
All peers in a small, tight-knit research community explicitly stated none of them were even close (though even this is hard to trust—are they being gracious? how do they know their own students wouldn’t have figured it out in another year’s time?). Do we have any such testimonials for Shannon, Einstein, etc?
The discovery was actually lost, then discovered and immediately appreciated for its significance. Imagine a math proof written in a mathematician’s papers, lost on their death, rediscovered in an antique shop 40 years later, and immediately heralded as a major advance—like if we’d found a proof by Fermat of Fermat’s Last Theorem in an attic in 1950.
Money was the bottleneck. There are many places a billion dollars can be put into research. If somebody launches a billion-dollar research institute in an underfunded subject that’s been languishing for decades and the institute they founded starts coming up with major technical advances, that’s evidence it was a game-changer. Of course it’s possible that billionaire put their money into the field because they had information that the research was coming to fruition and they wanted to get in on something hot, but I probably have more trouble believing they could make such a prediction so accurately than that their money made a counterfactual impact.
A discovery can also be “counterfactually important” even if it only speeds up science a bit and is only slightly a singleton. Let’s say that every year, there’s one important scientific discovery and a million unimportant ones, and the important ones must be discovered in sequence. If you discover 2025’s important discovery in 2024, all the future important discoveries in the sequence also arrive a year earlier. If each discovery is worth $1 billion/year, then you’ve now created $1 billion of counterfactual value per year, every year, for as long as this model holds.
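The arithmetic of this toy model can be checked directly. The sketch below is only an illustration under the assumptions stated above (one important discovery per year, strictly sequential, each worth $1 billion/year once made). The function name and the ten-year horizon are made up for the example, and only discoveries from 2025's onward are counted, since earlier ones are unaffected.

```python
VALUE_PER_DISCOVERY = 1  # billions of dollars per year, as in the toy model

def yearly_value(first_discovery_year, year):
    """Value flowing in `year` if one discovery lands each year from `first_discovery_year` on."""
    discoveries_made = max(0, year - first_discovery_year + 1)
    return discoveries_made * VALUE_PER_DISCOVERY

# Baseline: the affected sequence starts in 2025.
# Counterfactual: the same sequence is pulled forward to start in 2024.
years = range(2024, 2034)
gains = [yearly_value(2024, y) - yearly_value(2025, y) for y in years]
print(gains)  # per-year gain in billions of dollars
```

Under these assumptions the gain is the same in every year, so the total counterfactual value simply grows with however long the model is taken to hold.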
Possibly Watanabe’s singular learning theory. The math is recent for math, but I think only like ’70s recent, which is long given you’re impressed by a 20-year math gap for Einstein. The first book was published in 2010, and the second in 2019, so possibly attributable to the deep learning revolution, but I don’t know of anyone making the same math, except empirical stuff like the “neuron theory” of neural network learning which I was told about by you, empirical results like those here, and high-dimensional probability (which I haven’t read, but whose cover alone indicates similar content).
I guess (but don’t know) that most people who downvote Garrett’s comment overupdated on intuitive explanations of singular learning theory, not realizing that entire books with novel and nontrivial mathematical theory have been written on it.
Isn’t singular learning theory basically just another way of talking about the breadth of optima?
Singular Learning Theory is another way of “talking about the breadth of optima” in the same sense that Newton’s Universal Law of Gravitation is another way of “talking about Things Falling Down”.
Newton’s Universal Law of Gravitation was the first highly accurate model of things falling down that generalized beyond the earth, and it is also the second-most computationally applicable model of things falling down that we have today.
Are you saying that singular learning theory was the first highly accurate model of breadth of optima, and that it’s one of the most computationally applicable ones we have?
Did I just say SLT is the Newtonian gravity of deep learning? Hubris of the highest order!
But also yes… I think I am saying that
Singular Learning Theory is the first highly accurate model of breadth of optima.
SLT tells us to look at a quantity Watanabe calls λ, which has the highly technical name ‘real log canonical threshold’ (RLCT). He proves several equivalent ways to describe it, one of which is as the (fractal) volume scaling dimension around the optima.
By computing simple examples (see Shaowei’s guide in the links below) you can check for yourself how the RLCT picks up on basin broadness.
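As a rough illustration of the volume-scaling description, here is a minimal sketch that is not taken from Shaowei’s guide. It assumes two hypothetical one-parameter toy losses, $w^2$ (regular) and $w^4$ (singular), counts grid points with loss below ε, and fits the exponent in $V(\epsilon) \propto \epsilon^{\lambda}$. The helper name `volume_exponent` and the grid sizes are choices made for this example.

```python
import numpy as np

def volume_exponent(loss, eps_values, grid):
    """Fit the exponent lambda in V(eps) ~ eps**lambda by counting grid points with loss < eps."""
    spacing = grid[1] - grid[0]
    losses = loss(grid)
    volumes = np.array([np.sum(losses < eps) * spacing for eps in eps_values])
    # The slope of log V against log eps is the volume scaling exponent.
    slope, _ = np.polyfit(np.log(eps_values), np.log(volumes), 1)
    return slope

grid = np.linspace(-1.0, 1.0, 2_000_001)   # fine 1D grid over the parameter
eps_values = np.logspace(-6, -2, 9)        # loss thresholds

lam_regular = volume_exponent(lambda w: w**2, eps_values, grid)   # flat-ish basin
lam_singular = volume_exponent(lambda w: w**4, eps_values, grid)  # broader basin
print(lam_regular, lam_singular)
```

In this toy setting the broader quartic basin yields the smaller exponent, which is the sense in which a lower λ goes with a broader basin.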
The RLCT λ is the first-order term for in-distribution generalization error and also Bayesian learning (technically the ‘Bayesian free energy’). This justifies the name ‘learning coefficient’ for λ. I emphasize that these are mathematically precise statements that have complete proofs, not conjectures or intuitions.
Knowing a little SLT will inoculate you against many wrong theories of deep learning that abound in the literature. I won’t be going into it, but suffice it to say that any paper assuming that the Fisher information metric is regular for deep neural networks or any kind of hierarchical structure is fundamentally flawed. And you can be sure this assumption is sneaked in all over the place. For instance, this is almost always the case when people talk about the Laplace approximation.
It’s one of the most computationally applicable ones we have? Yes. SLT quantities like the RLCT can be analytically computed for many statistical models of interest, correctly predict phase transitions in toy neural networks, and can be estimated at scale.
EDIT: no hype about future work. Wait and see ! :)
Clarification: The ‘derivation’ for how the RLCT predicts generalization error IIRC goes through the same flavour of argument as the one the derivation of the vanilla Bayesian Information Criterion uses. I don’t like this derivation very much. See e.g. this one on Wikipedia.
So what it’s actually showing is just that:
If you’ve got a class of different hypotheses $M$, containing many individual hypotheses $\{\theta_1, \theta_2, \ldots, \theta_N\}$.
And you’ve got a prior ahead of time that says the chance any one of the hypotheses in $M$ is true is some number $p(M) < 1$; let’s say it’s $p(M) = 0.8$ as an example.
And you distribute this total probability $p(M) = 0.8$ around the different hypotheses in an even-ish way, so $p(\theta_i, M) \propto \frac{1}{N}$, roughly.
And then you encounter a bunch of data $X$ (the training data) and find that only one or a tiny handful of hypotheses in $M$ fit that data, so $p(X|\theta_i, M) \neq 0$ for basically only one hypothesis $\theta_i$…
Then your posterior probability $p(M|X) = \frac{0.8\, p(X|M)}{0.8\, p(X|M) + 0.2\, p(X|\neg M)}$ that the hypothesis $\theta_i$ is correct will probably be tiny, scaling with $\frac{1}{N}$. If we spread your prior $p(M) = 0.8$ over lots of hypotheses, there isn’t a whole lot of prior to go around for any single hypothesis. So if you then encounter data that discredits all hypotheses in $M$ except one, that tiny bit of spread-out prior for that one hypothesis will make up a tiny fraction of the posterior, unless $p(X|\neg M)$ is really small, i.e. no hypothesis outside the set $M$ can explain the data either.
So if our hypotheses correspond to different function fits (one for each parameter configuration, meaning we’d have $2^{32k}$ hypotheses if our function fits used $k$ 32-bit floating point numbers), the chance we put on any one of the function fits being correct will be tiny. So having more parameters is bad, because the way we picked our prior means our belief in any one hypothesis goes to zero as $N$ goes to infinity.
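To see the scaling numerically, here is a small sketch of the posterior formula from the steps above. It keeps the prior at 0.8, but the two likelihood values (1.0 for the single surviving hypothesis and 0.01 for everything outside the class) are illustrative assumptions, as is the function name.

```python
def posterior_of_class(num_hypotheses, prior_class=0.8,
                       likelihood_fit=1.0, likelihood_outside=0.01):
    """Posterior p(M|X) when exactly one of N equally weighted hypotheses fits the data X."""
    # p(X|M) = sum_i p(X|theta_i, M) * p(theta_i|M); only one term is nonzero,
    # and each hypothesis gets a 1/N share of the class.
    likelihood_class = likelihood_fit / num_hypotheses
    numerator = prior_class * likelihood_class
    return numerator / (numerator + (1.0 - prior_class) * likelihood_outside)

for n in (10, 10**3, 10**6):
    print(n, posterior_of_class(n))
```

With these made-up likelihoods the posterior starts near 1 for small N and then falls off roughly like 1/N, which is the behaviour the argument relies on.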
So the Wikipedia derivation for the original vanilla posterior of model selection is telling us that having lots of parameters is bad, because it means we’re spreading our prior around exponentially many hypotheses... if we have the sort of prior that says all the hypotheses are about equally likely.
But that’s an insane prior to have! We only have 1.0 worth of probability to go around, and there’s an infinite number of different hypotheses. Which is why you’re supposed to assign prior based on K-complexity, or at least something that doesn’t go to zero as the number of hypotheses goes to infinity. The derivation is just showing us how things go bad if we don’t do that.
In summary: badly normalised priors behave badly
SLT mostly just generalises this derivation to the case where parameter configurations in our function fits don’t line up one-to-one with hypotheses.
It tells us that if we are spreading our prior around evenly over lots of parameter configurations, but exponentially many of these parameter configurations are secretly just re-expressing the same hypothesis, then that hypothesis can actually get a decent amount of prior, even if the total number of parameter configurations is exponentially large.
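A toy count makes this concrete. The sketch below is an illustration rather than anything from SLT proper: it assumes a hypothetical model in which a hypothesis is pinned down by `num_relevant` of the parameters while the remaining ones are completely redundant, and it uses exact fractions to compute the share of a uniform prior over 32-bit parameter configurations that lands on that hypothesis.

```python
from fractions import Fraction

BITS = 32  # bits per parameter, matching the 32-bit floats discussed in the text

def prior_mass(num_params, num_relevant):
    """Share of a uniform prior over configurations that lands on one hypothesis.

    Assumes (hypothetically) that the hypothesis fixes `num_relevant` parameters
    and that every setting of the remaining parameters re-expresses it.
    """
    total_configs = 2 ** (BITS * num_params)
    equivalent_configs = 2 ** (BITS * (num_params - num_relevant))
    return Fraction(equivalent_configs, total_configs)

small = prior_mass(10, 2)
large = prior_mass(1000, 2)
print(small == large)
```

In this toy model the prior mass depends only on the number of relevant parameters, so adding redundant parameters does not dilute it, even though the total number of configurations grows exponentially.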
So our prior over hypotheses in that case is actually somewhat well-behaved in that it can end up normalised properly when we take N→∞. That is a basic requirement a sane prior needs to have, so we’re at least not completely shooting ourselves in the foot anymore. But that still doesn’t show why this prior, that neural networks sort of[1] implicitly have, is actually good. Just that it’s no longer obviously wrong in this specific way.
Why does this prior apparently make decent-ish predictions in practice? That is, why do neural networks generalise well?
I dunno. SLT doesn’t say. It just tells us how the parameter prior to hypothesis prior conversion ratio works, and in the process shows us that neural networks priors can be at least somewhat sanely normalised for large numbers of parameters. More than we might have initially thought at least.
That’s all though. It doesn’t tell us anything else about what makes a Gaussian over transformer parameter configurations a good starting guess for how the universe works.
How to make this story tighter?
If people aim to make further headway on the question of why some function fits generalise somewhat and others don’t, beyond: ‘Well, standard Bayesianism suggests you should at least normalise your prior so that having more hypotheses isn’t actively bad’, then I’d suggest a starting point might be to make a different derivation for the posterior on the fits that isn’t trying to reason about $p(M)$ defined as the probability that one of the function fits is ‘true’ in the sense of exactly predicting the data. Of course none of them are. We know that. When we fit a 150 billion parameter transformer to internet data, we don’t expect going in that any of these $2^{16 \times 150 \times 10^9}$ parameter configurations will give zero loss up to quantum noise on any and all text prediction tasks in the universe until the end of time. Under that definition of $M$, which the SLT derivation of the posterior and most other derivations of this sort I’ve seen seem to implicitly make, we basically have $p(M) \approx 0$ going in! Maybe look at the Bayesian posterior for a set of hypotheses we actually believe in at all before we even see any data, like $M$ = ‘one of these models might get <1.1 average loss on holdout data sets’.
SLT in three sentences
‘You thought your choice of prior was broken because it’s not normalised right, and so goes to zero if you hand it too many hypotheses. But you missed that the way you count your hypotheses is also broken, and the two mistakes sort of cancel out. Also here’s a bunch of algebraic geometry that sort of helps you figure out what probabilities your weirdo prior actually assigns to hypotheses, though that part’s not really finished’.
SLT in one sentence
‘Loss basins with bigger volume will have more posterior probability if you start with a uniform-ish prior over parameters, because then bigger volumes get more prior, duh.’
Sorta, kind of, arguably. There’s some stuff left to work out here. For example, vanilla SLT doesn’t even actually tell you which parts of your posterior over parameters are part of the same hypothesis. It just sort of assumes that everything left with support in the posterior after training is part of the same hypothesis, even though some of these parameter settings might generalise totally differently outside the training data. My guess is that you can avoid having to establish equivalence over all possible inputs by instead checking which parameter settings give the same hidden representations over the training data, not just the same outputs.
I would not say that the central insight of SLT is about priors. Under weak conditions the prior is almost irrelevant. Indeed, the RLCT is independent of the prior under very weak nonvanishing conditions.
EDIT: I have now changed my mind about this, not least because of Lucius’s influence. I currently think Bushnaq’s padding argument suggests that the essence of SLT is that the uniform prior on codes is equivalent to the Solomonoff prior through overparameterized and degenerate codes; SLT is a way to quantitatively study this phenomenon, especially for continuous models.
The story that symmetries mean that the parameter-to-function map is not injective is true but already well-understood outside of SLT. It is a common misconception that this is what SLT amounts to.
To be sure—generic symmetries are seen by the RLCT. But these are, in some sense, the uninteresting ones. The interesting thing is the local singular structure and its unfolding in phase transitions during training.
The issue of the true distribution not being contained in the model is called ‘unrealizability’ in Bayesian statistics. It is dealt with in Watanabe’s second ‘green’ book. Nonrealizability is key to the most important insight of SLT contained in the last sections of the second to last chapter of the green book: algorithmic development during training through phase transitions in the free energy.
I don’t have the time to recap this story here.
Lucius-Alexander SLT dialogue?
I don’t think these conditions are particularly weak at all. Any prior that fulfils them is a prior that would not be normalised right if the parameter-function map were one-to-one.
It’s a kind of prior people like to use a lot, but that doesn’t make it a sane choice.
A well-normalised prior for a regular model probably doesn’t look very continuous or differentiable in this setting, I’d guess.
The generic symmetries are not what I’m talking about. There are symmetries in neural networks that are neither generic, nor only present at finite sample size. These symmetries correspond to different parametrisations that implement the same input-output map. Different regions in parameter space can differ in how many of those equivalent parametrisations they have, depending on the internal structure of the networks at that point.
I know it ‘deals with’ unrealizability in this sense, that’s not what I meant.
I’m not talking about the problem of characterising the posterior right when the true model is unrealizable. I’m talking about the problem where the actual logical statement we defined our prior and thus our free energy relative to is an insane statement to make and so the posterior you put on it ends up negligibly tiny compared to the probability mass that lies outside the model class.
But looking at the green book, I see it’s actually making very different, stat-mech style arguments that reason about the KL divergence between the true distribution and the guess made by averaging the predictions of all models in the parameter space according to their support in the posterior. I’m going to have to translate more of this into Bayes to know what I think of it.
Link(s) to your favorite proof(s)?
Also, do these match up with empirical results?
I have a cached belief that the Laplace approximation is also disproven by ensemble studies, so I don’t really need SLT to inoculate me against that. I’d mainly be interested if SLT shows something beyond that.
As I read the empirical formulas in this paper, they’re roughly saying that a network has a high empirical learning coefficient if an ensemble of models that are slightly less trained on average have a worse loss than the network.
But so that they don’t have to retrain the models from scratch, they basically take a trained model and wiggle it around using Gaussian noise while retraining it.
This seems like a reasonable way to estimate how locally flat the loss landscape is. I guess there’s a question of how much the devil is in the details; like whether you need SLT to derive an exact formula that works.
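For concreteness, here is a minimal sketch of that kind of estimate on one-parameter toy losses. It assumes the estimator has the form $\hat{\lambda} = n\beta\,(\mathbb{E}[L(w)] - L(w^*))$ with $\beta = 1/\log n$, where the expectation is over a tempered posterior proportional to $e^{-n\beta L(w)}$; that form is my reading and should be checked against the paper. It also swaps the noisy retraining for exact numerical integration on a grid, which is only feasible because the toy problem is one-dimensional. All names are invented for the example.

```python
import numpy as np

def estimated_learning_coefficient(loss, n, grid):
    """Toy estimate n*beta*(E[L] - L_min) under weights proportional to exp(-n*beta*L)."""
    beta = 1.0 / np.log(n)           # assumed inverse temperature, beta = 1/log(n)
    losses = loss(grid)
    min_loss = losses.min()          # stands in for the loss of the trained model
    weights = np.exp(-n * beta * (losses - min_loss))
    expected_loss = np.sum(weights * losses) / np.sum(weights)
    return n * beta * (expected_loss - min_loss)

grid = np.linspace(-1.0, 1.0, 200_001)
n = 10_000  # hypothetical number of training samples

lam_quadratic = estimated_learning_coefficient(lambda w: w**2, n, grid)
lam_quartic = estimated_learning_coefficient(lambda w: w**4, n, grid)
print(lam_quadratic, lam_quartic)
```

On these toy losses the quantity is smaller for the flatter quartic basin, consistent with reading it as a local flatness measure. Whether the sampling-based version behaves as cleanly at scale is a separate question that this sketch does not address.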
I guess I’m still not super sold on it, but on reflection that’s probably partly because I don’t have any immediate need for computing basin broadness. Like I find the basin broadness theory nice to have as a model, but now that I know about it, I’m not sure why I’d want/need to study it further.
There was a period where I spent a lot of time thinking about basin broadness. I guess I eventually abandoned it because I realized the basin was built out of a bunch of sigmoid functions layered on top of each other, but the generalization was really driven by the neural tangent kernel, which in turn is mostly driven by the Jacobian of the network outputs for the dataset as a function of the weights, which in turn is mostly driven by the network activations. I guess it’s plausible that SLT has the best quantities if you stay within the basin broadness paradigm. 🤔
All proofs are contained in Watanabe’s standard text; see here:
https://www.cambridge.org/core/books/algebraic-geometry-and-statistical-learning-theory/9C8FD1BDC817E2FC79117C7F41544A3A
It’s measuring the volume of points in parameter space with loss <ϵ when ϵ is infinitesimal.
This is slightly tricky because it doesn’t restrict itself to bounded parameter spaces,[1] but you can fix it with a technicality by considering how the volume scales with ϵ instead.
In real networks trained with finite amounts of data, you care about the case where ϵ is small but finite, so this is ultimately inferior to just measuring how many configurations of floating point numbers get loss <ϵ, if you can manage that.
I still think SLT has some neat insights that helped me deconfuse myself about networks.
For example, like lots of people, I used to think you could maybe estimate the volume of basins with loss <ϵ using just the eigenvalues of the Hessian. You can’t. At least not in general.
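A small numerical check of why the Hessian alone is not enough. This is a sketch with two hypothetical two-parameter losses, $w_1^2 + w_2^4$ and $w_1^2 + w_2^6$, chosen because they share (numerically) the same Hessian at the minimum; the grid resolution and the threshold ε = 0.01 are arbitrary choices for the example.

```python
import numpy as np

def loss_a(w1, w2):
    return w1**2 + w2**4

def loss_b(w1, w2):
    return w1**2 + w2**6

def hessian_at_origin(loss, h=1e-3):
    """Central finite-difference Hessian of a two-parameter loss at (0, 0)."""
    f = lambda v: loss(v[0], v[1])
    steps = np.eye(2) * h
    hess = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            hess[i, j] = (f(steps[i] + steps[j]) - f(steps[i] - steps[j])
                          - f(-steps[i] + steps[j]) + f(-steps[i] - steps[j])) / (4 * h * h)
    return hess

# Volume of the region with loss < eps, by counting cells on a 2D grid.
axis = np.linspace(-1.0, 1.0, 2001)
w1, w2 = np.meshgrid(axis, axis)
cell_area = (axis[1] - axis[0]) ** 2

def basin_volume(loss, eps):
    return np.sum(loss(w1, w2) < eps) * cell_area

hess_a, hess_b = hessian_at_origin(loss_a), hessian_at_origin(loss_b)
vol_a, vol_b = basin_volume(loss_a, 1e-2), basin_volume(loss_b, 1e-2)
print(np.round(hess_a, 4), np.round(hess_b, 4))
print(vol_a, vol_b)
```

Both Hessians come out as approximately diag(2, 0), yet the two basins have clearly different volumes at the same ε, so no formula that uses only the Hessian eigenvalues can distinguish them.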
Like the floating point numbers in a real network, which can only get so large. A prior of finite width over the parameters also effectively bounds the space
Second most? What’s the first? Linearization of a Newtonian V(r) about the earth’s surface?
Yes.
Scott Garrabrant’s discovery of Logical Inductors.
I remember hearing about the paper from a friend and thinking it couldn’t possibly be true in a non-trivial sense. To someone with even a modicum of experience in logic, a computable procedure assigning probabilities to arbitrary logical statements in a natural way is sure to hit a no-go diagonalization barrier.
Logical Inductors get around the diagonalization barrier in a very clever way. I won’t spoil how they do it here. I recommend that the interested reader watch Andrew Critch’s talk on Logical Induction.
It was the main thing that convinced me that MIRI != clowns but were doing substantial research.
The Logical Induction paper has a fairly thorough discussion of previous work. Relevant previous work to mention is de Finetti’s on betting and probability, previous work by MIRI & associates (Herreshoff, Taylor, Christiano, Yudkowsky...), the work of Shafer-Vovk on financial interpretations of probability & Shafer’s work on aggregation of experts. There is also a field which doesn’t have a clear name that studies various forms of expert aggregation. Overall, my best judgement is that nobody else was close before Garrabrant.
The Antikythera artifact: a Hellenistic Computer.
You probably learned heliocentrism = good, geocentrism = bad, Copernicus-Kepler-Newton = good, epicycles = bad. But geocentric models and heliocentric models are equivalent; it’s just that Kepler & Newton’s laws are best expressed in a heliocentric frame. However, the raw observational data is actually gathered in a geocentric frame. Geocentric models stay closer to the data in some sense.
Epicyclic theory is now considered bad, an example of people refusing to see the light of scientific revolution. But actually, it was an enormous innovation. Using high-precision gearing, epicycles could actually be implemented on a (Hellenistic) computer, implicitly doing Fourier analysis to predict the motion of the planets. Astounding.
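The link between epicycles and Fourier analysis can be sketched in a few lines. The example below is a toy, not a model of the actual mechanism: it assumes circular, coplanar orbits with made-up radii and periods, builds the planet's position as seen from Earth, and then recovers the two circles (deferent and epicycle) from the discrete Fourier transform of that geocentric path.

```python
import numpy as np

# Toy circular, coplanar orbits. Radii and periods are illustrative, not real data.
EARTH_RADIUS, EARTH_PERIOD = 1.0, 1.0
PLANET_RADIUS, PLANET_PERIOD = 1.5, 2.0

num_samples = 256
window = 2.0  # a whole number of both periods, so the FFT bins line up exactly
t = np.arange(num_samples) * window / num_samples

# Heliocentric positions as points in the complex plane.
earth = EARTH_RADIUS * np.exp(2j * np.pi * t / EARTH_PERIOD)
planet = PLANET_RADIUS * np.exp(2j * np.pi * t / PLANET_PERIOD)

# Geocentric position: a sum of two uniformly rotating circles.
geocentric = planet - earth

# Each Fourier coefficient is one circle: its magnitude is the radius,
# its index is the number of turns per window.
coefficients = np.fft.fft(geocentric) / num_samples
radii = np.abs(coefficients)
dominant = np.argsort(radii)[-2:][::-1]
print(dominant, radii[dominant])
```

In this idealised setting exactly two coefficients are non-negligible, one per circle. Non-circular orbits would spread weight onto further terms, which corresponds to needing additional epicycles.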
A Roman author (Pliny the Elder?) describes a similar device in the possession of Archimedes of Rhodes. It seems likely that Archimedes or one or more close contemporaries designed the artifact and that several were made in Rhodes.
Actually, since we’re on the subject of scientific discoveries
Discovery & description of the complete Antikythera mechanism. The actual artifact that was found is just a rusty piece of bronze. Nobody knew how it worked. There were several sequential discoveries over multiple decades that eventually led to the complete solution of the mechanism. The final pieces were found just a few years ago. An astounding scientific achievement. Here is an amazing documentary on the subject:
I think Diffractor’s post shows that logical induction does hit a certain barrier, which isn’t quite diagonalization, but seems to me about as troublesome:
There’s unpublished work about a slightly weaker logical induction criterion which doesn’t have this property (there exist constant-distribution inductors in this weaker sense), but which is provably equivalent to the regular LIC whenever the inductor is computable.[1] To my eye, the weaker criterion is more natural. The basic idea is that this weird trader shouldn’t count as raking in the cash. The regular LIC (we can call it “strong LIC” or SLIC) counts traders as exploiting the market if there is a sequence of worlds in which their wealth grows unboundedly. This allows for the trick you quote: buying up larger and larger piles of sentences in diminishingly-probable worlds counts as exploiting the market.
The weak LIC (WLIC) says instead that traders have to actually make the money in order to count as exploiting the market.
Thus the limit of a logical inductor can count as a (weak) logical inductor, just not a computable one.
Roughly speaking. This is not quite an adequate description of the theorem.
Antonie van Leeuwenhoek, known as the Father of Microbiology, made the first microscopes capable of seeing microorganisms and is credited as the person who discovered them. He kept his lensmaking techniques secret, however, and microscopes capable of the same magnification didn’t become generally available until many, many years later.
Yes, beautiful example! Van Leeuwenhoek was the one-man ASML of the 17th century. In this case, we actually have evidence of the counterfactual impact, as other lensmakers trailed van Leeuwenhoek by many decades.
It’s plausible that high-precision measurement and fabrication is the key bottleneck in most technological and scientific progress; it’s difficult to oversell the importance of van Leeuwenhoek.
If you’ll allow linguistics, Pāṇini was two and a half thousand years ahead of modern descriptive linguists.
Maybe Galois with group theory? He died in 1832, but his work was only published in 1846, upon which it kicked off the development of group theory, e.g. with Cayley’s 1854 paper defining a group. Claude writes that there was not much progress in the intervening years:
Wegener’s theory of continental drift was decades ahead of its time. He first published it in the 1910s, but plate tectonics didn’t take over until the 1960s. His theory was wrong in important ways, but still.
I sometimes had this feeling from Conway’s work; in particular, combinatorial game theory and surreal numbers feel to me closer to mathematical invention than mathematical discovery. These kinds of things are also often “leaf nodes” on the tree of knowledge, not leading to many followup discoveries, so you could say their counterfactual impact is low for that reason.
In engineering, the best example I know is vulcanization of rubber. It has had a huge impact on today’s world, but Goodyear developed it by working alone for decades, when nobody else was looking in that direction.
Not inconceivable, I would even say plausible, that the impact of surreal numbers & combinatorial game theory is still in the future.
Pasteur had (also highly “counterfactual”) help I think! Ignaz Semmelweis worked in this maternity ward where the women & babies kept dying. The hospital had opened up some investigations over the years as to the cause of death but kept closing them with garbage explanations. He went somewhere else for a while and when he got back he noticed that the death numbers were down in his absence. Then he noticed his hands smelled like death after one of his routine autopsies and he was about to go plunge them in some poor mother! He had washed them but just with regular soap. If he put some bleach in the washwater then his hands didn’t stink. He connected the dots. He had killed hundreds of mothers & babies but wrote a book about it anyway and thereby popularized disinfection (and strongly suggested the root cause of disease).
Probably the main reason that germ theory took so long to work out is that the people with the right evidence were too guilty and ashamed to share it.
That the earth is a sphere:
Thus begins “The Clash Between the Jesuits and Traditional Chinese Square-Earth Cosmology”. The article tells the dramatic story of how some Jesuits tried to establish the spherical-Earth theory in 16th century China, where it was still unknown, partly by creating an elaborate world map to gain the trust of the emperor.
They were ultimately not successful, and the spherical-Earth theory only gained influence in China when Western texts were increasingly translated into Chinese more than two thousand years after the theory was originally invented.
Which makes it a good candidate for one of the most non-obvious / counterfactual theories in history.
I find this very hard to believe. Shouldn’t Chinese merchants have figured out eventually, traveling long distances using maps, that the Earth was a sphere? I wonder whether the “scholars” of ancient China actually represented the state-of-the-art practical knowledge that the Chinese had.
Nevertheless, I don’t think this is all that counterfactual. If you’re obsessed with measuring everything, and like to travel (like the Greeks), I think eventually you’ll have to discover this fact.
Merchants were a lot weaker in China than in Europe. Chinese merchants also did a lot less sea voyages due to geography.
If a bunch of low-status merchants believed that the Earth is a sphere it might not have influenced Chinese high-class beliefs in the same way as beliefs of political powerful merchants in Europe.
I see no reason to doubt that the article is accurate. Why would Chinese scholars completely miss the theory if it was obvious among merchants? There should in any case exist some records of it, some maps. Yet none exist. And why would it even be obvious that the Earth is a sphere from long distance travel alone?
I don’t think this makes sense. If the Chinese didn’t reinvent the theory in more than two thousand years, this makes it highly “counterfactual”. The longer a theory isn’t reinvented, the less obvious it must be.
Maybe it’s the other way around, and it’s the Chinese elite who was unusually and stubbornly conservative on this, trusting the wisdom of their ancestors over foreign devilry (would be a pretty Confucian thing to do). The Greeks realised the Earth was round from things like seeing sails appear over the horizon. Any sailing peoples thinking about this would have noticed sooner or later.
Kind of a long shot, but did Polynesian people have ideas on this, for example?
There is a large difference between sooner and later. Highly non-obvious ideas will be discovered later, not sooner. The fact that China didn’t rediscover the theory in more than two thousand years means that the ability to sail the ocean didn’t make it obvious.
As far as we know, nobody did, except for early Greece. There is some uncertainty about India, but these sources are dated later and from a time when there was already some contact with Greece, so they may have learned it from them.
Well, it’s hard to tell, because most other civilizations at the required level of wealth to discover this (by which I mean both sailing and enough surplus to have people who worry about the shape of the Earth at all) could one way or another have learned it via osmosis from Greece. If you only have essentially two examples, how do you tell whether it was the one who discovered it who was unusually observant rather than the one who didn’t who was unusually blind? But it’s an interesting question; it might indeed be a relatively accidental thing which for some reason was accepted sooner than you would have expected (after all, sails disappearing could be explained by an Earth that’s merely dome-shaped; the strongest evidence for a completely spherical shape was probably the fact that lunar eclipses always feature a perfectly disc-shaped shadow, and even that requires interpreting eclipses correctly, and having enough of them in the first place).
I don’t buy this, the curvedness of the sea is obvious to sailors, e.g. you see the tops of islands long before you see the beach, and indeed to anyone who has ever swum across a bay! Inland peoples might be able to believe the world is flat, but not anyone with boats.
What’s more likely: You being wrong about the obviousness of the sphere Earth theory to sailors, or the entire written record (which included information from people who had extensive access to the sea) of two thousand years of Chinese history and astronomy somehow omitting the spherical Earth theory? Not to speak of other pre-Hellenistic seafaring cultures which also lack records of having discovered the sphere Earth theory.
Set theory is the prototypical example I usually hear about. From Wikipedia:
An example that’s probably *not* a highly counterfactual discovery is the discovery of DNA as the inheritance particle by Watson & Crick [? Wilkins, Franklin, Gosling, Pauling...].
I had great fun reading Watson’s scientific-literary fiction The Double Helix. Watson and Crick are very clear that competitors were hot on their heels, a matter of months, a year perhaps.
EDIT: thank you nitpickers. I should have said structure of DNA, not its role as the carrier of inheritance.
Nitpick: you’re talking about the discovery of the structure of DNA; it was already known at that time to be the particle which mediates inheritance IIRC.
I would say “the thing that contains the inheritance particles” rather than “the inheritance particle”. “Particulate inheritance” is a technical term within genetics and it refers to how children don’t end up precisely with the mean of their parents’ traits (blending inheritance), but rather with some noise around that mean, which particulate inheritance asserts is due to the genetic influence being separated into discrete particles with the children receiving random subsets of their parent’s genes. The significance of this is that under blending inheritance, the genetic variation between organisms within a species would be averaged away in a small number of generations, which would make evolution by natural selection ~impossible (as natural selection doesn’t work without genetic variation).
Peter J. Bowler suggests that evolution by natural selection is this in his book “Darwin Deleted”: given that in real life there was an “eclipse of Darwinism”, he suggests that without Darwin, various non-Darwinian theories of evolution would have been developed further, and evolution by natural selection would have come rather late.
Anecdotally (I couldn’t find confirmation after a few minutes of searching), I remember hearing a claim about Darwin being particularly ahead of the curve with sexual selection & mate choice. That without Darwin it might have taken decades for biologists to come to the same realizations.
Don’t forget Wallace!
Bowler’s comment on Wallace is that his theory was not worked out to the extent that Darwin’s was, and besides I recall that he was a theistic evolutionist. Even with Wallace, there was still a plethora of non-Darwinian evolutionary theories before and after Darwin, and without the force of Darwin’s version, it’s not likely or necessary that Darwinism wins out.
Also
And he points out that minus Darwin, nobody would have paid as much attention to Wallace.
Bowler also points out that Wallace didn’t really form the connection between both natural and artificial selection.
In some of his books on evolution, Dawkins also said very similar things when commenting on Darwin vs Wallace, basically saying that there’s no comparison, Darwin had a better grasp of things, justified it better and more extensively, didn’t have muddled thinking about mechanisms, etc.
I mean, to some extent, Dawkins isn’t a historian of science, presentism, yadda yadda, but from what I’ve seen he’s right here. Not that Wallace is somehow worse, given that of all the people out there he was certainly closer than the rest. That’s about it.
Here are some candidates from Claude and Gemini (Claude Opus seemed considerably better than Gemini Pro for this task). Unfortunately they are quite unreliable: I’ve already removed many examples from this list which I already knew to have multiple independent discoverers (like e.g. CRISPR and general relativity). If you’re familiar with the history of any of these enough to say that they clearly were/weren’t very counterfactual, please leave a comment.
Noether’s Theorem
Mendel’s Laws of Inheritance
Gödel’s First Incompleteness Theorem (Claude mentions von Neumann as an independent discoverer for the Second Incompleteness Theorem)
Feynman’s path integral formulation of quantum mechanics
Onnes’ discovery of superconductivity
Pauling’s discovery of the alpha helix structure in proteins
McClintock’s work on transposons
Observation of the cosmic microwave background
Lorenz’s work on deterministic chaos
Prusiner’s discovery of prions
Yamanaka factors for inducing pluripotency
Langmuir’s adsorption isotherm (I have no idea what this is)
Mendel’s Laws seem counterfactual by about 30 years, based on partial re-discovery taking that much time. His experiments are technically something which someone with basic maths could have done basically any time in the last few thousand years.
I buy this argument.
I would guess that Lorenz’s work on deterministic chaos does not get many counterfactual discovery points. He noticed the chaos in his research because of his interactions with a computer doing simulations. This happened in 1961. Now, the question is, how many people were doing numerical calculations on computers in 1961? It could plausibly have been ten times as many by 1970. A hundred times as many by 1980? Those numbers are obviously made up but the direction they gesture in is my point. Chaos was a field that was made ripe for discovery by the computer. That doesn’t take anything away from Lorenz’s hard work and intelligence, but it does mean that if he had not taken the leap we can be fairly confident someone else would have. Put another way: If Lorenz is assumed to have had a high counterfactual impact, then it becomes a strange coincidence that chaos was discovered early in the history of computers.
I buy this argument.
Feynman’s path integral formulation can’t be that counterfactually large. It’s mathematically equivalent to Schwinger’s formulation and was done several years earlier by Tomonaga.
I don’t buy mathematical equivalence as an argument against, in this case, since the whole point of the path integral formulation is that it’s mathematically equivalent but far simpler conceptually and computationally.
Idk the Nobel prize committee thought it wasn’t significant enough to give out a separate prize 🤷
I am not familiar enough with the particulars to have an informed opinion. My best guess is that in general statements to the effect of “yes X also made scientific contribution A but Y phrased it better” overestimate the actual scientific counterfactual impact of Y. They generically give too much weight to how well outsiders can understand the work, vis-a-vis specialists/insiders who have enough hands-on experience that the value-add of a simpler/neater formalism is not that high (or is even a distraction).
The reason Dick Feynman is so much more well-known than Schwinger and Tomonaga surely must not be entirely unrelated to the magnetic charisma of Dick Feynman.
I’ve heard an argument that Mendel was actually counter-productive to the development of genetics. That if you go and actually study peas like he did, you’ll find they don’t make perfect Punnett squares, and from the deviations you can derive recombination effects. The claim is he fudged his data a little in order to make it nicer, then this held back others from figuring out the topological structure of genotypes.
I’ve heard, in this context, the partial counterargument that he was using traits which are a little fuzzy around the edges (where is the boundary between round and wrinkled?) and that he didn’t have to intentionally fudge his data in order to get results that were too good, just be not completely objective in how he was determining them.
Of course, this sort of thing is why we have double-blind tests in modern times.
Observation of the cosmic microwave background was a simultaneous discovery, according to James Peebles’ Nobel lecture. If I’m understanding this right, Bob Dicke’s group at Princeton was already looking for the CMB based on a theoretical prediction of it, and was doing experiments to detect it, with relatively primitive equipment, when the Bell Labs publication came out.
Fun question!
IMO Edison and Shannon are both strong candidates for quite different reasons.
Edison solved a bunch of necessary problems in one go when building a working, commercializable lighting system. He did this in an area where many others had only chipped away at corners of the problem. He was not the first to the area...but I don’t think there are any strong claims that the area would have come along nearly as quickly if not for him/his team. I talk about this in-depth in a Works in Progress piece on Edison as an exceptional technical entrepreneur.
As far as Shannon goes, I’m not saying he initially published on his two major discoveries much earlier than others would have initially published...but Shannon had a sort of uncanny ability to open and largely close a sub-field all in one go. This is rare in scientific branch creation. Usually a process like this takes something like 5-10 people something like 5-20 years to do. My FreakTakes piece on the early years of molecular biology gives a sort of blow-by-blow of what this often looks like. Shannon’s excellence helped circumvent a lot of that. So IMO the thoroughness of his thinking was a huge time-saver.
The Buddha with dependent origination. I think it says somewhere that most of the stuff in Buddhism was from before the Buddha’s time. These are things such as breath-based practices and loving kindness, among others. He had one revelation that basically made the entire enlightenment thing, which is called dependent origination.*
*At least according to my meditation teacher. I believe him, since he was a neuroscientist and did an astrophysics master’s at Berkeley before he left for India, so he’s got some pretty good epistemics.
It basically states that any system is only true based on another system being true. It has some really cool parallels to Gödel’s Incompleteness Theorem but on a metaphysical level. Emptiness of emptiness and stuff. (On a side note I can recommend TMI + Seeing That Frees if you want to experience some radical shit there.)
For anyone wondering, TMI almost certainly stands for “The Mind Illuminated”, a book by John Yates, Matthew Immergut, and Jeremy Graves. Full title: The Mind Illuminated: A Complete Meditation Guide Integrating Buddhist Wisdom and Brain Science for Greater Mindfulness
Hi Jonas! Would you mind saying more about TMI + Seeing That Frees? Thanks!
Sure! Anything more specific that you want to know about? Practice advice or more theory?
Thanks :) Uh, good question. Making some good links? Have you done much nondual practice? I highly recommend Loch Kelly :)
Maybe Hanson et al.’s Grabby aliens model? @Anders_Sandberg said that some N years before that (I think more or less at the time of working on Dissolving the Fermi Paradox), he “had all of the components [of the model] on the table” and it just didn’t occur to him that they can be composed in this way. (personal communication, so I may be misremembering some details). Although it’s less than 10 years, so...
Speaking of Hanson, prediction markets seem like a more central example. I don’t think the idea was [inconceivable in principle] 100 years ago.
ETA: I think Dissolving the Fermi Paradox may actually be a good example. Nothing in principle prohibited people puzzling about “the great silence” from using probability distributions instead of point estimates in the Drake equation. Maybe it was infeasible to compute this back in the 1950s/60s, but I guess it should have been doable in the 2000s, and still, the paper was published only in 2017.
Here’s a document called “Upper and lower bounds for Alien Civilizations and Expansion Rate” I wrote in 2016. Hanson et al.’s Grabby Aliens paper was submitted in 2021.
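As a rough illustration of what "probability distributions instead of point estimates" means here, below is a minimal Monte Carlo sketch. Everything in it is an assumption for illustration: the log-uniform ranges are made up by me (they are not the ones used in the paper), and the function names are hypothetical. It compares the point estimate you get by multiplying the mean of each factor with the fraction of sampled galaxies that contain fewer than one civilization.

```python
import math
import random

# Hypothetical log10 ranges for the Drake factors (illustrative only, not from the paper).
LOG10_RANGES = {
    "R_star": (0.0, 2.0),   # star formation rate per year
    "f_p": (-1.0, 0.0),     # fraction of stars with planets
    "n_e": (-1.0, 0.0),     # habitable planets per system
    "f_l": (-10.0, 0.0),    # fraction where life arises
    "f_i": (-3.0, 0.0),     # fraction where intelligence arises
    "f_c": (-2.0, 0.0),     # fraction that become detectable
    "L": (2.0, 8.0),        # detectable lifetime in years
}

def log_uniform_mean(lo, hi):
    """Mean of a variable whose log10 is uniformly distributed on [lo, hi]."""
    return (10 ** hi - 10 ** lo) / ((hi - lo) * math.log(10))

def point_estimate(ranges):
    """Product of per-factor means: the classic single-number Drake estimate."""
    return math.prod(log_uniform_mean(lo, hi) for lo, hi in ranges.values())

def probability_empty(ranges, samples=20000, seed=0):
    """Fraction of sampled parameter draws that give fewer than one civilization."""
    rng = random.Random(seed)
    empty = 0
    for _ in range(samples):
        # Multiplying the factors is the same as adding their log10 values.
        log10_n = sum(rng.uniform(lo, hi) for lo, hi in ranges.values())
        if log10_n < 0:
            empty += 1
    return empty / samples
```

With these made-up ranges, `point_estimate(LOG10_RANGES)` comes out as a large number of civilizations, while `probability_empty(LOG10_RANGES)` says that most draws leave the galaxy empty. That gap between the mean and the bulk of the distribution is the qualitative effect at issue; the specific numbers mean nothing.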
The draft is very rough. Claude summarizes it thusly:
The draft was never finished as I felt the result wasn’t significant enough. To be clear, the Hanson-Martin-McCarter-Paulson paper contains more detailed models and much more refined statistical analysis. I didn’t pursue these ideas further.
I wasn’t part of the rationality/EA/LW community. Nobody I talked to was interested in these questions.
Let this be a lesson for young people: Don’t assume. Publish! Publish in journals. Publish on LessWrong. Make something public even if it’s not in a journal!
The Iowa Election Markets were roughly contemporaneous with Hanson’s work. They are often co-credited.
Green fluorescent protein (GFP). A curiosity-driven marine biology project (how do jellyfish produce light?) that was later adapted into an important and widely used tool in cell biology. You splice the GFP gene onto another gene, and you’ve effectively got a fluorescent tag so you can see where the protein product is in the cell.
Jellyfish luminescence wasn’t exactly a hot field; I don’t know of any near-independent discoveries of GFP. However, when people were looking for protein markers visible under a microscope, multiple labs tried GFP simultaneously, so it was determined by that point. If GFP hadn’t been discovered, would they have done marine biology as a subtask, or just used their next best option?
Fun fact: The guy who discovered GFP was living near Nagasaki when it was bombed. So we can consider the hypothetical where he was visiting the city that day.
Grothendieck seems to have been an extremely singular researcher; various of his discoveries would likely have been significantly delayed without him. His work on sheaves is mind-bending the first time you see it and was seemingly ahead of its time.
Here are some reflections I wrote on the work of Grothendieck and relations with his contemporaries & predecessors.
Take it with a grain of salt: it is probably too deflationary of Grothendieck’s work, pushing back on mythical narratives common in certain mathematical circles where Grothendieck is held to be a Christ-like figure. I pushed back on that a little. Nevertheless, it would probably not be an exaggeration to say that Grothendieck’s purely scientific contributions [as opposed to real-life consequences] were comparable to those of Einstein.
I have previously used special relativity as an example to the opposite. It seems to me that the Michelson-Morley experiment laid the groundwork and all alternatives were more or less rejected by the time special relativity was formulated. This could be hindsight bias though.
If Nobel Prizes are any indicator, then the photoelectric effect is probably more counterfactually impactful than special relativity.
I think it’s worth noting that small delays in discovering new things would, in aggregate, be very impactful. On average, how far apart are the duplicate discoveries? If we pushed all the important discoveries back a couple of years by eliminating whoever was in fact historically first, then the result is a world that is perpetually several years behind our own in everything. This world is plausibly 5-10% poorer for centuries, maybe more if a few key hard steps have longer delays, or if the most critical delays happened a long time ago and were measured in decades or centuries instead.
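A back-of-the-envelope check of the "5-10% poorer" figure, as a sketch under two assumptions that are mine rather than the comment's: the economy grows at a steady rate (2% per year in the example), and a delay of d years in every discovery simply shifts the whole growth path back by d years. The function name is hypothetical.

```python
def income_shortfall(delay_years, growth_rate):
    """Fraction by which the delayed world is poorer than ours at any given date.

    Assumes steady exponential growth, so the delayed world's income equals
    ours from `delay_years` ago: ratio = (1 + growth_rate) ** (-delay_years).
    """
    return 1.0 - (1.0 + growth_rate) ** (-delay_years)

# At an assumed 2% annual growth, a 3-year lag and a 5-year lag bracket the quoted range.
three_year_gap = income_shortfall(3, 0.02)
five_year_gap = income_shortfall(5, 0.02)
```

Under those assumptions a 3-year lag costs a bit under 6% and a 5-year lag a bit over 9%, which is consistent with the range above. Faster growth or longer delays push the number up quickly.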
Special relativity is not such a good example here when compared to general relativity, which was much further ahead of its time. See, for example, this article: https://bigthink.com/starts-with-a-bang/science-einstein-never-existed/
Regarding special relativity, Einstein himself said:[1]
As for general relativity, the ideas and the mathematics required (Riemannian geometry) were much more obscure and further afield. The only people who came close, Nordström and Hilbert, arguably did so because they were directly influenced by Einstein’s ongoing work on general relativity (not just special relativity).
https://www.quora.com/Without-Einstein-would-general-relativity-be-discovered-by-now
https://en.m.wikipedia.org/wiki/Relativity_priority_dispute
First, your non-standard use of the term “counterfactual” is jarring, though, as I understand, it is somewhat normalized in your circles. “Counterfactual” unlike “factual” means something that could have happened, given your limited knowledge of the world, but did not. What you probably mean is “completely unexpected”, “surprising” or something similar. I suspect you got this feedback before.
Sticking with physics. Galilean relativity was completely against the Aristotelian grain. More recently, the singularity theorems of Penrose and Hawking unexpectedly showed that black holes are not just a mathematical artifact, but a generic feature of the world. A whole slew of discoveries, experimental and theoretical, in quantum mechanics were almost all against the grain. Probably the simplest and yet the hardest to conceptualize was Bell’s theorem.
Not my field, but in economics, Adam Smith’s discovery of what Scott Alexander later named Moloch was a complete surprise, as I understand it.
I think it means the more specific “a discovery that if it counterfactually hadn’t happened, wouldn’t have happened another way for a long time”. I think this is roughly the “counterfactual” in “counterfactual impact”, but I agree not the more widespread one.
It would be great to have a single word for this that was clearer.
Maybe “counterfactually robust” is an OK phrase?