AI Impacts states: “32% of trends we investigated saw at least one large, robust discontinuity”. If I take my 12 out of 50 “big” discontinuities and assume that one third would be found to be “large and robust” by a more thorough investigation, one would expect that 4 out of the 50 technologies will display a “large and robust discontinuity” in the sense which AI Impacts takes those words to mean.
This seems reasonable, but I think it may be too high. For the AI Impacts discontinuities investigation, the process for selecting cases went something like this:
1) All technologies
   ↓ (select those which someone brings to our attention, thinking they might be discontinuous)
2) Discontinuity candidates
   ↓ (do some cursory research to see if they’re worth investigating in more depth)
3) Strong discontinuity candidates
   ↓ (get data, calculate, and keep those with > 10 years of unexpected progress)
4) Discontinuities
   ↓ (select those that are particularly large and not highly sensitive to methodology)
5) Large and robust discontinuities
That 32% statistic is the fraction of 3) that made it all the way to 5). It seems like your investigation corresponds roughly to calculating the fraction of 1) that makes it to 3). This is useful, because, as you explain, the AI Impacts investigation did not give us evidence about the overall prevalence of discontinuities. Of your list, 8 were things that we also investigated, all of which you list as plausibly big discontinuities, except one marked ‘maybe’ and ‘medium’. Of those, we found 3 to be discontinuous, and all three were deemed large and robust. So that’s 42% (3 of the remaining 7) if we ignore the ‘maybe’.
This suggests that the 32% is too low. But the 42% above is the fraction of categories of technologies and lines of investigation that seemed very promising, and which showed at least one discontinuity somewhere. Some of these categories are quite broad (robotics), while some are rather narrow (condoms). Within those categories are many technologies, not all of which showed discontinuous progress. Supposing that all categories are equally likely to produce discontinuities, this suggests that the base rate is substantially lower, since even if discontinuities are very rare, we might expect most broad categories to contain at least one.
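To see why breadth matters here, consider a toy calculation (the 5% per-technology rate and the category sizes below are illustrative assumptions, not numbers from the investigation):

```python
def p_at_least_one(p_per_tech, n_techs):
    """P(a category contains >= 1 discontinuity), assuming each of its
    n_techs technologies is independently discontinuous with prob p_per_tech."""
    return 1 - (1 - p_per_tech) ** n_techs

# Even a rare per-technology discontinuity rate makes a hit in a broad
# category fairly likely, while a narrow category stays unlikely.
print(p_at_least_one(0.05, 20))  # broad category, ~0.64
print(p_at_least_one(0.05, 3))   # narrow category, ~0.14
```

So a 42% rate of "at least one discontinuity per category" is compatible with a much lower per-technology base rate.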
It is plausible to me that it is better to think of AI as one of these broad categories, especially since the potential discontinuities you’re flagging do seem to correspond more to advances that substantially changed the field, and less to things that only advance one small corner of it. In that case, the 42% base rate is about right. But I’m not sure if this is the right approach, from an AI safety standpoint, since many discontinuities might be largely irrelevant while others are not. A discontinuity in performance for diagnosing cancer seems unconcerning, while a discontinuity in predicting geopolitics seems more concerning.
Nonetheless, this is interesting! I particularly like the idea of having multiple people do similar investigations, to see if the results converge. Another possibility is to spot check some of them and see if they hold up to scrutiny.
Note that 8 is a pretty small sample, so if the true base rate were 32%, you’d still expect to see exactly 2 discontinuities (8 choose 2) * 0.32^2 * 0.68^6 = 0.283 = 28% of the time, vs (8 choose 3) * 0.32^3 * 0.68^5 = 0.266 = 27% of the time for 3 discontinuities. So I take the 3⁄8 as indicating that I’m in the right ballpark, rather than sweating too much over it.
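The binomial arithmetic above can be checked with a short script (a sketch; the 0.32 base rate and n = 8 sample size are taken from the discussion above):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# With a true base rate of 32% and 8 investigated categories,
# seeing 2 vs 3 discontinuities is nearly equally likely.
p2 = binom_pmf(2, 8, 0.32)  # ~0.283
p3 = binom_pmf(3, 8, 0.32)  # ~0.266
print(f"P(2 of 8) = {p2:.3f}, P(3 of 8) = {p3:.3f}")
```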
The broad vs specific categories idea is an interesting hypothesis, but note that it could also be confounded with many other factors. For example, broader fields could attract more talent and funds, and it might be harder to achieve proportional improvement in a bigger domain (e.g., it might be easier to improve candle making by 1% than to improve aviation by 1%). That is, you can have the effect that you have more points of leverage (more subfields, as you point out), but that each point of leverage affects the result less (an x% discontinuity in a subfield corresponds to much less than x% in the super-field).
If I look at it in my database, categorizing:
Broad: aviation, ceramics, cryptography, film, furniture, glass, nuclear, petroleum, photography, printing, rail transport, robotics, water supply and sanitation, artificial life, bladed weapons, aluminium, automation, perpetual motion machines, nanotechnology, timekeeping devices, wind power
Not broad: motorcycle, multitrack recording, Oscilloscope history, paper, polymerase chain reaction, portable gas stove, roller coaster, steam engine, telescope, cycling, spaceflight, rockets, calendars, candle making, chromatography, condoms, diesel car, hearing aids, radar, radio, sound recording, submarines, television, automobile, battery, telephone, transistor, internal combustion engine, manufactured fuel gases.
Then broad categories get 24% big discontinuities, 29% medium, 14% small, and 33% no discontinuities. In comparison, less broad categories get 24% big, 10% medium, 28% small, and 38% no discontinuities; i.e., no effect at the big-discontinuity level but a noticeable difference at the medium level, which is somewhat consistent with your hypothesis. Data available here; the criterion used was “does this have more than one broad subcategory?”, combined with my own intuition as to whether the field feels “broad”.
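Those rounded percentages are consistent with counts of 5/6/3/7 (big/medium/small/none) out of the 21 broad categories and 7/3/8/11 out of the 29 less broad ones; a quick sanity check (these per-category counts are back-derived from the rounded percentages, not taken from the database itself):

```python
# Counts (big, medium, small, none) back-derived from the rounded
# percentages above; assumptions, not the underlying database rows.
broad = {"big": 5, "medium": 6, "small": 3, "none": 7}     # 21 categories
narrow = {"big": 7, "medium": 3, "small": 8, "none": 11}   # 29 categories

def shares(counts):
    """Rounded percentage share of each outcome within a group."""
    total = sum(counts.values())
    return {k: round(100 * v / total) for k, v in counts.items()}

print(shares(broad))   # {'big': 24, 'medium': 29, 'small': 14, 'none': 33}
print(shares(narrow))  # {'big': 24, 'medium': 10, 'small': 28, 'none': 38}
```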