Reliability is surprisingly important. If I have a software tool that is 90% reliable, it's actually not that useful for automation, because I will spend way too much time manually fixing problems. This is especially a problem if I'm chaining multiple tools together in a script. I've been bitten really hard by this, because 90% feels pretty good if you run it a handful of times by hand, but once you add it to your automated sweep or whatever, it breaks and then you have to go in and fix things manually. And getting to 99% or 99.9% is really hard, because things break in all sorts of weird ways.
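To make the chaining point concrete, here is a back-of-the-envelope sketch (illustrative numbers only, and assuming the steps fail independently): a pipeline of n steps that each succeed with probability p succeeds end-to-end with probability p^n.

```python
# Back-of-the-envelope: reliability of a pipeline of independent steps.
# A chain of n steps, each succeeding with probability p, succeeds with p**n.

def chain_reliability(p: float, n: int) -> float:
    return p ** n

for p in (0.90, 0.99, 0.999):
    for n in (1, 3, 5, 10):
        print(f"per-step {p:.3f}, {n:2d} steps -> overall {chain_reliability(p, n):.3f}")

# per-step 0.900, 10 steps -> overall ~0.35: a tool that feels "pretty good"
# by hand becomes worse than a coin flip once chained in an automated sweep.
```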
I think this has lessons for AI: lack of reliability is one big reason I fail to get very much value out of AI tools. If my chatbot catastrophically hallucinates once every 10 queries, then I basically have to look everything up anyway to check. I think this is a major reason why cool demos often don't translate into things that are practically useful: 90% reliable is great for a demo (and you can also pick tasks your AI is more reliable at, rather than tasks that are actually useful in practice). This is one factor informing why my timelines are longer than some other people's.
One nuance here is that a software tool which succeeds at its goal 90% of the time, and fails in an automatically detectable fashion the other 10% of the time, is pretty useful for partial automation. Concretely, suppose you have a web scraper which performs a series of scripted clicks in hardcoded locations after hardcoded delays, and then extracts a value from the page immediately after some known hardcoded text. That will frequently give you a ≥ 90% success rate at getting the piece of information you want, while being much faster to code up than some real logic (especially if the site does anti-scraper stuff like randomizing CSS classes and DOM structure), and it saves a bunch of work over doing it all manually (because now you only have to manually extract info from the pages your scraper failed to scrape).
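A minimal sketch of the "fail detectably, queue for manual handling" pattern, in Python. The anchor text, function names, and the regex-based extraction are hypothetical stand-ins (the comment describes scripted clicks and hardcoded delays; this only illustrates the detectable-failure part):

```python
# Sketch: extract the value that appears immediately after a known, hardcoded
# anchor string; if the anchor isn't there (site changed, page didn't load),
# flag the page for manual handling instead of guessing.
import re
from typing import Optional

ANCHOR = "Total price:"  # hypothetical hardcoded text preceding the value we want

def extract_after_anchor(page_text: str, anchor: str = ANCHOR) -> Optional[str]:
    match = re.search(re.escape(anchor) + r"\s*([^\s<]+)", page_text)
    return match.group(1) if match else None  # None == detectable failure

def scrape_all(pages: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    results, needs_manual = {}, []
    for url, html in pages.items():
        value = extract_after_anchor(html)
        if value is None:
            needs_manual.append(url)   # only these pages get human attention
        else:
            results[url] = value
    return results, needs_manual

# Even at a 90% hit rate, the manual queue is a tenth of the original work.
```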
Even if failures are automatically detectable, I think it's still quite annoying. The cost is roughly logarithmic: there's a very large cliff in effort when going from zero manual intervention required to any manual intervention required whatsoever; and as the amount of manual intervention continues to increase, you can invest in infrastructure to make it less painful, and eventually delegate the work out to other people.
While I agree with this, I do want to note that it only lengthens timelines very much if we also assume scaling can't solve the reliability problem.
Even if scaling does eventually solve the reliability problem, it means that people are very plausibly overestimating how far along capabilities are, and how fast the rate of progress is, because the most impressive thing that can be done with 90% reliability plausibly advances faster than the most impressive thing that can be done with 99.9% reliability.
Perhaps it shouldn't be too surprising. Reliability, machine precision, and economy are likely the deciding factors in whether many (most?) technologies take off. The classic RoP case study: the bike.
Motorola engineers figured this out a few decades ago: even going from 99.99% to 99.999% makes a huge difference at a large scale. They even published a few interesting papers and monographs on it, from what I recall.
This can be explained by thinking about what these accuracy levels mean:
99.99% accuracy is one error every 10K trials.
99.999% accuracy is one error every 100K trials.
So the 99.999% system is 10x better!
When errors are costly and you're operating at scale, this is a huge difference.
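A quick sketch spelling out that arithmetic at scale (the trial count here is just illustrative):

```python
# Expected number of failures at scale for a given reliability level.
def expected_failures(reliability: float, trials: int) -> float:
    return (1.0 - reliability) * trials

for r in (0.9999, 0.99999):
    print(f"{r:.5%} reliable over 1M trials -> "
          f"{expected_failures(r, 1_000_000):,.0f} expected failures")

# 99.99% -> ~100 failures per million; 99.999% -> ~10 failures per million.
# When each failure is costly, that 10x gap dominates the economics.
```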