I come from science, so heavy scientific computing bias here.
I think you’re largely focusing on the wrong metric. Whether exceptions should be thrown has little to do with reliability (and indeed, exceptions can be detrimental to reliability), but instead is more related to correctness. They are not always the same thing. In a scientific computing context, for example, a program can be unreliable, with memory leaks resulting in processes often being killed by the OS, but still always give correct results when a computation actually manages to finish.
If you need a strong guarantee of correctness, then this is quite important. I’m not so sure that this is always the case in machine learning, since ML models by their nature can usually train around various deficiencies; with small implementation mistakes you might just be a little confused as to why your model performs worse than expected. In aerospace, correctness needs to be balanced against aeroplanes suddenly losing power, so correctness doesn’t always win. In scientific computing you might have the other extreme, where there’s very little riding on your program not exiting, since you can always do a bunch of test runs before sending your code off to an HPC cluster, but if you do run this thing and base a whole bunch of science off of it, it had better not be ruined by little insidious bugs. I can imagine correctness mattering a lot too in crypto and security contexts, where a bug might cause information to leak and it is probably better for your program to die from internal checks than for your private key to be leaked.
I’m not sure if I agree that a job poorly-done is worse than one not even started.
I think this is definitely highly context-dependent. A scientific result that is wrong is far worse than the lack of a result at all, because this gives a false sense of confidence, allowing for research to be built on wrong results, or for large amounts of research personpower to be wasted on research ideas/directions that depend on this wrong result. False confidence can be very detrimental in many cases.
As to why general purpose languages usually involve error handling and errors: they are general purpose languages and have to cater to use cases where you do care about errors. Built-in routines fail with exceptions rather than silently so that people building mission-critical code where correctness is the most important metric can at least kinda trust every language built-in routine to return correct results if it manages to return something successfully.
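To make the point about built-ins concrete, here is a minimal Python sketch (Python is only an example language here, and `silent_parse` and `total` are hypothetical helpers made up for illustration). The built-in `int` refuses to guess at a malformed reading, while the silent variant hands back a plausible-looking but wrong total.

```python
def silent_parse(text, default=0):
    """Hypothetical 'fail silently' parser: hides bad input behind a default."""
    try:
        return int(text)
    except ValueError:
        return default


def total(readings, parse):
    """Sum a list of textual readings using the given parser."""
    return sum(parse(r) for r in readings)


readings = ["12", "7", "l3"]  # the last entry is a typo for "13"

# The built-in raises, so the bad reading cannot leak into a result.
try:
    strict_result = total(readings, int)
except ValueError as exc:
    strict_result = None
    print("strict parser stopped:", exc)

# The silent variant returns a number that looks fine but is wrong.
silent_result = total(readings, silent_parse)
print("silent parser returned:", silent_result)
```

If the strict version returns anything at all, every reading was parsed; the silent version gives no such assurance.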
I’d add that correctness often is security: a job poorly done is an opportunity for a hacker to subvert your system and turn your poor job into a great job for themselves.
This is a good reply, because its objections are close to things I already expect will be cruxes.
If you need a strong guarantee of correctness, then this is quite important. I’m not so sure that this is always the case in machine learning, since ML models by their nature can usually train around various deficiencies;
Yeah, I’m interested in why we need strong guarantees of correctness in some contexts but not others, especially if we have control over that aspect of the system we’re building as well. If we have choice over how much the system itself cares about errors, then I can design the system to be more robust to failure if I want it to be.
I think this is definitely highly context-dependent. A scientific result that is wrong is far worse than the lack of a result at all, because this gives a false sense of confidence, allowing for research to be built on wrong results, or for large amounts of research personpower to be wasted on research ideas/directions that depend on this wrong result. False confidence can be very detrimental in many cases.
I think the crux for me here is how long it takes before people notice that the belief in a wrong result causes them to receive further wrong results, null results, or reach dead-ends, and then causes them to update their wrong belief. LK-99 is the most recent instance that I have in memory (there aren’t that many that I can recall, at least).
What’s the worst that happened from having false hope? Well, researchers spent time simulating and modeling the structure of it and tried to figure out if there was any possible pathway to superconductivity. There were several replication attempts. If that researcher-time-money is more valuable (meaning potentially more to lose), then that could be because the researcher quality is high, the time spent is long, or the money spent is very high.
If the researcher quality is high (and they spent time doing this rather than something else), then presumably we also get better replication attempts, as well as more solid simulations / models. If they debunk it, then those are more reliable debunks. This prevents more researcher-time-money from being spent on it in the future. If they don’t debunk it, that signal is more reliable, and so spending more on this is less likely to be a waste.
If researcher quality is low, then researcher-time-money may also be low, and thus there will be less that could be potentially wasted. I think the risk we are trying to avoid is losing high-quality researcher time that could be spent on other things. But if our highest-quality researchers also do high-quality debunkings, then we still gain something (or at least lose less) from their time spent on it.
The universe itself also makes it so that being wrong will necessarily cause you to hit a dead-end, and if not, then you are presumably learning something, obtaining more data, etc. Situations like LK-99 may arise because before our knowledge gets to a high-enough level about some phenomenon, there is some ambiguity, where the signal we are looking for seems to be both present and not-present.
If the system as a whole (“society”) is good at recognizing signal that is more reliable without needing to be experts at the same level as its best experts, that’s another way we avoid risk.
I worked on dark matter experiments as an undergrad, and as far as I know, those experiments were built such that they were only really for testing the WIMP models, but also so that they would rule out the WIMP models if they were wrong (and it seems they did). But I don’t think they were necessarily a waste.
Yeah, I’m interested in why we need strong guarantees of correctness in some contexts but not others, especially if we have control over that aspect of the system we’re building as well. If we have choice over how much the system itself cares about errors, then I can design the system to be more robust to failure if I want it to be.
This would make sense if we were all great programmers who are perfect. In practice, that’s not the case, not even at FAANG from what I hear from others. Because of that, it’s probably much better to give errors that will show up loudly in testing than to rely on programmers to always handle silent failures or warnings on their own.
I think the crux for me here is how long it takes before people notice that the belief in a wrong result causes them to receive further wrong results, null results, or reach dead-ends, and then causes them to update their wrong belief. LK-99 is the most recent instance that I have in memory (there aren’t that many that I can recall, at least).
Sometimes years or decades. See the replicability crisis in psychology that’s decades in the making, and the Schön scandal that wasted years of some researchers’ time, just for the first two examples off the top of my head.
You have a cartoon picture of experimental science. LK-99 is quite unique in that it is easy to synthesise, and the properties being tested are easy to test. When you’re on the cutting edge, this is almost by necessity not the case, because most of the time the low-hanging fruit has been picked clean. Thus, experiments are messy and difficult, and when you fail to replicate, it is sometimes very hard to tell if it is due to your failure to reproduce the conditions (e.g. synthesise a pure-enough material, have a clean enough experiment, etc.) or because the original result was wrong.
For a dark matter example, see DAMA/LIBRA. Few in the dark matter community take their result too seriously, but the attempts to reproduce this experiment have taken years and cost who knows how much, probably tens of millions.
I worked on dark matter experiments as an undergrad, and as far as I know, those experiments were built such that they were only really for testing the WIMP models, but also so that they would rule out the WIMP models if they were wrong (and it seems they did). But I don’t think they were necessarily a waste.
I am a dark matter experimentalist. This is not a good analogy. The issue is not replication, but that results get built on; when that result gets overturned, a whole bunch of scaffolding collapses. Ruling out parameter space is good when you’re searching for things like dark matter. Having to keep looking at old theories is quite different; what are you searching for?
I think your view involves a bit of catastrophizing, or relying on broadly pessimistic predictions about the performance of others.
Remember, the “exception throwing” behavior involves taking the entire space of outcomes and splitting it into two things: “Normal” and “Error.” If we say this is what we ought to do in the general case, that’s basically saying this binary property is inherent in the structure of the universe.
But we know that there’s no phenomenon that can be said to actually be an “error” in some absolute, metaphysical sense. This is an arbitrary decision that we make: We choose to abort the process and destroy work in progress when the range of observations falls outside of a single threshold.
This only makes sense if we also believe that sending the possibly malformed output to the next stage in the work creates a snowball effect or an out-of-control process.
There are probably environments where that is the case. But I don’t think that it is the default case nor is it one that we’d want to engineer into our environment if we have any choice over that—which I believe we do.
If the entire pipeline is made of checkpoints where exceptions can be thrown, then removing an earlier checkpoint could mean that more time is wasted when the same input is destined to trigger an exception at a later checkpoint anyway. But like I mentioned in the post, I usually think this is better, because I get more data about what the malformed input/output does to later steps in the process. Also, of course, if I remove all of the checkpoints, then nothing gets aborted and the work is no longer wasted.
Mapping states to a binary range is a projection which loses information. If I instead tell you, “This is what I know, this is how much I know it,” that seems better because it carries enough to still give you the projection if you wanted that, plus additional information.
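As a rough Python sketch of what that could look like (the names `Estimate`, `require`, and `LowConfidenceError`, and the numbers used, are all made up for illustration and not taken from any existing library): the producer reports a value together with a confidence, and any consumer that wants the binary Normal/Error view can still recover it by choosing its own threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """A value together with how much we trust it (0.0 to 1.0)."""
    value: float
    confidence: float


class LowConfidenceError(Exception):
    """Raised only when a caller asks for the binary Normal/Error view."""


def require(estimate, threshold):
    """Project an Estimate onto the binary view: return the value or raise."""
    if estimate.confidence < threshold:
        raise LowConfidenceError(
            f"confidence {estimate.confidence} is below threshold {threshold}"
        )
    return estimate.value


reading = Estimate(value=4.2, confidence=0.6)

# A tolerant consumer keeps working with the graded information.
weighted = reading.value * reading.confidence

# A strict consumer gets the usual exception behaviour on demand.
try:
    strict = require(reading, threshold=0.9)
except LowConfidenceError:
    strict = None
```

The threshold lives with the consumer rather than the producer, so two consumers of the same `Estimate` can disagree about what counts as an error.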
Sometimes years or decades. See the replicability crisis in psychology that’s decades in the making, and the Schön scandal that wasted years of some researchers’ time, just for the first two examples off the top of my head.
I don’t know if I agree that those things have anything to do with people tolerating probability and using calibration to continue working under conditions of high uncertainty.
The issue is not replication, but that results get built on; when that result gets overturned, a whole bunch of scaffolding collapses.
I think you’re also saying that when you predict that people are limited or stunted in some capacity, that we have to intervene to limit them or stunt them even more, because there is some danger in letting them operate in their original capacity.
It’s like, “Well they could be useful, if they believed what I wanted them to. But they don’t, and so, it’s better to prevent them from working at all.”
Remember, the “exception throwing” behavior involves taking the entire space of outcomes and splitting it into two things: “Normal” and “Error.” If we say this is what we ought to do in the general case, that’s basically saying this binary property is inherent in the structure of the universe.
I think it works in the specific context of programming because for a lot of functions (in the functional context for simplicity), behaviours are essentially bimodal distributions. They are rather well behaved for some inputs, and completely misbehaving (according to specification) for others. In the former category you still don’t have perfect performance; you could have quantisation/floating-point errors, for example, but it’s a tightly clustered region of performing mostly to-spec. In the second, the results would almost never be just a little wrong; instead, you’d often just get unspecified behaviour or results that aren’t even correlated to the correct one. Behaviours in between are quite rare.
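A small Python sketch of the two modes, assuming unchecked 32-bit integer addition as the example function (`add_int32` is a hypothetical helper that emulates the wraparound via `ctypes.c_int32`, which does no overflow checking):

```python
import ctypes


def add_int32(a, b):
    """Add two integers the way unchecked 32-bit arithmetic would (wraps on overflow)."""
    return ctypes.c_int32(a + b).value


# In-domain, floating point: the result is off by a tiny, tightly clustered amount.
float_error = abs((0.1 + 0.2) - 0.3)
print("float error:", float_error)

# In-domain, 32-bit addition: the result is exact.
in_range = add_int32(2_000_000_000, 100)
print("in range:", in_range)

# Out-of-domain: the true sum does not fit in 32 bits, and the answer is not
# "a little wrong" but lands at the opposite end of the representable range.
wrapped = add_int32(2_147_483_647, 1)
print("wrapped:", wrapped)
```

The floating-point error is many orders of magnitude below the values involved, while the overflowed sum has the wrong sign altogether; there is very little in between.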
I think you’re also saying that when you predict that people are limited or stunted in some capacity, that we have to intervene to limit them or stunt them even more, because there is some danger in letting them operate in their original capacity.
It’s like, “Well they could be useful, if they believed what I wanted them to. But they don’t, and so, it’s better to prevent them from working at all.”
If you were right, we’d all be hand-optimising assembly for perfect high performance in HPC. Ultimately, many people do the minimal work needed to accomplish their task, sometimes to the detriment of the task at hand. I believe that I’m not alone in this thinking, and you’d need quite a lot of evidence to convince others. Look at the development of languages over the years, with newer languages (Rust, Julia, as examples) doing their best to leave less room for user errors and poor practices that impact both performance and security.