Yeah, I’m interested in why we need strong guarantees of correctness in some contexts but not others, especially when we have control over that aspect of the system we’re building. If I have a choice over how much the system itself cares about errors, then I can design it to be more robust to failure if I want it to be.
This would make sense if we were all great programmers who never make mistakes. In practice, that’s not the case, and, from what I hear from others, not even in FAANG. Because of that, it’s probably much better to give errors that show up loudly in testing than to rely on programmers always handling silent failures or warnings on their own.
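To make that contrast concrete, here is a minimal sketch (a hypothetical config parser, not anyone’s actual code) of a loud failure versus a silent one:

```python
import json
from typing import Optional

def parse_config_loud(text: str) -> dict:
    # Malformed input raises immediately, so the problem surfaces in testing.
    # (Assumes the config is a JSON object.)
    return json.loads(text)

def parse_config_silent(text: str) -> Optional[dict]:
    # Malformed input quietly becomes None; every caller has to remember to
    # check, and the failure can travel a long way before anyone notices.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

In a test suite, the loud version fails the moment a fixture is malformed; the silent version only fails when some later use of the missing value happens to be exercised.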
I think the crux for me here is how long it takes before belief in a wrong result leads people to further wrong results, null results, or dead ends, and before that in turn causes them to update the wrong belief. LK-99 is the most recent instance I have in memory (there aren’t that many I can recall, at least).
Sometimes years or decades. See the replicability crisis in psychology that’s been decades in the making, and the Schön scandal that wasted years of some researchers’ time, just for the first two examples off the top of my head.
You have a cartoon picture of experimental science. LK-99 is quite unusual in that it is easy to synthesise and the properties being tested are easy to test. On the cutting edge this is almost by necessity not the case, because most of the time the low-hanging fruit has been picked clean. Experiments are therefore messy and difficult, and when you fail to replicate, it is sometimes very hard to tell whether that is because the original result was wrong or because you failed to reproduce the conditions (e.g. synthesising a pure enough material, running a clean enough experiment, etc.).
For a dark matter example, see DAMA/LIBRA. Few in the dark matter community take their result very seriously, but the attempts to reproduce the experiment have taken years and cost who knows how much, probably tens of millions.
I worked on dark matter experiments as an undergrad, and as far as I know, those experiments were built primarily to test the WIMP models, but also so that they would rule those models out if they were wrong (and it seems they did). But I don’t think they were necessarily a waste.
I am a dark matter experimentalist. This is not a good analogy. The issue is not replication, but that results get built on; when that result gets overturned, a whole bunch of scaffolding collapses. Ruling out parameter space is fine when you’re searching for something like dark matter; having to keep re-examining old theories is quite different. What are you searching for then?
I think your view involves a bit of catastrophizing, or relying on broadly pessimistic predictions about the performance of others.
Remember, the “exception throwing” behavior involves taking the entire space of outcomes and splitting it into two things: “Normal” and “Error.” If we say this is what we ought to do in the general case, that’s basically saying this binary property is inherent in the structure of the universe.
But we know that there’s no phenomenon that can be said to actually be an “error” in some absolute, metaphysical sense. This is an arbitrary decision that we make: We choose to abort the process and destroy work in progress when the range of observations falls outside of a single threshold.
This only makes sense if we also believe that sending the possibly malformed output to the next stage in the work creates a snowball effect or an out-of-control process.
There are probably environments where that is the case. But I don’t think it is the default case, nor is it one we’d want to engineer into our environment if we have any choice over that (which I believe we do).
If the entire pipeline is made of checkpoints where exceptions can be thrown, then removing an earlier checkpoint can mean more time is wasted, if an exception was destined to be thrown at a later stage anyway. But as I mentioned in the post, I usually think this is better, because I get more data about what the malformed input/output does to later steps in the process. And of course, if I remove all of the checkpoints, there’s no wasted work at all.
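A toy sketch of that trade-off (the stages and predicates are made up for illustration): each checkpoint can be switched off, so a malformed value is allowed to flow further downstream and its effects observed there instead.

```python
def checkpoint(value, predicate, enabled=True):
    # Raise only when this particular checkpoint is switched on.
    if enabled and not predicate(value):
        raise ValueError(f"checkpoint rejected {value!r}")
    return value

def pipeline(raw, early_check=True, late_check=True):
    x = checkpoint(raw, lambda v: isinstance(v, (int, float)), enabled=early_check)
    doubled = x * 2                                                 # stage 1
    y = checkpoint(doubled, lambda v: v >= 0, enabled=late_check)
    return y ** 0.5                                                 # stage 2

# With early_check=False, a string input is no longer rejected up front:
# "ab" * 2 == "abab", and the failure now surfaces later (here, as a TypeError
# when the second predicate compares a string to 0), which is the extra
# information about downstream effects being described above.
```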
Mapping states to a binary range is a projection that loses information. If I instead tell you, “This is what I know, and this is how much I know it,” that seems better, because it carries enough to still give you the projection if you want it, plus additional information.
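As a sketch of what “this is what I know, and this is how much I know it” could look like in code (the class and threshold are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Measured:
    value: float
    confidence: float  # however the producer calibrates it, e.g. 0.0 to 1.0

    def as_binary(self, threshold: float = 0.95) -> bool:
        # The lossy "Normal"/"Error" projection is still available,
        # but it is applied by the consumer, only when it is wanted.
        return self.confidence >= threshold

reading = Measured(value=3.7, confidence=0.62)
if not reading.as_binary():
    # Downstream code can still choose to keep working with the value
    # instead of aborting, because the extra information was preserved.
    pass
```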
Sometimes years or decades. See the replicability crisis in psychology that’s been decades in the making, and the Schön scandal that wasted years of some researchers’ time, just for the first two examples off the top of my head.
I don’t know if I agree that those things have anything to do with people tolerating probability and using calibration to continue working under conditions of high uncertainty.
The issue is not replication, but that results get built on; when that result gets overturned, a whole bunch of scaffolding collapses.
I think you’re also saying that when you predict that people are limited or stunted in some capacity, we have to intervene to limit them or stunt them even more, because there is some danger in letting them operate at their original capacity.
It’s like, “Well they could be useful, if they believed what I wanted them to. But they don’t, and so, it’s better to prevent them from working at all.”
Remember, the “exception throwing” behavior involves taking the entire space of outcomes and splitting it into two things: “Normal” and “Error.” If we say this is what we ought to do in the general case, that’s basically saying this binary property is inherent in the structure of the universe.
I think it works in the specific context of programming because, for a lot of functions (sticking to the functional setting for simplicity), behaviour is essentially bimodal. Functions are rather well behaved for some inputs and completely misbehave (relative to the specification) for others. In the former category you still don’t have perfect performance; you can have quantisation or floating-point errors, for example, but it’s a tightly clustered region of performing mostly to spec. In the latter, the results are almost never just a little wrong; you often get unspecified behaviour or results that aren’t even correlated with the correct one. Behaviours in between are quite rare.
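A small illustration of that bimodal shape, using a standard-library routine rather than anything from the discussion above: bisect_left is only specified for sorted input; inside that domain it is exact, and outside it the answer is not slightly off, it is essentially unrelated to the correct one.

```python
from bisect import bisect_left

sorted_xs = [1, 3, 5, 7, 9]
print(bisect_left(sorted_xs, 7))    # 3 -- the correct position

unsorted_xs = [5, 9, 1, 7, 3]
print(bisect_left(unsorted_xs, 7))  # 5 here -- not "a little wrong", just meaningless
```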
I think you’re also saying that when you predict that people are limited or stunted in some capacity, we have to intervene to limit them or stunt them even more, because there is some danger in letting them operate at their original capacity.
It’s like, “Well they could be useful, if they believed what I wanted them to. But they don’t, and so, it’s better to prevent them from working at all.”
If you were right, we’d all be hand-optimising assembly for peak performance in HPC. In reality, many people do the minimum of work to accomplish their task, sometimes to the detriment of the task at hand. I believe I’m not alone in this thinking, and you’d need quite a lot of evidence to convince others. Look at the development of languages over the years, with newer ones (Rust and Julia, for example) doing their best to leave less room for user errors and poor practices that impact both performance and security.