This is a good reply, because its objections are close to things I already expect will be cruxes.
If you need a strong guarantee of correctness, then this is quite important. I’m not so sure that this is always the case in machine learning, since ML models by their nature can usually train around various deficiencies;
Yeah, I’m interested in why we need strong guarantees of correctness in some contexts but not others, especially if we have control over that aspect of the system we’re building as well. If we have choice over how much the system itself cares about errors, then I can design the system to be more robust to failure if I want it to be.
I think this is definitely highly context-dependent. A scientific result that is wrong is far worse than the lack of a result at all, because this gives a false sense of confidence, allowing for research to be built on wrong results, or for large amounts of research personpower to be wasted on research ideas/directions that depend on this wrong result. False confidence can be very detrimental in many cases.
I think the crux for me here is how long it takes before people notice that the belief in a wrong result causes them to receive further wrong results, null results, or reach dead-ends, and then causes them to update their wrong belief. LK-99 is the most recent instance that I have in memory (there aren’t that many that I can recall, at least).
What’s the worst that happened from having false hope? Well, researchers spent time simulating and modeling the structure of it and tried to figure out if there was any possible pathway to superconductivity. There were several replication attempts. If that researcher-time-money is more valuable (meaning potentially more to lose), then that could be because the researcher quality is high, the time spent is long, or the money spent is very high.
If the researcher quality is high (and they spent time doing this rather than something else), then presumably we also get better replication attempts, as well as more solid simulations / models. If they debunk it, then those are more reliable debunks. This prevents more researcher-time-money from being spent on it in the future. If they don’t debunk it, that signal is more reliable, and so spending more on this is less likely to be a waste.
If researcher quality is low, then researcher-time-money may also be low, and thus there will be less that could be potentially wasted. I think the risk we are trying to avoid is losing high-quality researcher time that could be spent on other things. But if our highest-quality researchers also do high-quality debunkings, then we still gain something (or at least lose less) from their time spent on it.
The universe itself also makes it so that being wrong will necessarily cause you to hit a dead-end, and if not, then you are presumably learning something, obtaining more data, etc. Situations like LK-99 may arise because before our knowledge gets to a high-enough level about some phenomenon, there is some ambiguity, where the signal we are looking for seems to be both present and not-present.
If the system as a whole (“society”) is good at recognizing signal that is more reliable without needing to be experts at the same level as its best experts, that’s another way we avoid risk.
I worked on dark matter experiments as an undergrad, and as far as I know, those experiments were built such that they were only really for testing the WIMP models, but also so that they would rule out the WIMP models if those models were wrong (and it seems they did). But I don’t think they were necessarily a waste.
Yeah, I’m interested in why we need strong guarantees of correctness in some contexts but not others, especially if we have control over that aspect of the system we’re building as well. If we have choice over how much the system itself cares about errors, then I can design the system to be more robust to failure if I want it to be.
This would make sense if we were all perfect programmers. In practice, that’s not the case; from what I hear from others, not even at FAANG. Because of that, it’s probably much better to raise errors that show up loudly in testing than to rely on programmers to always handle silent failures or warnings on their own.
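As a rough illustration of the difference between a silent failure and a loud one, here is a minimal Python sketch. The function names (`mean_silent`, `mean_loud`, `is_over_budget`) and the budget scenario are hypothetical, made up for this example only.

```python
import math


def mean_silent(values):
    """Hypothetical 'quiet' version: returns NaN when there is no data."""
    if not values:
        return float("nan")
    return sum(values) / len(values)


def mean_loud(values):
    """Hypothetical 'loud' version: refuses to produce a number from no data."""
    if not values:
        raise ValueError("mean of empty input is undefined")
    return sum(values) / len(values)


def is_over_budget(average_cost, budget=100.0):
    # Every ordered comparison with NaN is False, so a NaN average
    # silently reads as "within budget" further down the line.
    return average_cost > budget


# The silent version passes a meaningless value downstream without complaint.
silent_result = is_over_budget(mean_silent([]))

# The loud version fails at the point where the problem actually is.
try:
    mean_loud([])
    loud_failed = False
except ValueError:
    loud_failed = True
```

With the silent version, the caller has to remember to check for NaN everywhere; with the loud version, a test that feeds in empty input fails immediately.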
I think the crux for me here is how long it takes before people notice that the belief in a wrong result causes them to receive further wrong results, null results, or reach dead-ends, and then causes them to update their wrong belief. LK-99 is the most recent instance that I have in memory (there aren’t that many that I can recall, at least).
Sometimes years or decades. See the replication crisis in psychology, which was decades in the making, and the Schön scandal, which wasted years of some researchers’ time, just for the first two examples off the top of my head.
You have a cartoon picture of experimental science. LK-99 is unusual in that it is easy to synthesise, and the properties being tested are easy to test. When you’re on the cutting edge, this is almost by necessity not the case, because most of the time the low-hanging fruit has been picked clean. Thus, experiments are messy and difficult, and when you fail to replicate, it is sometimes very hard to tell whether that is due to your own failure to reproduce the conditions (e.g. synthesise a pure-enough material, have a clean-enough experiment, etc.).
For a dark matter example, see DAMA/LIBRA. Few in the dark matter community take their result too seriously, but the attempts to reproduce this experiment have taken years and cost who knows how much, probably tens of millions.
I worked on dark matter experiments as an undergrad, and as far as I know, those experiments were built such that they were only really for testing the WIMP models, but also so that they would rule out the WIMP models if those models were wrong (and it seems they did). But I don’t think they were necessarily a waste.
I am a dark matter experimentalist. This is not a good analogy. The issue is not replication, but that results get built on; when a result gets overturned, a whole bunch of scaffolding collapses. Ruling out parameter space is good; you’re searching for things like dark matter. Having to keep looking at old theories is quite different; what are you searching for?
I think your view involves a bit of catastrophizing, or relying on broadly pessimistic predictions about the performance of others.
Remember, the “exception throwing” behavior involves taking the entire space of outcomes and splitting it into two things: “Normal” and “Error.” If we say this is what we ought to do in the general case, that’s basically saying this binary property is inherent in the structure of the universe.
But we know that there’s no phenomenon that can be said to actually be an “error” in some absolute, metaphysical sense. This is an arbitrary decision that we make: We choose to abort the process and destroy work in progress when the range of observations falls outside of a single threshold.
This only makes sense if we also believe that sending the possibly malformed output to the next stage in the work creates a snowball effect or an out-of-control process.
There are probably environments where that is the case. But I don’t think that it is the default case nor is it one that we’d want to engineer into our environment if we have any choice over that—which I believe we do.
If the entire pipeline is made of checkpoints where exceptions can be thrown, then removing an earlier checkpoint could mean that more time is wasted if the exception is destined to be thrown at a later one. But as I mentioned in the post, I usually think this is better, because I get more data about what the malformed input/output does to later steps in the process. Also, of course, if I remove all of the checkpoints, then the work is no longer going to be wasted.
Mapping states to a binary range is a projection which loses information. If I instead tell you, “This is what I know, this is how much I know it,” that seems better because it carries enough to still give you the projection if you wanted that, plus additional information.
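To make the “value plus how much I know it” idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Estimate` type, the 0 to 100 plausibility range, the confidence numbers, and the rule of taking the minimum confidence when combining are arbitrary choices, not a recommended calibration scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    value: float
    confidence: float  # hypothetical convention: 0.0 (no idea) to 1.0 (certain)


def calibrate(raw: float) -> Estimate:
    # Hypothetical stage: readings outside 0..100 are suspicious,
    # but they are passed on with low confidence instead of being thrown away.
    in_range = 0.0 <= raw <= 100.0
    return Estimate(raw, 0.95 if in_range else 0.2)


def combine(a: Estimate, b: Estimate) -> Estimate:
    # Confidence-weighted average; the result is only as trustworthy
    # as its weakest input (an assumed, deliberately conservative rule).
    total = a.confidence + b.confidence
    if total == 0:
        return Estimate(b.value, 0.0)
    value = (a.value * a.confidence + b.value * b.confidence) / total
    return Estimate(value, min(a.confidence, b.confidence))


def project_to_binary(est: Estimate, threshold: float = 0.5) -> float:
    # The "Normal"/"Error" split can always be recovered at the end:
    # it is just a threshold applied to the richer representation.
    if est.confidence < threshold:
        raise ValueError(f"confidence {est.confidence} is below {threshold}")
    return est.value


good = calibrate(50.0)
odd = calibrate(250.0)
merged = combine(good, odd)
```

The point of the sketch is only the direction of the mapping: `project_to_binary` can be computed from an `Estimate`, but an `Estimate` cannot be rebuilt from a raised exception.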
Sometimes years or decades. See the replication crisis in psychology, which was decades in the making, and the Schön scandal, which wasted years of some researchers’ time, just for the first two examples off the top of my head.
I don’t know if I agree that those things have anything to do with people tolerating probability and using calibration to continue working under conditions of high uncertainty.
The issue is not replication, but that results get built on; when that result gets overturned, a whole bunch of scaffolding collapses.
I think you’re also saying that when you predict that people are limited or stunted in some capacity, we have to intervene to limit or stunt them even more, because there is some danger in letting them operate in their original capacity.
It’s like, “Well they could be useful, if they believed what I wanted them to. But they don’t, and so, it’s better to prevent them from working at all.”
Remember, the “exception throwing” behavior involves taking the entire space of outcomes and splitting it into two things: “Normal” and “Error.” If we say this is what we ought to do in the general case, that’s basically saying this binary property is inherent in the structure of the universe.
I think it works in the specific context of programming because for a lot of functions (in the functional context for simplicity), behaviours are essentially bimodal distributions. They are rather well behaved for some inputs, and completely misbehaving (according to specification) for others. In the former category you still don’t have perfect performance; you could have quantisation/floating-point errors, for example, but it’s a tightly clustered region of performing mostly to-spec. In the second, the results would almost never be just a little wrong; instead, you’d often just get unspecified behaviour or results that aren’t even correlated to the correct one. Behaviours in between are quite rare.
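A small Python sketch of the two modes described here, under stated assumptions: `add_u8` is a hypothetical helper that imitates 8-bit unsigned addition with wraparound, chosen only because it shows an out-of-spec result that is unrelated to the correct one.

```python
def add_u8(a: int, b: int) -> int:
    """Hypothetical 8-bit unsigned add that wraps around, like fixed-width arithmetic."""
    return (a + b) % 256


# In-spec mode: the answer is slightly off, but tightly clustered around the truth.
float_error = abs((0.1 + 0.2) - 0.3)

# Out-of-spec mode: 300 does not fit in 8 bits, so the result is not
# "a little wrong"; it has no useful relation to the correct sum.
wrapped = add_u8(200, 100)

# For inputs whose sum fits, the same function is exactly right.
in_range = add_u8(20, 30)
```

Here `float_error` is a rounding error many orders of magnitude below the result, while `wrapped` is off by a full 256, which is the kind of gap between the two modes the paragraph above describes.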
I think you’re also saying that when you predict that people are limited or stunted in some capacity, we have to intervene to limit or stunt them even more, because there is some danger in letting them operate in their original capacity.
It’s like, “Well they could be useful, if they believed what I wanted them to. But they don’t, and so, it’s better to prevent them from working at all.”
If you were right, we’d all be hand-optimising assembly for maximum performance in HPC. Ultimately, many of us do the minimal work needed to accomplish the task, sometimes to the detriment of the task at hand. I believe that I’m not alone in this thinking, and you’d need quite a lot of evidence to convince others. Look at the development of languages over the years, with newer languages (Rust and Julia, for example) doing their best to leave less room for user errors and poor practices that impact both performance and security.