The only reason to do the “don’t use all available data when formulating your hypothesis, so you have additional data to test it on” thing suggested in the article is that you’re sufficiently irrational that more data can hurt you. And of course, this does happen in science; the most obvious failure case is probably overfitting.
If you look at the data rationally (or even just with the best approximation you can get for a certain amount of computing resources), there is nothing at all you can do when getting it in small pieces that you couldn’t also do if you got it all at once.
If getting the data all at once hurts you in any way, then you’re either approaching the problem wholly wrongly, or your priors are broken (for instance because they don’t give sufficiently greater probability mass to simpler hypotheses than more complicated ones).
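As a rough sketch of what “give more probability mass to simpler hypotheses” can mean in practice (my own illustration; the hypothesis names, description lengths, and likelihoods below are all made up), a description-length prior might look like this:

```python
# Hypothetical hypotheses with made-up description lengths (in bits) and
# made-up likelihoods of the observed data under each hypothesis.
hypotheses = {
    "simple":  (10, 0.05),
    "medium":  (25, 0.20),
    "complex": (60, 1.00),   # fits the data perfectly, but is long
}

def posterior(hyps):
    """Posterior proportional to 2**(-length) * likelihood, normalized."""
    unnorm = {name: 2.0 ** (-length) * like for name, (length, like) in hyps.items()}
    total = sum(unnorm.values())
    return {name: w / total for name, w in unnorm.items()}

print(posterior(hypotheses))
# "complex" needs a likelihood advantage on the order of 2**(60 - 10) over
# "simple" before it wins; a perfect retrospective fit alone is not enough.
```

A broken prior, in the sense above, would be one whose penalty for extra complexity is so weak that a perfect retrospective fit always wins.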
If getting the data all at once hurts you in any way, then you’re either approaching the problem wholly wrongly, or your priors are broken
Well, it should be noted that, if a theory based on a subset of the data predicts the whole data set, then that theory has a higher probability of being correct than a theory based on the whole data set.
But of course it’s harder to construct such an effective theory from a subset of the data. It might be so much harder that the gain in probability isn’t worth the effort.
Well, it should be noted that, if a theory based on a subset of the data predicts the whole data set, then that theory has a higher probability of being correct than a theory based on the whole data set.
But that’s exactly because you don’t trust the scientist who came up with the hypothesis by looking at the whole data set to discount correctly for the complexity of their hypothesis. This might happen either because you think they’re irrational, or because you’re worried about intellectual dishonesty. In the latter case, though, you should also worry about the scientist with the allegedly limited-data theory having sneaked a peek at the full set, or having come up with enough overly specific theories that one of them was likely to survive the follow-up test.
As the comments to that post say, if you can actually look at the hypotheses in question, and you’re completely confident in your own judgement of simplicity, that judgement completely screens off how much data was used in formulating them.
The idea behind the scientific method is to design procedures that are robust to the scientist being biased or incompetent or even corrupt. Any approach that starts with “assume a perfect scientist” is not going to work in reality.
Science is a set of hacks to get usable modelling out of humans, accepting that:
1. there are things humans do which are critical to modelling reality, and which you do not understand to the point of being able to reimplement them, but
2. you also can’t just leave humans to do free-form theorizing, because that has been conclusively shown to lead to all kinds of problems.
The critical black box in this specific case is how to judge a theory’s simplicity, and how best to build a prior from that judgement.
As long as either of those is a black box to you, you won’t be able to do much better than using high-level heuristic hacks of the sort science is made out of. But that’s going to bite you every time you don’t have the luxury of being able to apply these hacks, say because you’re modelling (some aspect of) human history and can’t rerun the experiment. Also, you won’t be able to build an AGI.
In addition, if you’re really worried about corruption, the holding-back-data-on-purpose thing sets up an opportunity for great profits, like this:
1. Corrupt scientist takes out a loan for BIGNUM $.
2. Corrupt scientist pays this money to someone with access to the still-secret data.
3. Bribed data keeper gives the corrupt scientist a copy of the data.
4. Corrupt scientist fits their hypothesis to the whole data set.
5. Corrupt scientist publishes the hypothesis.
6. Full data set is released officially.
7. Hypothesis of the corrupt scientist is verified to match the whole data set. Corrupt scientist gains great prestige, and uses that to obtain enough money to pay off the loan from step 1, and then some.
You could try to set up the data-keeper organization so that a premature limited data release is unlikely even in the face of potentially large bribes, but that seems like a fairly tough problem (and are they even thinking about it seriously?). Data is very easy to copy; preventing it from being copied is hard. And in this case, more so than in most cases where you’re worried about leaks, figuring out that a leak has in fact happened might be extremely difficult, at least if you really are ignorant about what hypothesis simplicity looks like.
But that’s going to bite you every time you don’t have the luxury of being able to apply these hacks—say because you’re modelling (some aspect of) human history, and can’t rerun the experiment.
History sounds like exactly the situation where “hold back half the data, hypothesise on the other half, then look at the whole” is the only reasonable way of going about it.
Also, you won’t be able to build an AGI.
I don’t follow that argument at all; in the worst-case scenario, you can brute-force it by scanning and modelling a human brain. But even if true, it’s not really an issue for social scientists and their ilk, and there the “look at half the data” approach would definitely improve their procedures. It would make science work for the “flawed but honest” crowd.
As for deliberately holding back half the data from other scientists (as opposed to one guy simply choosing to only look at half), that’s a different issue. I’ve got no really strong feelings on that. It could go either way.
It’s an OK hack for someone in the “flawed but honest” crowd, individually. But note that it really doesn’t scale to dealing with corruption (which was one of the problems I assumed in the post you replied to).
Extended to an entire field, this means that you may end up with N papers, all about the same data set, all proposing a different hypothesis that matches the set well, and all claiming that their hypothesis was formulated using this procedure. In other words, you end up with unverifiable “trust us, we didn’t cheat” claims for each of those hypotheses, which is not a good basis for arriving at a consensus in the field.
Re AI design, assuming you actually understand what you implemented (as opposed to just blindly copying algorithms from the human brain without understanding what they do), the reason this method would work is that you’ve successfully extracted the human built-in simplicity prior (and I don’t know how good that one is exactly, but it has to be a halfway workable approximation; otherwise humans couldn’t model reality at all).
As the comments to that post say, if you can actually look at the hypotheses in question, and you’re completely confident in your own judgement of simplicity, that judgement completely screens off how much data was used in formulating them.
I agree that it wouldn’t matter how much data we gave the scientists if they had fixed a method for turning data into a theory beforehand.
And I agree that such a method should settle on the simplest theory among all candidates. It should implement Occam’s razor.
But we shouldn’t expect the scientists to fix such a method before seeing the data. Occam’s razor is not enough. You first have to have a computationally feasible way to generate good candidate theories from which you choose the simplest one. And we have every reason to expect that cosmologists will eventually come up with better methods for turning cosmological data into good candidate theories. Therefore, it doesn’t make sense to force the cosmologists to bind themselves to a method now. They need the freedom to discover better methods than any that they’ve yet found.
The requirement of “computational feasibility” means that we can expect to have several candidate methods with no a priori way to judge confidently that one is better than the other. We will need recourse to empirical observations to compare the methods.
In this comment of mine to the post linked above, I showed that if a method produces a theory that predicts the whole data set from a subset, then that method is probably superior to a method that uses the whole data set. The proof goes through even if we assume that each method has a step where it applies Occam’s razor:
Define a method to be a map that takes in a batch of evidence and returns a theory. We have two assumptions:
ASSUMPTION 1: The theory produced by giving an input batch to a method will at least predict that input. That is, no matter how flawed a method of theory-construction is, it won’t contradict the evidence fed into it. More precisely,
p( M(B) predicts B ) = 1.
[...]
ASSUMPTION 2: If a method M is known to be flawed, then its theories are less likely to make correct predictions of future observations. More precisely, if B2 is not contained in B1, then
p( M(B1) predicts B2 | M is flawed ) < p( M(B1) predicts B2 | M is not flawed ).
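To illustrate the shape of that argument (this is my own toy Monte Carlo, not the linked proof, and the probabilities are made up), consider methods that are either “good” or “flawed”, where both kinds always fit the batch they were given (Assumption 1), but only good methods are likely to predict held-out data (Assumption 2):

```python
import random

P_GOOD = 0.5            # prior probability that a method is good
P_PREDICT_GOOD = 0.9    # chance a good method's theory, built on B1, predicts B2
P_PREDICT_FLAWED = 0.2  # same chance for a flawed method

def simulate(trials=100_000):
    good_and_predicted = 0
    predicted = 0
    for _ in range(trials):
        good = random.random() < P_GOOD
        p = P_PREDICT_GOOD if good else P_PREDICT_FLAWED
        if random.random() < p:      # the subset-built theory also predicts B2
            predicted += 1
            good_and_predicted += good
    return good_and_predicted / predicted

# P(method is good | its theory built on B1 correctly predicted B2):
print(simulate())  # roughly 0.82, versus the prior of 0.5 you are left with
                   # for a method that saw everything (by Assumption 1 its
                   # perfect fit to the data carries no information).
```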
And I agree that such a method should settle on the simplest theory among all candidate theories. It should implement Occam’s razor.
It’s not quite that simple in practice. There’s a tradeoff here, between accuracy in retrospect and theory simplicity. The two extreme pathological cases are:
You demand absolute accuracy in retrospect, i.e. P(observed data | hypothesis) = 1. This is the limit case of overfitting, and yields a GLUT (a giant lookup table), which makes no predictions about the future, or completely useless ones.
You demand maximum simplicity. This is the limit case of underfitting, and yields a maximum-entropy distribution.
You want something in between those cases. I don’t know where exactly, but you would have to figure out some way to determine that point if you were, say, building an AGI.
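For what it’s worth, here is a minimal sketch of those two extremes (my own example, using polynomial fits rather than literal lookup tables): a degree-0 fit plays the role of the maximum-simplicity case, and an interpolating high-degree fit plays the role of the GLUT.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=x_train.shape)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for degree in (0, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_err, 4), round(test_err, 4))

# degree 0: maximum simplicity, underfits both the training and the held-out data.
# degree 9: interpolates every training point (the lookup-table analogue);
#           retrospective accuracy is perfect, held-out error is typically much worse.
# degree 3: the in-between regime the comment is pointing at.
```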
I can’t really follow your earlier post. Specifically, I can’t parse your use of “predicts”, which you seem to use as a boolean value. But theories don’t “predict” or “not predict” outcomes in any absolute sense; they just assign probabilities to outcomes. Please explain your use of the phrase.
Sorry, the earlier post was in the context of a toy problem in which predictions were boolean. I should have mentioned that. (I had made this assumption explicit in an earlier comment.)
My argument shows that, in the limiting case of boolean predictions, we should trust successful theories constructed using a subset of the data over theories constructed using all the data, even if all the theories were constructed using Occam’s razor. This at least strongly suggests the same possibility in more realistic cases where the theories assign probability distributions.
Ok, I think I get your earlier post now. I think you might be overcomplicating things here.
Sure, if you’re not confident what the correct simplicity prior is, you can get real evidence about which theory is likely to be stronger by observing things like their ability to correctly predict the outcome of new experiments. And to the extent that this tells you something about the way the originating scientist generates theories, there should even be some shifting of probability mass regarding the power of other theories produced by the same scientist. But that’s quite a lot of indirection, and there are significant unknown factors that will dilute these shifts.
Attempting this is somewhat like trying to estimate the probability of a scientist being right about a famous problem in their field based on their prestige. There’s a signal, but it’s quite noisy.
If you know what simplicity looks like (and of course that’s uncomputable, but you can always approximate it), and how much it’s worth in terms of probability mass, you can make a much better guess as to which hypothesis is stronger by just looking at the actual hypotheses.
Looking at things like “how many experimental results did this hypothesis actually predict correctly” is only informative to the extent that your understanding of simplicity and its value is lacking. Note that “lacking understanding of simplicity” isn’t meant to be especially disparaging; a good understanding of simplicity is hard. There’s a reason the scientific process includes an inelegant workaround instead.
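As a bare-bones illustration of the “screens off” point (my own numbers): the posterior comparison below uses only the prior you assign from inspecting each hypothesis and its likelihood on the full data set; how much of that data the author had seen never enters the calculation.

```python
def posterior(prior_a, like_a, prior_b, like_b):
    """Normalized posterior probabilities of two hypotheses A and B."""
    wa, wb = prior_a * like_a, prior_b * like_b
    return wa / (wa + wb), wb / (wa + wb)

# A: simple hypothesis, formulated after seeing the whole data set.
# B: more complex hypothesis, formulated from half the data and then
#    "confirmed" on the rest.  If you trust your simplicity prior, the
#    formulation history contributes nothing beyond these four numbers.
print(posterior(prior_a=0.9, like_a=0.4, prior_b=0.1, like_b=0.6))
```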