When they say things like “in cognitive science, Bayesian reasoner is the technically precise codeword that we use to mean rational mind,” they really do mean it. Move over, Aristotle!
Of course, in Catholicism, Catholic is the technically precise codeword that they use to mean rational mind. I am not a Catholic or even a Christian, but frankly, I think that if I had to vote for a dictator of the world and the only information I had was whether the candidate was an orthodox Bayesian or an orthodox Catholic, I’d go with the latter.
The only problem is that this little formula is not a complete, drop-in replacement for your brain. If a reservationist is skeptical of anything on God’s green earth, it’s people who want to replace his (or her) brain with a formula.
To make this more concrete, let’s look at how fragile Bayesian inference is in the presence of an attacker who’s filtering our event stream. By throwing off P(B), any undetected pattern of correlation can completely foul the whole system. If the attacker, whenever he pulls a red ball out of the urn, puts it back and keeps pulling until he gets a blue ball, the Bayesian “rational mind” will conclude that the urn is entirely full of blue balls. And Bayesian inference certainly does not offer any suggestion that you should look at who’s pulling balls out of the urn and see what he has up his sleeves.
Once again, the problem is not that Bayesianism is untrue. The problem is that the human brain has a very limited capacity for analytic reasoning to begin with.
If the attacker, whenever he pulls a red ball out of the urn, puts it back and keeps pulling until he gets a blue ball, the Bayesian “rational mind” will conclude that the urn is entirely full of blue balls.
Surely the actual Bayesian rational mind’s conclusion is that the attacker will (probably) always show a blue ball, nothing to do with the urn at all.
And Bayesian inference certainly does not offer any suggestion that you should look at who’s pulling balls out of the urn and see what he has up his sleeves.
I just facepalmed the hardest I’ve ever done while reading Unqualified Reservations. That is, not very hard—Mencius is nothing if not a charming and polite author—but still. Maybe he really ought to read at least one Sequence!
Could we start that reading with the classic Bayes’ Theorem example? Suppose 1% of women have breast cancer, 80% of mammograms on a cancerous woman will detect it, 9.6% on an uncancerous woman will be false positives. Suppose woman A gets a mammogram which indicates cancer. What are the odds she has cancer?
Now suppose women B, C, D, E, F… Z, AA, AB, AC, AD, etc., the entire patient list getting screened today, all test positive for cancer. Is the probability that woman A has cancer still 7.8%? Bayes’ rule, with the priors above, still says “yes”! You need more complicated prior probabilities (e.g. what are the odds that the test equipment is malfunctioning?) before your evidence can tell you what’s actually likely to be happening. But those more complicated, more accurate priors would have (very slightly) changed our original p(A|X) as well!
It’s not that Bayesian updating is wrong. It’s just that Bayes’ theorem never allows you to have a non-zero posterior probability coming from a zero prior, and to make any practical problem tractable everybody ends up implicitly assuming huge swaths of zero prior probability.
It’s not assuming zero probability. It’s assuming independence. Under the original model, it’s possible for all the women to get positives, but only 1% to actually have breast cancer. It’s just that a better prior would give a much higher probability.
Is there any practical difference between “assuming independent results” and “assuming zero probability for all models which do not generate independent results”? If not then I think we’ve just been exposed to people using different terminology.
Is there any practical difference between “assuming independent results” and “assuming zero probability for all models which do not generate independent results”?
No.
If not then I think we’ve just been exposed to people using different terminology.
I think it’s more than terminology. And if Mencius can be dismissed as someone who does not really get Bayesian inference, one can surely not say the same of Cosma Shalizi, who has made the same argument somewhere on his blog. (It was a few years ago and I can’t easily find a link. It might have been in a technical report or a published paper instead.) Suppose a Bayesian is trying to estimate the mean of a normal distribution from incoming data. He has a prior distribution of the mean, and each new observation updates that prior. But what if the data are not drawn from a normal distribution, but from the sum of two such distributions with well separated peaks? The Bayesian (he says) can never discover that. Instead, his estimate of the position of the single peak that he is committed to will wander up and down between the two real peaks, like the Flying Dutchman cursed never to find a port, while the posterior probability of seeing the data that he has seen plummets (on the log-odds scale) towards minus infinity. But he cannot avoid this: no evidence can let him update towards anything his prior gives zero probability to.
What (he says) can save the Bayesian from this fate? Model-checking. Look at the data and see if they are actually consistent with any model in the class you are trying to fit. If not, think of a better model and fit that.
Andrew Gelman says the same; there’s a chapter of his book devoted to model checking. And here’s a paper by both of them on Bayesian inference and philosophy of science, in which they explicitly describe model-checking as “non-Bayesian checking of Bayesian models”. My impression (not being a statistician) is that their view is currently the standard one.
I believe the hard-line Bayesian response to that would be that model checking should itself be a Bayesian process. (I’m distancing myself from this claim, because as a non-statistician, I don’t need to have any position on this. I just want to see the position stated here.) The single-peaked prior in Shalizi’s story was merely a conditional one: supposing the true distribution to be in that family, the Bayesian estimate does indeed behave in that way. But all we have to do to save the Bayesian from a fate worse than frequentism is to widen the picture. That prior was merely a subset, worked with for computational convenience, but in the true prior, that prior only accounted for some fraction p<1 of the probability mass, the remaining 1-p being assigned to “something else”. Then when the data fail to conform to any single Gaussian, the “something else” alternative will eventually overshadow the Gaussian model, and will need to be expanded into more detail.
“But,” the soft Bayesians might say, “how do you expand that ‘something else’ into new models by Bayesian means? You would need a universal prior, a prior whose support includes every possible hypothesis. Where do you get one of those? Solomonoff? Ha! And if what you actually do when your model doesn’t fit looks the same as what we do, why pretend it’s Bayesian inference?”
I suppose this would be Eliezer’s answer to that last question.
I am not persuaded that the harder Bayesians have any more concrete answer. Solmonoff induction is uncomputable and seems to unnaturally favour short hypotheses involving Busy-Beaver-sized numbers. And any computable approximation to it looks to me like brute-forcing an NP-hard problem.
I believe the hard-line Bayesian response to that would be that model checking should itself be a Bayesian process.
and
“But,” the soft Bayesians might say, “how do you expand that ‘something else’ into new models by Bayesian means? You would need a universal prior, a prior whose support includes every possible hypothesis. Where do you get one of those? Solomonoff? Ha! And if what you actually do when your model doesn’t fit looks the same as what we do, why pretend it’s Bayesian inference?”
I think a hard line needs to be drawn between statistics and epistemology. Statistics is merely a method of approximating epistemology—though a very useful one. The best statistical method in a given situation is the one that best approximates correct epistemology. (I’m not saying this is the only use for statistics, but I can’t seem to make sense of it otherwise)
Now suppose Bayesian epistemology is correct—i.e. let’s say Cox’s theorem + Solomonoff prior. The correct answer to any induction problem is to do the true Bayesian update implied by this epistemology, but that’s not computable. Statistics gives us some common ways to get around this problem. Here are a couple:
1) Bayesian statistics approach: restrict the class of possible models and put a reasonable prior over that class, then do the Bayesian update. This has exactly the same problem that Mencius and Cosma pointed out.
2) Frequentist statistics approach: restrict the class of possible models and come up with a consistent estimate of which model in that class is correct. This has all the problems that Bayesians constantly criticize frequentists for, but it typically allows for a much wider class of possible models in some sense (crucially, you often don’t have to assume distributional forms)
3) Something hybrid: e.g., Bayesian statistics with model checking. Empirical Bayes (where the prior is estimated from the data). Etc.
Now superficially, 1) looks the most like the true Bayesian update—you don’t look at the data twice, and you’re actually performing a Bayesian update. But you don’t get points for looking like the true Bayesian update, you get points for giving the same answer as the true Bayesian update. If you do 1), there’s always some chance that the class of models you’ve chosen is too restrictive for some reason. Theoretically you could continue to do 1) by just expanding the class of possible models and putting a prior over that class, but at some point that becomes computationally infeasible. Model checking is a computationally feasible way of approximating this process. And, a priori, I see no reason to think that some frequentist method won’t give the best computationally feasible approximation in some situation.
So, basically, a “hardline Bayesian” should do model checking and sometimes even frequentist statistics. (Similarly, a “hardline frequentist” in the epistemological sense should sometimes do Bayesian statistics. And, in fact, they do this all the time in econometrics.)
And any computable approximation to it looks to me like brute-forcing an NP-hard problem.
I find this a curious thing to say. Isn’t this an argument against every possible remotely optimal computable form of induction or decision-making? Of course a good computable approximation may wind up spending lots of resources solving a problem if that problem is important enough, this is not a blackmark against it. Problems in the real world can be hard, so dealing with them may not be easy!
“Omega flies up to you and hands you a box containing the Secrets of Immortality; the box is opened by the solution to an NP problem inscribed on it.” Is the optimal solution really to not even try the problem—because then you’re trying “brute-forcing an NP-hard problem”! - even if it turns out to be one of the majority of easily-solved problems? “You start a business and discover one of your problems is NP-hard. You immediately declare bankruptcy because your optimal induction optimally infers that the problem cannot be solved and this most optimally limits your losses.”
And why NP-hard, exactly? You know there are a ton of harder complexity classes in the complexity zoo, right?
The right answer is simply to point out that the worst case of the optimal algorithm is going to be the worst case of all possible problems presented, and this is exactly what we would expect since there is no magic fairy dust which will collapse all problems to constant-time solutions.
I find this a curious thing to say. Isn’t this an argument against every possible remotely optimal computable form of induction or decision-making?
There might well be a theorem formalising that statement. There might also be one formalising the statement that every remotely optimal form of induction or decision-making is uncomputable. If that’s the way it is, well, that’s the way it is.
“Omega flies up to you
This is an argument of the form “Suppose X were true—then X would be true! So couldn’t X be true?”
“You start a business and discover one of your problems is NP-hard. You immediately declare bankruptcy because your optimal induction optimally infers that the problem cannot be solved and this most optimally limits your losses.”
You try to find a method that solves enough examples of the NP-hard problem well enough to sell the solutions, such that your more bounded ambition puts you back in the realm of P. This is done all the time—freight scheduling software, for example. Or airline ticket price searching. Part of designing optimising compilers is not attempting analyses that take insanely long.
And why NP-hard, exactly? You know there are a ton of harder complexity classes in the complexity zoo, right?
Harder classes are subsets of NP-hard, and everything in NP-hard is hard enough to make the point. Of course, there is the whole uncomputability zoo above all that, but computing the uncomputable is even more of a wild goose chase. “Omega flies up to you and hands you a box containing the Secrets of Immortality; for every digit of Chatin’s Omega you correctly type in, you get an extra year, and it stops working after the first wrong answer”.
This is an argument of the form “Suppose X were true—then X would be true! So couldn’t X be true?”
No, this is pointing out that if you provide an optimal outcome barricaded by a particular obstacle, then that optimal outcome will trivially be at least as hard as that obstacle.
You try to find a method that solves enough examples of the NP-hard problem well enough to sell the solutions, such that your more bounded ambition puts you back in the realm of P. This is done all the time—freight scheduling software, for example. Or airline ticket price searching. Part of designing optimising compilers is not attempting analyses that take insanely long.
This is exactly the point made for computable approximations to AIXI. Thank you for agreeing.
Harder classes are subsets of NP-hard, and everything in NP-hard is hard enough to make the point.
Are you sure you want to make that claim? That all harder classes are subsets of NP-hard?
but computing the uncomputable is even more of a wild goose chase. “Omega flies up to you and hands you a box containing the Secrets of Immortality; for every digit of Chatin’s Omega you correctly type in, you get an extra year, and it stops working after the first wrong answer”.
Harder classes are subsets of NP-hard, and everything in NP-hard is hard enough to make the point.
Are you sure you want to make that claim? That all harder classes are subsets of NP-hard?
No, carelessness on my part. Doesn’t affect my original point, that schemes for approximating Solomonoff or AIXI look like at least exponential brute force search.
You try to find a method that solves enough examples of the NP-hard problem well enough to sell the solutions, such that your more bounded ambition puts you back in the realm of P. This is done all the time—freight scheduling software, for example. Or airline ticket price searching. Part of designing optimising compilers is not attempting analyses that take insanely long.
This is exactly the point made for computable approximations to AIXI.
Since AIXI is, by construction, the best possible intelligent agent, all work on AGI can, in a rather useless sense, be described as an approximation to AIXI. To the extent that such an attempt works (i.e. gets substantially further than past attempts at AGI), it will be because of new ideas not discovered by brute force search, not because it approximates AIXI.
schemes for approximating Solomonoff or AIXI look like at least exponential brute force search.
Well, yeah. Again—why would you expect anything else? Given that there exist problems which require that or worse for solution? How can a universal problem solver do any better?
Since AIXI is, by construction, the best possible intelligent agent, all work on AGI can, in a rather useless sense, be described as an approximation to AIXI.
Yes.
To the extent that such an attempt works (i.e. gets substantially further than past attempts at AGI), it will be because of new ideas not discovered by brute force search, not because it approximates AIXI.
No. Given how strange and different AIXI works, it can easily stimulate new ideas.
No. Given how strange and different AIXI works, it can easily stimulate new ideas.
The spin-off argument. Here’s a huge compendium of spinoffs of previous approaches to AGI. All very useful, but not AGI. I’m not expecting better from AIXI.
Hm, so let’s see; you started off mocking the impossibility and infeasibility of AIXI and any computable version:
I am not persuaded that the harder Bayesians have any more concrete answer. Solmonoff induction is uncomputable and seems to unnaturally favour short hypotheses involving Busy-Beaver-sized numbers. And any computable approximation to it looks to me like brute-forcing an NP-hard problem.
Then you admitted that actually every working solution can be seen as a form of SI/AIXI:
There might well be a theorem formalising that statement. There might also be one formalising the statement that every remotely optimal form of induction or decision-making is uncomputable. If that’s the way it is, well, that’s the way it is… Since AIXI is, by construction, the best possible intelligent agent, all work on AGI can, in a rather useless sense, be described as an approximation to AIXI
And now you’re down to arguing that it’ll be “very useful, but not AGI”.
I stand by the first quote. Every working solution can in a useless sense be seen as a form of SI/AIXI. The sense that a hot-air balloon can be seen as an approach to landing on the Moon.
And now you’re down to arguing that it’ll be “very useful, but not AGI”.
At the very most. Whether AIXI-like algorithms get into the next edition of Russell and Norvig, having proved of practical value, well, history will decide that, and I’m not interested in predicting it. I will predict that it won’t prove to be a viable approach to AGI.
Isn’t there a very wide middle ground between (1) assigning 100% of your mental probability to a single model, like a normal curve and (2) assigning your mental probability proportionately across every conceivable model ala Solomonoff?
I mean the whole approach here sounds more philosophical than practical. If you have any kind of constraint on your computing power, and you are trying to identify a model that most fully and simply explains a set of observed data, then it seems like the obvious way to use your computing power is to put about a quarter of your computing cycles on testing your preferred model, another quarter on testing mild variations on that model, another quarter on all different common distribution curves out of the back of your freshman statistics textbook, and the final quarter on brute-force fitting the data as best you can given that your priors about what kind of model to use for this data seem to be inaccurate.
I can’t imagine any human being who is smart enough to run a statistical modeling exercise yet foolish enough to cycle between two peaks forever without ever questioning the assumption of a single peak, nor any human being foolish enough to test every imaginable hypothesis, even including hypotheses that are infinitely more complicated than the data they seek to explain. Why would we program computers (or design algorithms) to be stupider than we are? If you actually want to solve a problem, you try to get the computer to at least model your best cognitive features, if not improve on them. Am I missing something here?
Isn’t there a very wide middle ground between (1) assigning 100% of your mental probability to a single model, like a normal curve and (2) assigning your mental probability proportionately across every conceivable model ala Solomonoff?
Yes, the question is what that middle ground looks like—how you actually come up with new models. Gelman and Shalizi say it’s a non-Bayesian process depending on human judgement. The behaviour that you rightly say is absurd, of the Bayesian Flying Dutchman, is indeed Shalizi’s reductio ad absurdum of universal Bayesianism. I’m not sure what gwern has just been arguing, but it looks like doing whatever gets results through the week while going to the church of Solomonoff on Sundays.
An algorithmic method of finding new hypotheses that works better than people is equivalent to AGI, so this is not an issue I expect to see solved any time soon.
An algorithmic method of finding new hypotheses that works better than people is equivalent to AGI, so this is not an issue I expect to see solved any time soon.
Eh. What seems AGI-ish to me is making models interact fruitfully across domains; algorithmic models to find new hypotheses for a particular set of data are not that tough and already exist (and are ‘better than people’ in the sense that they require far less computational effort and are far more precise at distinguishing between models).
The hypothesis-discovery methods are universal; you just need to feed them data. My view is that the hard part is picking what data to feed them, and what to do with the models they discover.
Edit: I should specify, the models discovered grow in complexity based on the data provided, and so it’s very difficult to go meta (i.e. run hypothesis discovery on the hypotheses you’ve discovered), because the amount of data you need grows very rapidly.
I don’t think any robot scientists would be eligible for Nobel prizes; Nobel’s will specifies persons. We’ve had robot scientists for almost a decade now, but they tend to excel in routine and easily automatized areas. I don’t think they will make Nobel-level contributions anytime soon, and by the time they do, the intelligence explosion will be underway.
I am not persuaded that the harder Bayesians have any more concrete answer. Solmonoff induction is uncomputable and seems to unnaturally favour short hypotheses involving Busy-Beaver-sized numbers. And any computable approximation to it looks to me like brute-forcing an NP-hard problem.
What if, as a computational approximation of the universal prior, we use genetic algorithms to generate a collection of agents, each using different heuristics to generate hypotheses? I mean, there’s probably better approximations than that; but we have strong evidence that this one works and is computable.
What if, as a computational approximation of the universal prior, we use genetic algorithms to generate a collection of agents, each using different heuristics to generate hypotheses?
Whatever approach to AGI anyone has, let them go ahead and try it, and see if it works. Ok, that would be rash advice if I thought it would work (because of UFAI), but if it has any chance of working, the only way to find out is to try it.
I’m not saying I’m willing to code that up; I’m just saying that a genetic algorithm (such as Evolution) creating agents which use heuristics to generate hypotheses (such as humans) can work at least as well as anything we’ve got so far.
I’m just saying that a genetic algorithm (such as Evolution) creating agents which use heuristics to generate hypotheses (such as humans) can work at least as well as anything we’ve got so far.
I am not persuaded that the harder Bayesians have any more concrete answer. Solmonoff induction is uncomputable and seems to unnaturally favour short hypotheses involving Busy-Beaver-sized numbers. And any computable approximation to it looks to me like brute-forcing an NP-hard problem.
Eh. I like the approach of “begin with a simple system hypothesis, and when your residuals aren’t distributed the way you want them to be, construct a more complicated hypothesis based on where the simple hypothesis failed.” It’s tractable (this is the elevator-talk version of one of the techniques my lab uses for modeling manufacturing systems), and seems like a decent approximation of Solomonoff induction on the space of system models.
It’s basically different terminology. His point is valid.
A model isn’t something you assign probability to. It’s something you use to come up with a set of prior probabilities. The model he used assumed independence. It didn’t actually assign zero probability to any result. It doesn’t assign a probability, zero or otherwise, to the machine being broken, because that’s not something that’s considered. It also doesn’t assign a probability to whether or not its raining.
From the little Moldbug I’ve been able to slog through, my main impression of him is “reader-hostile”. If he were polite maybe he would get to the effing point already.
I think his point is that you are still entirely unable to even enumerate, let alone process, all the relevant hypotheses, nor does the formula inform you of those, nor does it inform you how to deal with cyclic updates (or even that those are a complicated case), etc.
It’s particularly bad when it comes to what rationalists describe as “expected utility calculations”. The ideal expected utility is a sum of the differential effect of the actions being compared, over all hypotheses, multiplied with their probabilities… a single component of the sum provides very little or no information about the value of the sum, especially when picked by someone with a financial interest as strong as “if i don’t convince those people I can’t pay my rent”. Then, the actions themselves have an impact on the future decision making, which makes the expected value sum grow and branch out like some crazy googol-headed fractal hydra. Mostly when someone’s talking much about Bayes they have some simple and invalid expected value calculation that they want you to perform and act upon, so that you’ll be worse off in the end and they’ll be better off in the end.
More quotes by Mencius Moldbug:
They are all from the article A Reservationist Epistemology
Surely the actual Bayesian rational mind’s conclusion is that the attacker will (probably) always show a blue ball, nothing to do with the urn at all.
Solomonoff prior gives nonzero probability to the attacker deceiving us. But humans are not very good at operating with such probabilities precisely.
I just facepalmed the hardest I’ve ever done while reading Unqualified Reservations. That is, not very hard—Mencius is nothing if not a charming and polite author—but still. Maybe he really ought to read at least one Sequence!
Could we start that reading with the classic Bayes’ Theorem example? Suppose 1% of women have breast cancer, 80% of mammograms on a cancerous woman will detect it, 9.6% on an uncancerous woman will be false positives. Suppose woman A gets a mammogram which indicates cancer. What are the odds she has cancer?
p(A|X) = p(X|A)p(A)/(p(X|A)p(A)+p(X|~A)p(~A)) = 7.8% Hooray?
Now suppose women B, C, D, E, F… Z, AA, AB, AC, AD, etc., the entire patient list getting screened today, all test positive for cancer. Is the probability that woman A has cancer still 7.8%? Bayes’ rule, with the priors above, still says “yes”! You need more complicated prior probabilities (e.g. what are the odds that the test equipment is malfunctioning?) before your evidence can tell you what’s actually likely to be happening. But those more complicated, more accurate priors would have (very slightly) changed our original p(A|X) as well!
It’s not that Bayesian updating is wrong. It’s just that Bayes’ theorem never allows you to have a non-zero posterior probability coming from a zero prior, and to make any practical problem tractable everybody ends up implicitly assuming huge swaths of zero prior probability.
It’s not assuming zero probability. It’s assuming independence. Under the original model, it’s possible for all the women to get positives, but only 1% to actually have breast cancer. It’s just that a better prior would give a much higher probability.
Is there any practical difference between “assuming independent results” and “assuming zero probability for all models which do not generate independent results”? If not then I think we’ve just been exposed to people using different terminology.
No.
I think it’s more than terminology. And if Mencius can be dismissed as someone who does not really get Bayesian inference, one can surely not say the same of Cosma Shalizi, who has made the same argument somewhere on his blog. (It was a few years ago and I can’t easily find a link. It might have been in a technical report or a published paper instead.) Suppose a Bayesian is trying to estimate the mean of a normal distribution from incoming data. He has a prior distribution of the mean, and each new observation updates that prior. But what if the data are not drawn from a normal distribution, but from the sum of two such distributions with well separated peaks? The Bayesian (he says) can never discover that. Instead, his estimate of the position of the single peak that he is committed to will wander up and down between the two real peaks, like the Flying Dutchman cursed never to find a port, while the posterior probability of seeing the data that he has seen plummets (on the log-odds scale) towards minus infinity. But he cannot avoid this: no evidence can let him update towards anything his prior gives zero probability to.
What (he says) can save the Bayesian from this fate? Model-checking. Look at the data and see if they are actually consistent with any model in the class you are trying to fit. If not, think of a better model and fit that.
Andrew Gelman says the same; there’s a chapter of his book devoted to model checking. And here’s a paper by both of them on Bayesian inference and philosophy of science, in which they explicitly describe model-checking as “non-Bayesian checking of Bayesian models”. My impression (not being a statistician) is that their view is currently the standard one.
I believe the hard-line Bayesian response to that would be that model checking should itself be a Bayesian process. (I’m distancing myself from this claim, because as a non-statistician, I don’t need to have any position on this. I just want to see the position stated here.) The single-peaked prior in Shalizi’s story was merely a conditional one: supposing the true distribution to be in that family, the Bayesian estimate does indeed behave in that way. But all we have to do to save the Bayesian from a fate worse than frequentism is to widen the picture. That prior was merely a subset, worked with for computational convenience, but in the true prior, that prior only accounted for some fraction p<1 of the probability mass, the remaining 1-p being assigned to “something else”. Then when the data fail to conform to any single Gaussian, the “something else” alternative will eventually overshadow the Gaussian model, and will need to be expanded into more detail.
“But,” the soft Bayesians might say, “how do you expand that ‘something else’ into new models by Bayesian means? You would need a universal prior, a prior whose support includes every possible hypothesis. Where do you get one of those? Solomonoff? Ha! And if what you actually do when your model doesn’t fit looks the same as what we do, why pretend it’s Bayesian inference?”
I suppose this would be Eliezer’s answer to that last question.
I am not persuaded that the harder Bayesians have any more concrete answer. Solmonoff induction is uncomputable and seems to unnaturally favour short hypotheses involving Busy-Beaver-sized numbers. And any computable approximation to it looks to me like brute-forcing an NP-hard problem.
In response to:
and
I think a hard line needs to be drawn between statistics and epistemology. Statistics is merely a method of approximating epistemology—though a very useful one. The best statistical method in a given situation is the one that best approximates correct epistemology. (I’m not saying this is the only use for statistics, but I can’t seem to make sense of it otherwise)
Now suppose Bayesian epistemology is correct—i.e. let’s say Cox’s theorem + Solomonoff prior. The correct answer to any induction problem is to do the true Bayesian update implied by this epistemology, but that’s not computable. Statistics gives us some common ways to get around this problem. Here are a couple:
1) Bayesian statistics approach: restrict the class of possible models and put a reasonable prior over that class, then do the Bayesian update. This has exactly the same problem that Mencius and Cosma pointed out.
2) Frequentist statistics approach: restrict the class of possible models and come up with a consistent estimate of which model in that class is correct. This has all the problems that Bayesians constantly criticize frequentists for, but it typically allows for a much wider class of possible models in some sense (crucially, you often don’t have to assume distributional forms)
3) Something hybrid: e.g., Bayesian statistics with model checking. Empirical Bayes (where the prior is estimated from the data). Etc.
Now superficially, 1) looks the most like the true Bayesian update—you don’t look at the data twice, and you’re actually performing a Bayesian update. But you don’t get points for looking like the true Bayesian update, you get points for giving the same answer as the true Bayesian update. If you do 1), there’s always some chance that the class of models you’ve chosen is too restrictive for some reason. Theoretically you could continue to do 1) by just expanding the class of possible models and putting a prior over that class, but at some point that becomes computationally infeasible. Model checking is a computationally feasible way of approximating this process. And, a priori, I see no reason to think that some frequentist method won’t give the best computationally feasible approximation in some situation.
So, basically, a “hardline Bayesian” should do model checking and sometimes even frequentist statistics. (Similarly, a “hardline frequentist” in the epistemological sense should sometimes do Bayesian statistics. And, in fact, they do this all the time in econometrics.)
See my similar comments here and here.
I find this a curious thing to say. Isn’t this an argument against every possible remotely optimal computable form of induction or decision-making? Of course a good computable approximation may wind up spending lots of resources solving a problem if that problem is important enough, this is not a blackmark against it. Problems in the real world can be hard, so dealing with them may not be easy!
“Omega flies up to you and hands you a box containing the Secrets of Immortality; the box is opened by the solution to an NP problem inscribed on it.” Is the optimal solution really to not even try the problem—because then you’re trying “brute-forcing an NP-hard problem”! - even if it turns out to be one of the majority of easily-solved problems? “You start a business and discover one of your problems is NP-hard. You immediately declare bankruptcy because your optimal induction optimally infers that the problem cannot be solved and this most optimally limits your losses.”
And why NP-hard, exactly? You know there are a ton of harder complexity classes in the complexity zoo, right?
The right answer is simply to point out that the worst case of the optimal algorithm is going to be the worst case of all possible problems presented, and this is exactly what we would expect since there is no magic fairy dust which will collapse all problems to constant-time solutions.
There might well be a theorem formalising that statement. There might also be one formalising the statement that every remotely optimal form of induction or decision-making is uncomputable. If that’s the way it is, well, that’s the way it is.
This is an argument of the form “Suppose X were true—then X would be true! So couldn’t X be true?”
You try to find a method that solves enough examples of the NP-hard problem well enough to sell the solutions, such that your more bounded ambition puts you back in the realm of P. This is done all the time—freight scheduling software, for example. Or airline ticket price searching. Part of designing optimising compilers is not attempting analyses that take insanely long.
Harder classes are subsets of NP-hard, and everything in NP-hard is hard enough to make the point. Of course, there is the whole uncomputability zoo above all that, but computing the uncomputable is even more of a wild goose chase. “Omega flies up to you and hands you a box containing the Secrets of Immortality; for every digit of Chatin’s Omega you correctly type in, you get an extra year, and it stops working after the first wrong answer”.
No, this is pointing out that if you provide an optimal outcome barricaded by a particular obstacle, then that optimal outcome will trivially be at least as hard as that obstacle.
This is exactly the point made for computable approximations to AIXI. Thank you for agreeing.
Are you sure you want to make that claim? That all harder classes are subsets of NP-hard?
Fantastic! I claim my extra 43 years of life.
No, carelessness on my part. Doesn’t affect my original point, that schemes for approximating Solomonoff or AIXI look like at least exponential brute force search.
Since AIXI is, by construction, the best possible intelligent agent, all work on AGI can, in a rather useless sense, be described as an approximation to AIXI. To the extent that such an attempt works (i.e. gets substantially further than past attempts at AGI), it will be because of new ideas not discovered by brute force search, not because it approximates AIXI.
43 years is a poor sort of immortality.
Well, yeah. Again—why would you expect anything else? Given that there exist problems which require that or worse for solution? How can a universal problem solver do any better?
Yes.
No. Given how strange and different AIXI works, it can easily stimulate new ideas.
It’s more than I had before.
The spin-off argument. Here’s a huge compendium of spinoffs of previous approaches to AGI. All very useful, but not AGI. I’m not expecting better from AIXI.
Hm, so let’s see; you started off mocking the impossibility and infeasibility of AIXI and any computable version:
Then you admitted that actually every working solution can be seen as a form of SI/AIXI:
And now you’re down to arguing that it’ll be “very useful, but not AGI”.
Well, I guess I can settle for that.
I stand by the first quote. Every working solution can in a useless sense be seen as a form of SI/AIXI. The sense that a hot-air balloon can be seen as an approach to landing on the Moon.
At the very most. Whether AIXI-like algorithms get into the next edition of Russell and Norvig, having proved of practical value, well, history will decide that, and I’m not interested in predicting it. I will predict that it won’t prove to be a viable approach to AGI.
How can a hot air balloon even in theory be seen as that? Hot air has a specific limit, does it not—where its density equals the outside density?
Isn’t there a very wide middle ground between (1) assigning 100% of your mental probability to a single model, like a normal curve and (2) assigning your mental probability proportionately across every conceivable model ala Solomonoff?
I mean the whole approach here sounds more philosophical than practical. If you have any kind of constraint on your computing power, and you are trying to identify a model that most fully and simply explains a set of observed data, then it seems like the obvious way to use your computing power is to put about a quarter of your computing cycles on testing your preferred model, another quarter on testing mild variations on that model, another quarter on all different common distribution curves out of the back of your freshman statistics textbook, and the final quarter on brute-force fitting the data as best you can given that your priors about what kind of model to use for this data seem to be inaccurate.
I can’t imagine any human being who is smart enough to run a statistical modeling exercise yet foolish enough to cycle between two peaks forever without ever questioning the assumption of a single peak, nor any human being foolish enough to test every imaginable hypothesis, even including hypotheses that are infinitely more complicated than the data they seek to explain. Why would we program computers (or design algorithms) to be stupider than we are? If you actually want to solve a problem, you try to get the computer to at least model your best cognitive features, if not improve on them. Am I missing something here?
Yes, the question is what that middle ground looks like—how you actually come up with new models. Gelman and Shalizi say it’s a non-Bayesian process depending on human judgement. The behaviour that you rightly say is absurd, of the Bayesian Flying Dutchman, is indeed Shalizi’s reductio ad absurdum of universal Bayesianism. I’m not sure what gwern has just been arguing, but it looks like doing whatever gets results through the week while going to the church of Solomonoff on Sundays.
An algorithmic method of finding new hypotheses that works better than people is equivalent to AGI, so this is not an issue I expect to see solved any time soon.
Eh. What seems AGI-ish to me is making models interact fruitfully across domains; algorithmic models to find new hypotheses for a particular set of data are not that tough and already exist (and are ‘better than people’ in the sense that they require far less computational effort and are far more precise at distinguishing between models).
Yes, I had in mind a universal algorithmic method, rather than a niche application.
The hypothesis-discovery methods are universal; you just need to feed them data. My view is that the hard part is picking what data to feed them, and what to do with the models they discover.
Edit: I should specify, the models discovered grow in complexity based on the data provided, and so it’s very difficult to go meta (i.e. run hypothesis discovery on the hypotheses you’ve discovered), because the amount of data you need grows very rapidly.
Hmmm. Are we going to see a Nobel awarded to an AI any time soon?
I don’t think any robot scientists would be eligible for Nobel prizes; Nobel’s will specifies persons. We’ve had robot scientists for almost a decade now, but they tend to excel in routine and easily automatized areas. I don’t think they will make Nobel-level contributions anytime soon, and by the time they do, the intelligence explosion will be underway.
What if, as a computational approximation of the universal prior, we use genetic algorithms to generate a collection of agents, each using different heuristics to generate hypotheses? I mean, there’s probably better approximations than that; but we have strong evidence that this one works and is computable.
Whatever approach to AGI anyone has, let them go ahead and try it, and see if it works. Ok, that would be rash advice if I thought it would work (because of UFAI), but if it has any chance of working, the only way to find out is to try it.
I’m not saying I’m willing to code that up; I’m just saying that a genetic algorithm (such as Evolution) creating agents which use heuristics to generate hypotheses (such as humans) can work at least as well as anything we’ve got so far.
If you have a few billion years to wait.
No reason that can’t be sped up.
Eh. I like the approach of “begin with a simple system hypothesis, and when your residuals aren’t distributed the way you want them to be, construct a more complicated hypothesis based on where the simple hypothesis failed.” It’s tractable (this is the elevator-talk version of one of the techniques my lab uses for modeling manufacturing systems), and seems like a decent approximation of Solomonoff induction on the space of system models.
It’s basically different terminology. His point is valid.
A model isn’t something you assign probability to. It’s something you use to come up with a set of prior probabilities. The model he used assumed independence. It didn’t actually assign zero probability to any result. It doesn’t assign a probability, zero or otherwise, to the machine being broken, because that’s not something that’s considered. It also doesn’t assign a probability to whether or not its raining.
From the little Moldbug I’ve been able to slog through, my main impression of him is “reader-hostile”. If he were polite maybe he would get to the effing point already.
I think his point is that you are still entirely unable to even enumerate, let alone process, all the relevant hypotheses, nor does the formula inform you of those, nor does it inform you how to deal with cyclic updates (or even that those are a complicated case), etc.
It’s particularly bad when it comes to what rationalists describe as “expected utility calculations”. The ideal expected utility is a sum of the differential effect of the actions being compared, over all hypotheses, multiplied with their probabilities… a single component of the sum provides very little or no information about the value of the sum, especially when picked by someone with a financial interest as strong as “if i don’t convince those people I can’t pay my rent”. Then, the actions themselves have an impact on the future decision making, which makes the expected value sum grow and branch out like some crazy googol-headed fractal hydra. Mostly when someone’s talking much about Bayes they have some simple and invalid expected value calculation that they want you to perform and act upon, so that you’ll be worse off in the end and they’ll be better off in the end.