I plan to develop this into a top level post, and it expands on my ideas in this comment, this comment, and the end of this comment. I’m interested in what LWers have to say about it.
Basically, I think the concept of intelligence is somewhere between a category error and a fallacy of compression. For example, Marcus Hutter’s AIXI purports to identify the inferences a maximally-intelligent being would make, yet it (and its efficient approximations) has no practical application. The reason (I think) is that it works by finding the shortest hypothesis that fits any data given to it. This means it makes the best inference, on average, over all conceivable worlds it could be placed in. But the No Free Lunch theorems suggest that this means it will be suboptimal compared to any algorithm tailored to any specific world. At the very least, having to be optimal for all of the random worlds and anti-inductive worlds should imply poor performance in this world.
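Here’s a quick toy check of the NFL intuition (not the theorems themselves): average any deterministic next-bit predictor over all length-n binary “worlds”, each weighted equally, and its accuracy comes out to exactly 1/2. The two predictors below are arbitrary stand-ins.

```python
from itertools import product

def predict_always_zero(history):
    return 0

def predict_repeat_last(history):
    return history[-1] if history else 0

def average_accuracy(predictor, n=10):
    # Brute-force average over every possible length-n binary world.
    total_hits = 0
    for world in product([0, 1], repeat=n):
        total_hits += sum(predictor(list(world[:i])) == world[i] for i in range(n))
    return total_hits / (n * 2 ** n)

for p in (predict_always_zero, predict_repeat_last):
    print(p.__name__, average_accuracy(p))   # both print 0.5
```

Whatever a predictor gains in one subset of worlds, it gives back somewhere else in that average; doing better than chance requires the actual worlds to be a biased sample of the possible ones.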
The point is that I think “intelligence” can refer to two useful but very distinct attributes: 1) the ability to find the shortest hypothesis fitting the available data, and 2) having beliefs (a prior probability distribution) about one’s world that are closest to (have the smallest KL divergence from) that world. (These attributes roughly correspond to what we intuit as “book smarts” and “street smarts” respectively.) A being can “win” if it does well on 2) even if it’s not good at 1): using a good prior can be more advantageous than finding short hypotheses, because the prior already points you to the right hypothesis.
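To make 2) concrete with a made-up example: the expected surprise (log-loss) of an agent whose beliefs are q, in a world that actually behaves like p, works out to H(p) + KL(p||q), so whoever has the smaller KL divergence is less surprised per observation, however they came by their prior.

```python
import math

p  = [0.70, 0.15, 0.10, 0.05]   # the world's true distribution (invented)
q1 = [0.65, 0.20, 0.10, 0.05]   # a "street smart" prior, close to the world
q2 = [0.25, 0.25, 0.25, 0.25]   # a maximally noncommittal prior

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expected_log_loss(p, q):
    # Average bits of surprise per observation when believing q in a p-world.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

entropy = expected_log_loss(p, p)
for name, q in (("close prior  ", q1), ("uniform prior", q2)):
    print(name, "KL =", round(kl(p, q), 3),
          " expected log-loss =", round(expected_log_loss(p, q), 3),
          " = H(p) + KL =", round(entropy + kl(p, q), 3))
```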
Making something intelligent means optimizing the combination of the two that it has, given your resources. What’s more, no one algorithm can be generally optimal for finding the current world’s probability distribution, because that would also violate the NFL theorems.
Organisms on earth have high intelligence in the second sense. Over their evolutionary history they had to make use of whatever regularity they could find in their environment, and the ability to use this regularity became “built in”. So the history of evolution shows the result of one approach to finding the environment’s distribution (ETC), and making an intelligent being means improving upon this method and programming it to “springboard” from that prior with intelligence in the first sense.
Thoughts?
Basically, I think the concept of intelligence is somewhere between a category error and a fallacy of compression.
This may be tangential to your point, but it’s worth remembering that human intelligence has a very special property, which is that it is strongly domain-independent. A person’s ability to solve word puzzles correlates with her ability to solve math puzzles. So you can measure someone’s IQ by giving her a logic puzzle test, and the score will tell you a lot about the person’s general mental capabilities.
Because of that very special property, people feel more or less comfortable referring to “intelligence” as a tangible thing that impacts the real world. If you had to pick between two doctors to perform a life-or-death operation, and you knew that one had an IQ of 100 and the other an IQ of 160, you would probably go with the latter. Most people would feel comfortable with the statement “Harvard students are smarter than high school dropouts”, and make real-world predictions based on it (e.g., a Harvard student is more likely to be able to write a good computer program than a high school dropout, even if the former didn’t study computer science).
The point is that there’s no reason this special domain-independence property of human intelligence should hold for non-human reasoning machines. So while it makes sense to score humans based on this “intelligence” quantity, it might be totally meaningless to attempt to do so for machines.
This may be tangential to your point, but it’s worth remembering that human intelligence has a very special property, which is that it is strongly domain-independent.
Not so fast. Human intelligence is relatively domain independent. But human minds are constantly exploiting known regularities of the environment (by making assumptions) to make better inferences. These regularities make up a tiny sliver of the Platonic space of generating functions. By (correctly) assuming we’re in that sliver, we vastly improve our capabilities compared to if we were AIXIs lacking that knowledge.
Human intelligence appears strongly domain-independent because it generalizes to all the domains that we see. It does not generalize to the full set of computable environments—no intelligence can do that while still performing as well in each as we do in this environment.
Non-human animals are likewise “domain-independently intelligent” for the domains that they exist in. Most humans would die, for example, if dropped in the middle of the desert, ocean, or arctic.
But human minds are constantly exploiting known regularities of the environment (by making assumptions) to make better inferences.
Not just by making assumptions: you can learn (domain-specific) optimizations that don’t introduce new info, but improve ability, allowing you to understand more from the info you have (better conceptual pictures for natural science; math).
Another example of how domain-dependent human intelligence actually is: optical illusions.
Optical illusions occur when an image violates an assumption your brain makes to interpret visual data, causing it to misinterpret the image. And remember, this is only going slightly outside the boundary of the assumptions your brain makes.
But the No Free Lunch theorems suggest that this means it will be suboptimal compared to any algorithm tailored to any specific world.
This is a subtle point. The NFL theorem does prohibit any algorithm from doing well over all possible worlds. But Solomonoff induction does well on any world that has any kind of computable regularity. If there is no computable regularity, then no prior can do well. In fact, the Solomonoff prior does just as well asymptotically as any computable prior.
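For reference, the dominance property behind that last claim, stated roughly (the constant depends on the choice of reference machine): for any computable prior (lower-semicomputable semimeasure) $\mu$ and any finite sequence $x$,

$$M(x) \;\ge\; 2^{-K(\mu)}\,\mu(x), \qquad \text{i.e.} \qquad -\log_2 M(x) \;\le\; -\log_2 \mu(x) + K(\mu),$$

where $M$ is the universal (Solomonoff) prior and $K(\mu)$ is a constant that depends only on $\mu$, not on the data. So the universal prior loses at most a fixed number of bits to any computable prior, no matter how much data comes in.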
As is often the case, thinking in terms of codes can clear up the issue. A world is a big data file. Certainly, an Earth-specific algorithm can get good compression rates if it is fed data that comes from Earth. But as the data file gets large, the Solomonoff general-purpose compression algorithm will achieve compression rates that are nearly as good; in the worst case, it just has to prepend the code of the Earth-specific algorithm to its encoded data stream, and it only underperforms by that program size.
The reason AIXI doesn’t work in practice is that the “efficient approximations” aren’t really efficient, or aren’t good approximations.
If there is no computable regularity, then no prior can do well. In fact, the Solomonoff prior does just as well asymptotically as any computable prior.
This seems to be a common belief. But see this discussion I had with Eliezer where I offered some good arguments and counterexamples against it.
The link goes to the middle, most relevant part of the discussion. But if you look at the top of it, I’m not arguing against the Solomonoff approach, but instead trying to find a generalization of it that makes more sense.
I’ve linked to that discussion several times in my comments here, but I guess many people still haven’t seen it. Maybe I should make a top-level post about it?
This is a subtle point. The NFL theorem does prohibit any algorithm from doing well over all possible worlds. But Solomonoff induction does well on any world that has any kind of computable regularity.
Okay, fair point. But by smearing its optimality over the entire Platonic space of computable functions, it is significantly worse than those algorithms tuned for this world’s function. And not surprisingly, AIXI has very little practical application.
If there is no computable regularity, then no prior can do well. In fact, the Solomonoff prior does just as well asymptotically as any computable prior.
And that’s unhelpful when, as is likely, you don’t hit that asymptote until the heat death of the universe.
The reason AIXI doesn’t work in practice is that the “efficient approximations” aren’t really efficient, or aren’t good approximations.
My point is that the most efficient approximations can’t be efficient in any absolute sense. In order to make AIXI useful, you have to feed it information about which functions it can safely skip over—in other words, feed it intelligence of type 2), the information about its environment that you already gained through other methods. Which just shows that those kinds of intelligence are not the same.
Actually Solomonoff induction is insanely fast. Its generality is not just that it learns everything to as good an extent as anything else, but also that typically it learns everything from indirect observations “almost as fast as directly” (not really, but...). The “only problem” is that Solomonoff induction is not an algorithm, and so its “speed” is for all practical purposes a meaningless statement.
When someone says “very fast, but uncomputable”, what I hear is “dragon in garage, but invisible”.
Generalize that to a good chunk of classical math.
The analog would be to theorem proving. No one claims that knowing the axioms of math gets you to every theorem “very fast”—because the problem of finding a proof/disproof for an arbitrary proposition is also uncomputable.
A “solution” might be that only proofs matter, while theorems (as formulas) are in general meaningless in themselves, only useful as commentary on proofs.
Nevertheless, the original point stands: no one says “I’ve discovered math! Now I can learn the answer to any math problem very fast.” In contrast, you are saying that because we have Solomonoff induction, we can infer distributions “very fast”.
In contrast, you are saying that because we have Solomonoff induction, we can infer distributions “very fast”.
To be more precise, we can specify the denotation of distributions very close to the real deal from very few data. This technical sense doesn’t allow the analogy with theorem-proving, which is about algorithms, not denotation.
But the analogy is in terms of the “fast but uncomputable” oxymoron.
But by smearing its optimality over the entire Platonic space of computable functions, it is significantly worse than those algorithms tuned for this world’s function. And not surprisingly, AIXI has very little practical application.
Let’s do another thought experiment. Say that humanity has finally resolved to send colonists to nearby star systems. The first group is getting ready to head out to Alpha Centauri.
The plan is that after the colonists arrive and set up their initial civilization, they will assemble a data archive of size T about the new world and send it back to Earth for review. Now it is expensive to send data across light-years, so obviously they want to minimize the number of bits they have to send. So the question is: what data format do the two parties agree on at the moment of parting?
If T is small, then it makes sense to think this issue over quite a bit. What should we expect the data to look like? Will it be images, audio, health reports? If we can figure something out about what the data will look like in advance (i.e., choose a good prior), then we can develop a good data format and get short codes.
But if T is large (terabytes) then there’s no point in doing that. When the Alpha Centauri people build their data archive, they spend some time analyzing it and figuring out ways to compress it. Finally they find a really good compression format (=prior). Of course, Earth doesn’t know the format—but that doesn’t matter, since the specification for the format can just be prepended to the transmission.
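To put made-up numbers on the tradeoff (a hypothetical 50 MB format specification and a hypothetical 10% gain from the custom format):

```python
SPEC_SIZE = 50_000_000     # invented: full description of the custom format, in bytes
SAVINGS   = 0.10           # invented: the custom format compresses 10% better overall

for T in (200_000_000, 5_000_000_000_000):        # a 200 MB archive vs. a 5 TB archive
    generic = T                                    # baseline: the pre-agreed generic coding
    custom  = SPEC_SIZE + (1 - SAVINGS) * T        # prepended spec + better compression
    print(f"T = {T:>16,d} bytes   custom/generic = {custom / generic:.3f}")
```

With a small archive, the prepended spec eats the gains (the ratio is above 1); with terabytes it is noise.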
I think this thought experiment is nice because it reveals the pointlessness of a lot of philosophical debates about Solomonoff, Bayes, etc. Of course the colonists have to choose a prior before the moment of parting, and of course if they choose a good prior they will get short codes. And the Solomonoff distribution may not be perfect in some metaphysical sense, but it’s obviously the right prior to choose in the large T regime. Better world-specific formats exist, but their benefit is small compared to T.
The choice that they will prepend a description (and the format of the description) is a choice of prior.
I think this thought experiment is nice because it reveals the pointlessness of a lot of philosophical debates about Solomonoff, Bayes, etc. Of course the colonists have to choose a prior before the moment of parting, and of course if they choose a good prior they will get short codes. And the Solomonoff distribution may not be perfect in some metaphysical sense, but it’s obviously the right prior to choose in the large T regime. Better world-specific formats exist, but their benefit is small compared to T.
Well, the thought experiment doesn’t accomplish that. Solomonoff induction is not necessarily optimal (and most probably isn’t optimal) in your scenario, even and especially for large T. The amount of time it takes for any computable Occamian approximation of S/I to find the optimal encoding is superexponential in the length of the raw source data. So the fact that it will eventually get to a superior or near-superior encoding is little consolation, when Alpha Centauri and Sol will have long burned out before Solomonoff has converged on a solution.
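Just counting candidates, and ignoring the halting problem entirely, already gives a sense of the scale: a shortest-first search may have to consider on the order of every bit string shorter than the data.

```python
import math

# Rough count of candidate programs a shortest-first search might have to consider,
# before even accounting for the runtime of each candidate.
for n_bytes in (100, 1_000, 1_000_000):
    digits = 8 * n_bytes * math.log10(2)       # 2^(8*N) candidates, expressed as 10^digits
    print(f"{n_bytes:>9,d} bytes of raw data -> ~10^{digits:,.0f} candidate programs")
```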
The inferiority of Solomonoff (Occamian) induction, which iterates up through generating algorithms, shortest first, until the data is matched, is not some metaphysical or philosophical issue; rather, it deals directly with the real-world time constraints that arise in practical situations.
My point is, any practical attempt to incorporate Solomonoff induction must also make use of knowledge of the data’s regularity that was found some other way, making it questionable whether Solomonoff induction incorporates everything we mean by “intelligence”. This incompleteness also raises the issue of which this-world-specific methods we actually did use to get to the current state of knowledge that makes Bayesian inference effective.
One pattern I have noticed: those who think the No Free Lunch theorems are interesting and important are usually the people who talk the most nonsense about them. The first thing people need to learn about those theorems is how useless and inapplicable to most of the real world they are.
So, you’re disagreeing that an algorithm that is optimal, on average, over a set of randomly-selected computable environments, will perform worse in any specific environment than an algorithm optimized specifically for that environment?
Because if not, that’s all I need to make my point, no matter what subtlety of NFL I’m missing. (Actually, I can probably make it with an even weaker premise, and could have gone without NFL altogether, but it grants some insight on the issue I’m trying to illuminate.)
The NFL deals with a space of all possible problems—while the universe typically presents embedded agents with a subset of those problems that are produced by short programs or small mechanisms. So: the NFL theorems rarely apply to the real world. In the real world, there are useful general-purpose compression algorithms.
The NFL deals with a space of all possible problems—while the universe typically presents embedded agents with a subset of those problems that are produced by short programs or small mechanisms.
Okay. I stated the NFL-free version of the premise I need. If you agree with that, this point is moot.
In the real world, there are useful general-purpose compression algorithms.
Now I know I’m definitely not using NFL, because I agree with this and it’s consistent with the point in my initial post.
Yes, there are useful general-purpose compressors: researchers recognize regularities that appear across all types of files (which there must be, because raw data is rarely purely random). But they identify this regularity before writing the compressor, which then exploits it by (basically) reserving shorter codes for the kinds of data consistent with that regularity.
Likewise, people have identified regularities specific to video files: each frame is very similar to the last. And regularities specific to picture files: each column or row is very similar to its neighbors.
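Here’s a crude stand-in for the video case, with synthetic “frames”; each frame is deliberately larger than zlib’s 32 KB history window, so the general-purpose compressor can’t discover the frame-to-frame similarity on its own, while a format written with that regularity in mind exploits it trivially.

```python
import random
import zlib

# Synthetic "video": 50 frames of 100,000 random bytes, each differing from the
# previous frame in only 50 positions.
random.seed(0)
frame = bytearray(random.randrange(256) for _ in range(100_000))
frames = [bytes(frame)]
for _ in range(49):
    for _ in range(50):
        frame[random.randrange(len(frame))] = random.randrange(256)
    frames.append(bytes(frame))

raw = b"".join(frames)
deltas = frames[0] + b"".join(
    bytes(a ^ b for a, b in zip(prev, cur))     # store each frame as an XOR against the last
    for prev, cur in zip(frames, frames[1:]))

print("zlib on raw frames          :", len(zlib.compress(raw, 9)), "bytes")
print("zlib on frame-delta encoding:", len(zlib.compress(deltas, 9)), "bytes")
```

The delta version wins by a large factor, and only because we built in a regularity we already knew about; nothing in the general-purpose compressor went looking for it.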
But what they did not do was write an unbiased, Occamian prior program that went through various files and told them what regularities existed, because finding the shortest compression is uncomputable. Rather, they imported prior knowledge of the distribution of data in certain types of files, gained through some other method (type 2 intelligence in my convention), and tailored the compression algorithm to that distribution.
No “universal, all purpose” algorithm found that knowledge.
I should probably give you some proper feedback, as well as caustic comments.
The intelligence subdivision looks useful and interesting—though innate intelligence is usually referred to as being ‘instinctual’.
However, I was less impressed with the idea that the concept of intelligence lies somewhere between a category error and a fallacy of compression.
Okay, thanks for the proper feedback :-)
And I may be leaning more toward the “fallacy of compression” side, I’ll grant that. But I don’t see how you’d disagree with it, since you find the subdivision I outlined to have some potential. If people are unknowingly shifting between two very different meanings of intelligence, that certainly is a fallacy of compression.
Another point: I’m not sure your description of AIXI is particularly great. AIXI works where Solomonoff induction works. Solomonoff induction works pretty well in this world. It might not be perfect—due to reference machine issues—but it is pretty good. AIXI would work very badly in worlds where Solomonoff induction was a misleading guide to its sense data. Its performance in this world doesn’t suffer through trying to deal with those worlds—since in those worlds it would be screwed.
Well, actually you’re highlighting the issue I raised in my first post: computable approximations of Solomonoff induction work pretty well … when fed useful priors! But those priors come from a lot of implicit knowledge about the world that skips over an exponentially large number of shorter hypotheses by the time you get to applying it to any specific problem.
AIXI (and computable approximations), starting from a purely Occamian prior, is stuck iterating through lots of generating functions before it gets to the right one—infeasibly long. To speed it up you have to feed it knowledge you gained elsewhere (and of course, find a way to represent that knowledge). But at that point, your prior includes a lot more than a penalty for length!
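A toy sketch of the contrast I mean; the hypothesis “language” (a handful of Python expressions in i) and the data are invented, and a real enumeration over all programs is astronomically worse, with some candidates that never halt at all.

```python
OBSERVED = [0, 1, 1, 0, 1, 1, 0, 1, 1]

# Candidate generators ranked only by description length ("Occamian" order).
OCCAMIAN_ORDER = sorted(
    ["0", "1", "i % 2", "i % 3", "1 - i % 2", "0 if i % 3 == 0 else 1"],
    key=len)

# An informed prior: knowledge gained elsewhere says period-3 patterns are likely here.
INFORMED_ORDER = ["0 if i % 3 == 0 else 1", "i % 3"]

def fits(expr, data):
    return all(eval(expr, {"i": i}) == bit for i, bit in enumerate(data))

def search(order, data):
    # Return the first hypothesis consistent with the data, and how many were tried.
    for tried, expr in enumerate(order, start=1):
        if fits(expr, data):
            return expr, tried
    return None, len(order)

print("Occamian order :", search(OCCAMIAN_ORDER, OBSERVED))
print("Informed prior :", search(INFORMED_ORDER, OBSERVED))
```

Six candidates versus one is trivial here; the point is only that the speedup comes from knowledge about the environment smuggled in through the ordering, not from the length penalty itself.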
But the No Free Lunch theorems suggest that this means it will be suboptimal compared to any algorithm tailored to any specific world.
NFL theorems are about max-entropy worlds. Solomonoff induction works on highly lawful, simplicity-biased, low-entropy worlds.
If you could actually do Solomonoff induction, you would become at least as smart as a human baby in roughly 0 seconds (some rounding error may have occurred).
NFL theorems are about max-entropy worlds. Solomonoff induction works on highly lawful, simplicity-biased, low-entropy worlds.
The same (or a similar) point applies. If you limit yourself to the set of lawful worlds and use an Occamian prior, you will start off much worse than an algorithm that implicitly assumes a prior that’s close to the true distribution. As Solomonoff induction works its way up through longer algorithms, it will hit some that run into an infinite loop. Even if you program a constraint that gets it past or out of these, the optimality is only present “after a long time”, which, in practice, means later than we need or want the results.
If you could actually do Solomonoff induction, you would become at least as smart as a human baby in roughly 0 seconds (some rounding error may have occurred).
What else can you tell us about the implications of being able to compute uncomputable functions?
As Solomonoff induction works its way up through longer algorithms, it will hit some that run into an infinite loop. Even if you program a constraint that gets it past or out of these, the optimality is only present “after a long time”, which, in practice, means later than we need or want the results.
You are arguing against a strawman: it’s not obvious that there are no algorithms that approximate Solomonoff induction well enough in practical cases. Of course there are silly implementations that are way worse than magical oracles.
it’s not obvious that there are no algorithms that approximate Solomonoff induction well enough in practical cases.
Right, but any such approximation works by introducing a prior about which functions it can skip over. And for such knowledge to actually speed it up, it must involve knowledge (gained separately from S/I) about the true distribution.
But at that point, you’re optimizing for a narrower domain, not implementing universal intelligence. (In my naming convention, you’re bringing in type 2 intelligence.)
It introduces a prior, period. Not a prior about “skipping over”. Universal induction doesn’t have to “run” anything in a trivial manner.
You can’t “speed up” an uncomputable non-algorithm.
Okay, we’re going in circles. You had just mentioned possible computable algorithms that approximate Solomonoff induction.
it’s not obvious that there are no algorithms that approximate Solomonoff induction well enough in practical cases. [emphasis added—SB]
So, we were talking about approximating algorithms. The point I was making, in response to this argument that “well, we can have working algorithms that are close enough to S/I”, was that to do so, you have to bring in knowledge of the distribution gained some other way, at which point it is no longer universal. (And, in which case talk of “speeding up” is meaningful.)
Demonstrating my point that universal intelligence has its limits and must combine with intelligence in a different sense of the term.
You introduce operations on the approximate algorithms (changing the algorithm by adding data), something absent from the original problem. What doesn’t make sense is to compare “speed” of non-algorithmic specification with the speed of algorithmic approximations. And absent any approximate algorithms, it’s also futile to compare their speed, much less propose mechanisms for their improvement that assume specific structure of these absent algorithms (if you are not serious about exploring the design space in this manner to obtain actual results).
You introduce operations on the approximate algorithms (changing the algorithm by adding data), something absent from the original problem.
What you call “the original problem” (pure Solomonoff induction) isn’t. It’s not a problem. It can’t be done, so it’s a moot point.
What doesn’t make sense is to compare “speed” of non-algorithmic specification with the speed of algorithmic approximations
Sure it does. The uncomputable Solomonoff induction has a speed of zero. Non-halting approximations have a speed greater than zero. Sounds comparable to me for the purposes of this discussion.
And absent any approximate algorithms, it’s also futile to compare their speed, much less propose mechanisms for their improvement that assume specific structure of these absent algorithms (if you are not serious about exploring the design space in this manner to obtain actual results).
There are approximate algorithms. Even Bayesian inference counts. And my point is that anything you add to Solomonoff induction to make it useful is, directly or indirectly, introducing a prior unique to the search space—clearly showing the distinctness of type 2 intelligence.
To wrap up (as an alternative to not replying):
I don’t understand why you’d continue arguing definitions about speed of Solomonoff induction or it being “the original problem”. It’s clear what we both mean.
I believe you are wrong in your general statements about what needs to be done to implement approximate Solomonoff induction. Since we don’t technically define in what sense this approximation has to be general, there remain possibilities for a good technical definition that preserves “generality” in an approximate implementation.
I don’t understand why you’d continue arguing definitions about speed of Solomonoff induction or it being “the original problem”. It’s clear what we both mean.
A better question would be why you brought up the issue. We both knew what the other meant before that, but you kept bringing it up.
I believe you are wrong … there remain possibilities for a good technical definition that preserves “generality” in an approximate implementation.
Okay, well, I’ll believe it when I see it. In the meantime, I suspect it will be far more productive to exploit whatever regularity we already know about the environment, and work on building that into the inference program’s prior. (Arguably, even the Occamian prior does that by using our hard-won belief in the universe’s preference for simplicity!)