The core technical problem of machine intelligence is building an agent that correctly performs inductive inference.
This seems to be highly non-obvious. Even if an AI already had access to a theory of everything, and could engage in near-optimal induction, it isn’t at all clear that this helps much for practical purposes.
You are overestimating how much we can do just by compression. The key issue is not just the ability to predict accurately but the ability to predict accurately when using limited resources. For example, let A(n) be the Ackermann function and let P(n) be the nth prime number. Then the sequence described by P(A(n)) mod 3 is really compressible. But the time and space resources needed to expand that compressed form is probably massive.
There’s a similar concern here. To again use the protein folding example, even if an AI has a really good model for predicting how proteins will fold if it takes a lot of time to run that model, then it doesn’t do a good job. Similarly, if a pattern rests on the behavior of prime numbers, then the smallest Turing machine which outputs 1 iff a number is prime is probably Euclid’s sieve. But the AKS algorithm which requires a much larger Turing machine will return the data in less time.
What you are saying is correct—but it seems kind-of obvious. Without resource limits, you could compress perfectly, using a variant of Solomonoff induction.
However, to get from there to me overestimating the value of compression, seems like quite a leap. I value stream compression highly because it is the core technical problem of machine intelligence.
I am not claiming that compression ratio is everything, while resource usage is irrelevant. Rather, sensible resource usage is an intrinsic property of a good quality compressor. If you think I am not giving appropriate weight to resource usage, that’s just a communications failure.
I do go on about resource considerations—for example here:
Another issue is that compressors are not just judged by the size of their output. There are other issues to consider—for example, their compression speed, their usage of space, and their energy demands.
If you prefer a different framework think of it this way: compression matters, but for many interesting sequences how much practical compression you can with given time and space constraints depends on complexity theoretic questions.
Where are you getting the impression that this is somehow different from what I thought in the first place?
Or maybe don’t answer that. I would rather you spent your time revisiting the idea that massive software optimization progress runs into strongly believed theoretical computer science limits.
Maybe eventually—but not for a long way yet, allowing for impressive progress in the interim on the central problem of machine intelligence: induction.
Where are you getting the impression that this is somehow different from what I thought in the first place?
I got this from your remark that:
The core technical problem of machine intelligence is building an agent that correctly performs inductive inference. That is not a problem where we are particularly close to an optimal solution. Rather, solving it looks really, really difficult. Possibly machines will crack the problem and rapidly zoom up to optimal performance. However, that would represent tremendous improvement, not a lack of self-improvement.
So: your concerns about P and NP don’t seem very relevant to me.
I would rather you spent your time revisiting the idea that massive software optimization progress runs into strongly believed theoretical computer science limits.
I’m not sure what aspect of this you want me to expand. If the complexity hierarchy doesn’t collapse to a moderate extent (in the sense that P, NP, co-NP, PSPACE, and EXP are all distinct) then many practical problems that come up cannot have that much improvement. Moreover, this can be made more rigorous with slightly stronger assumptions. Thus for example, the exponential time hypothesis is a (somewhat) explicit conjecture about the bounds on solving 3-SAT. If this conjecture is true, then this gives severe limits on how much one can improve both graph coloring, traveling salesman, and Steiner trees, among other practical NP -complete problems. Both these problems show up in real world context, in terms of things like hardware and memory design, while some of these come up directly in predicting the behavior of real world systems.
Moreover, one of the very fruitful set of results in the last few years (albeit an area I must say that I don’t have much knowledge about) has been in taking NP-hard problems and producing approximate solutions or asking how good an approximation one can get. Those results however, have shown that even some old approximation algorithms are asymptotically best possible if P != NP (and require no other assumptions to show that) which include the approximate forms of graph coloring and the set cover problem. Moreover, if one assumes slightly stronger assumptions than P != NP, one gets similar results for other NP problems. It is noteworthy that many of the approximation algorithms in question date back to the 1970s, which suggests that even when finding approximation algorithms one gets close to best possible very quickly.
Another class of problems to look at are problems which are just frequently used. Common examples are finding gcds, multiplying numbers, and the two closely related problems of multiplying matrices and finding determinants.
In the case of gcds, we have near optimal algorithms in strong senses (although it is open whether there’s some way of making gcd calculations parallel, most people suspect that there’s not much improvement possible here).
Multiplying numbers can be done more efficiently than naive multiplication (which is O((log n)^2) for 2 n digits numbers) by using a version of the fast Fourier transform. This may be an example where unexpected speed-ups really do exist in what may be long-standing basic processes. However, there are clear limits to how much this can be improved, since explicit bounds exist for minimum computation required by FFTs (and these bounds exist without explicit constants). While there are still open questions about how efficient one can make FFTs, even if one assumed the most generous possible marginally plausible results, FFTs would not become much more efficient.
In the case of finding determinants, we have good algorithms that are much more efficient than the naive algorithms. For an n x n it follows from not too much linear algebra that one can calculate determinants in O(n^3) arithmetic operations, and better algorithms exist that run around O(n^A) where A is a nasty constant slightly less than 2.4. There’s a clear limit here to how much this can be improved. One cannot go better than O(n^2) because an n x n matrix has n^2 entries. So this is close to the naive lower bound. Not much room for improvement there. The primary room for improvement is that this uses the Coppersmith–Winograd for matrix multiplication, which is the most efficient algorithm for matrix multiplication known, but it isn’t used in practice, because the constants involved make it less efficient than other algorithms in the practical range of matrices (I don’t know precisely where CW exceeds other algorithms I think it is around n = 10^9 or so.) so there’s room for improvement there, but not much room for improvement. And most people who have thought about these issues are confident that determinants cannot be calculated in O(n^2).
So the upshot is that while there’s still work to do, for a lot of these problems, it doesn’t look like there’s room for “massive software optimization”.
Where are you getting the impression that this is somehow different from what I thought in the first place?
I got this from your remark that:
The core technical problem of machine intelligence is building an agent that correctly performs inductive inference. That is not a problem where we are particularly close to an optimal solution. Rather, solving it looks really, really difficult. Possibly machines will crack the problem and rapidly zoom up to optimal performance. However, that would represent tremendous improvement, not a lack of self-improvement.
So: your concerns about P and NP don’t seem very relevant to me.
I would rather you spent your time revisiting the idea that massive software optimization progress runs into strongly believed theoretical computer science limits.
So, when you said: “complexity theoretic questions”, you were actually talking about the possible equivalence of different classes within the complexity hierarchy? That wasn’t obvious to me—I took what you wrote as referring to any old complexity theoretic questions. Such as what language to use when measuring K-complexity. The idea that K-complexity has something to do with compression ratios is what I was calling obvious. A communications breakdown.
OK. FWIW, I don’t need examples of existing algorithms approaching theoretical limits.
As I said, the main problem where I think there is important scope for software improvement is induction. By my estimate, the brain spends about 80% of its resources on induction—so performance on induction seems important.
Current performance on the Hutter prize suggests that a perfect compressor could do about 25% better on the target file than the current champion program does.
So, perhaps it depends on how you measure it. If you think a mere 25% is not much improvement, you may not be too impressed. However, there are a few things to bear in mind:
Induction progress works in a similar manner to the rating scale in go: the higher you climb, the more difficult it is to make further progress.
There’s another similarity to go’s rating scale. In go, God is estimated to be 11-dan—not all that much better than a 9-dan human champion. However, this apparent close proximity to perfection is kind-of an illusion. Play go on bigger boards, and larger margins between humans and God are likely to become apparent. Similarly, measure induction using a more challenging corpus, and today’s programs will not appear to be so close to optimal.
The other thing to bear in mind is that intelligent agents are a major part of the environment of other intelligent agents. This means that it is not very realistic to model a fixed set of environmental problems (TSPs, etc), and to measure intelligence with respect to them. Rather there is an intelligence arms race—with many of the problems which intelligent agents face being posed by other agents.
We can see a related effect in mathematics. Many mathematicians work on the unresolved problems at the boundary of what is known in their field. The more progress they make, the harder the unresolved problems become, and the more intelligence is required to deal with them.
I value stream compression highly because it is the core technical problem of machine intelligence.
We’ll know what the core technical problem of machine intelligence is once we achieve machine intelligence. Achieve it, and then I’ll believe your claim that you know what’s involved in achieving it.
In general, there is no need to perform an engineering feat before you can claim to have understood what problem it involves. We understood the basics of flight before we could ourselves fly. That is also true for machine intelligence today—we have a general theory of intelligence, and can see what the technical side of the problem of building it consists of.
Induction power is the equivalent of lift. Not the only thing you need, but the most central and fundamental element, once you already have a universal computer and storage.
We understood the basics of flight before we could ourselves fly.
We did not understand in the sense of having a correct theory of fluid dynamics. We understood in the sense of having a working model, a paper plane, which actually flew.
Today it is the reverse. We have a theory, but we have no convincing paper brain. We have no working model.
That is also true for machine intelligence today—we have a general theory of intelligence, and can see what the technical side of the problem of building it consists of.
We have the reverse of what ancient China had. They had no theory but they had a paper plane. We have a theory but we have no paper brain.
In general, there is no need to perform an engineering feat before you can claim to have understood what problem it involves.
But with paper planes, we actually performed the essential engineering feat, in paper. The plane flew. But we have no paper brain that thinks.
Not obvious, perhaps, but surely pretty accurate.
Hutter: http://prize.hutter1.net/hfaq.htm#compai
Mahoney: http://cs.fit.edu/~mmahoney/compression/rationale.html
Tyler—Part 2 on: http://matchingpennies.com/machine_forecasting/
FWIW, a theory of everything is not required—induction is performed on sense-data, or preprocessed sense data.
You are overestimating how much we can do just by compression. The key issue is not just the ability to predict accurately but the ability to predict accurately when using limited resources. For example, let A(n) be the Ackermann function and let P(n) be the nth prime number. Then the sequence described by P(A(n)) mod 3 is really compressible. But the time and space resources needed to expand that compressed form is probably massive.
There’s a similar concern here. To again use the protein folding example, even if an AI has a really good model for predicting how proteins will fold if it takes a lot of time to run that model, then it doesn’t do a good job. Similarly, if a pattern rests on the behavior of prime numbers, then the smallest Turing machine which outputs 1 iff a number is prime is probably Euclid’s sieve. But the AKS algorithm which requires a much larger Turing machine will return the data in less time.
What you are saying is correct—but it seems kind-of obvious. Without resource limits, you could compress perfectly, using a variant of Solomonoff induction.
However, to get from there to me overestimating the value of compression, seems like quite a leap. I value stream compression highly because it is the core technical problem of machine intelligence.
I am not claiming that compression ratio is everything, while resource usage is irrelevant. Rather, sensible resource usage is an intrinsic property of a good quality compressor. If you think I am not giving appropriate weight to resource usage, that’s just a communications failure.
I do go on about resource considerations—for example here:
If you prefer a different framework think of it this way: compression matters, but for many interesting sequences how much practical compression you can with given time and space constraints depends on complexity theoretic questions.
Again, isn’t what you are saying simply obvious?
Where are you getting the impression that this is somehow different from what I thought in the first place?
Or maybe don’t answer that. I would rather you spent your time revisiting the idea that massive software optimization progress runs into strongly believed theoretical computer science limits.
Maybe eventually—but not for a long way yet, allowing for impressive progress in the interim on the central problem of machine intelligence: induction.
I got this from your remark that:
I’m not sure what aspect of this you want me to expand. If the complexity hierarchy doesn’t collapse to a moderate extent (in the sense that P, NP, co-NP, PSPACE, and EXP are all distinct) then many practical problems that come up cannot have that much improvement. Moreover, this can be made more rigorous with slightly stronger assumptions. Thus for example, the exponential time hypothesis is a (somewhat) explicit conjecture about the bounds on solving 3-SAT. If this conjecture is true, then this gives severe limits on how much one can improve both graph coloring, traveling salesman, and Steiner trees, among other practical NP -complete problems. Both these problems show up in real world context, in terms of things like hardware and memory design, while some of these come up directly in predicting the behavior of real world systems.
Moreover, one of the very fruitful set of results in the last few years (albeit an area I must say that I don’t have much knowledge about) has been in taking NP-hard problems and producing approximate solutions or asking how good an approximation one can get. Those results however, have shown that even some old approximation algorithms are asymptotically best possible if P != NP (and require no other assumptions to show that) which include the approximate forms of graph coloring and the set cover problem. Moreover, if one assumes slightly stronger assumptions than P != NP, one gets similar results for other NP problems. It is noteworthy that many of the approximation algorithms in question date back to the 1970s, which suggests that even when finding approximation algorithms one gets close to best possible very quickly.
Another class of problems to look at are problems which are just frequently used. Common examples are finding gcds, multiplying numbers, and the two closely related problems of multiplying matrices and finding determinants.
In the case of gcds, we have near optimal algorithms in strong senses (although it is open whether there’s some way of making gcd calculations parallel, most people suspect that there’s not much improvement possible here).
Multiplying numbers can be done more efficiently than naive multiplication (which is O((log n)^2) for 2 n digits numbers) by using a version of the fast Fourier transform. This may be an example where unexpected speed-ups really do exist in what may be long-standing basic processes. However, there are clear limits to how much this can be improved, since explicit bounds exist for minimum computation required by FFTs (and these bounds exist without explicit constants). While there are still open questions about how efficient one can make FFTs, even if one assumed the most generous possible marginally plausible results, FFTs would not become much more efficient.
In the case of finding determinants, we have good algorithms that are much more efficient than the naive algorithms. For an n x n it follows from not too much linear algebra that one can calculate determinants in O(n^3) arithmetic operations, and better algorithms exist that run around O(n^A) where A is a nasty constant slightly less than 2.4. There’s a clear limit here to how much this can be improved. One cannot go better than O(n^2) because an n x n matrix has n^2 entries. So this is close to the naive lower bound. Not much room for improvement there. The primary room for improvement is that this uses the Coppersmith–Winograd for matrix multiplication, which is the most efficient algorithm for matrix multiplication known, but it isn’t used in practice, because the constants involved make it less efficient than other algorithms in the practical range of matrices (I don’t know precisely where CW exceeds other algorithms I think it is around n = 10^9 or so.) so there’s room for improvement there, but not much room for improvement. And most people who have thought about these issues are confident that determinants cannot be calculated in O(n^2).
So the upshot is that while there’s still work to do, for a lot of these problems, it doesn’t look like there’s room for “massive software optimization”.
So, when you said: “complexity theoretic questions”, you were actually talking about the possible equivalence of different classes within the complexity hierarchy? That wasn’t obvious to me—I took what you wrote as referring to any old complexity theoretic questions. Such as what language to use when measuring K-complexity. The idea that K-complexity has something to do with compression ratios is what I was calling obvious. A communications breakdown.
Yes, with some apparent illusion of transparency too.
OK. FWIW, I don’t need examples of existing algorithms approaching theoretical limits.
As I said, the main problem where I think there is important scope for software improvement is induction. By my estimate, the brain spends about 80% of its resources on induction—so performance on induction seems important.
Current performance on the Hutter prize suggests that a perfect compressor could do about 25% better on the target file than the current champion program does.
So, perhaps it depends on how you measure it. If you think a mere 25% is not much improvement, you may not be too impressed. However, there are a few things to bear in mind:
Induction progress works in a similar manner to the rating scale in go: the higher you climb, the more difficult it is to make further progress.
There’s another similarity to go’s rating scale. In go, God is estimated to be 11-dan—not all that much better than a 9-dan human champion. However, this apparent close proximity to perfection is kind-of an illusion. Play go on bigger boards, and larger margins between humans and God are likely to become apparent. Similarly, measure induction using a more challenging corpus, and today’s programs will not appear to be so close to optimal.
The other thing to bear in mind is that intelligent agents are a major part of the environment of other intelligent agents. This means that it is not very realistic to model a fixed set of environmental problems (TSPs, etc), and to measure intelligence with respect to them. Rather there is an intelligence arms race—with many of the problems which intelligent agents face being posed by other agents.
We can see a related effect in mathematics. Many mathematicians work on the unresolved problems at the boundary of what is known in their field. The more progress they make, the harder the unresolved problems become, and the more intelligence is required to deal with them.
We’ll know what the core technical problem of machine intelligence is once we achieve machine intelligence. Achieve it, and then I’ll believe your claim that you know what’s involved in achieving it.
IMO, we can clearly see what it is now.
In general, there is no need to perform an engineering feat before you can claim to have understood what problem it involves. We understood the basics of flight before we could ourselves fly. That is also true for machine intelligence today—we have a general theory of intelligence, and can see what the technical side of the problem of building it consists of.
Induction power is the equivalent of lift. Not the only thing you need, but the most central and fundamental element, once you already have a universal computer and storage.
We did not understand in the sense of having a correct theory of fluid dynamics. We understood in the sense of having a working model, a paper plane, which actually flew.
Today it is the reverse. We have a theory, but we have no convincing paper brain. We have no working model.
We have the reverse of what ancient China had. They had no theory but they had a paper plane. We have a theory but we have no paper brain.
But with paper planes, we actually performed the essential engineering feat, in paper. The plane flew. But we have no paper brain that thinks.
IMO, we do have the machine-intelligence equivalent of paper planes.
They come in two main forms: forecasters and expert systems.
What we need is more heavy lifting.