No free lunch theorem is irrelevant
I often see people cite the “no free lunch theorem” as an argument against the possibility of AGI. I think this is a very bad argument. The theorem is formulated as a problem of predicting data that is random and uniformly distributed over all possible functions; just open Wikipedia. In my opinion, this is another example of people using a result from mathematics simply because it sounds relevant, without understanding its original meaning and limitations. What makes them think it applies to AGI?
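For reference, here is (roughly, from memory) the optimization version of the statement from Wolpert and Macready: average over every possible objective function on a finite domain, and all search algorithms produce the same distribution of results.

```latex
% Roughly the Wolpert--Macready NFL statement (my paraphrase, check the paper):
% for any two black-box search algorithms a_1 and a_2, any number of
% evaluations m, and any record of observed objective values d_m^y,
\sum_{f} P\!\left(d_m^y \mid f, m, a_1\right) \;=\; \sum_{f} P\!\left(d_m^y \mid f, m, a_2\right)
% where the sum runs over *all* functions f : X \to Y on a finite search space.
% The uniform weighting over all possible f is exactly the "random and
% uniformly distributed data" assumption, and it never holds for structured,
% real-world problems.
```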
The argument sounds something like: “well, since there is such a theorem, we will never be able to create an algorithm that can do many tasks.” This is complete nonsense. To tie the theorem to real systems at all, you have to add a caveat and impose some physical constraint on the model: number of parameters, disk space, total equipment weight, training time, total budget, whatever. In that case, well, yes: if we have a general algorithm that solves many problems, then for any single problem we can probably use the same resources to build an algorithm that performs better. The catch is that we don’t say “how much better”. Any improvement counts.
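Written out, the folk version I mean is something like this (my own formalization, not anything from Wolpert’s papers):

```latex
% Folk, resource-bounded "no free lunch" (my formalization, not Wolpert's):
% for every general algorithm A that fits in resource budget R and every task T,
% there exists a T-specialized algorithm A_T that also fits in R with
\operatorname{perf}(A_T, T) \;>\; \operatorname{perf}(A, T)
% Note that this is only a strict inequality: it says nothing about how large
% the gap is, and any improvement, however tiny, satisfies it.
```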
That is not the original theorem, but I think it is close to what people have in mind when they say “no free lunch theorem”.
If our general algorithm looks “kind of meh” compared to specialized algorithms, what prevents us from increasing the resource budget and collecting all of our specialized algorithms into one big one (see the sketch below)? Yes, the new combined algorithm can again be outdone by specialized versions of itself, but sooner or later we hit diminishing returns: you cannot do arithmetic more accurately than 100% of the time, you cannot predict the structure of a protein more accurately than how it actually folds, and so on. By increasing the available resources, distilling, and amplifying, we can make a general-purpose algorithm as good as we like on any task, as close to the maximum of the metric as we need.
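Here is what I mean by “collecting the specialists into one big algorithm”, as a toy Python sketch (the solver names and the router are made up for illustration):

```python
# Toy sketch: a "general" solver that is just a bundle of specialists plus a router.
# All names here (solve_arithmetic, solve_sorting, ...) are hypothetical placeholders.

def solve_arithmetic(task):
    # A specialist that is already at its ceiling: exact arithmetic.
    return eval(task, {"__builtins__": {}})  # toy only; don't eval untrusted input

def solve_sorting(task):
    # Another specialist at its ceiling: a correctly sorted list can't be "more sorted".
    return sorted(task)

SPECIALISTS = {
    "arithmetic": solve_arithmetic,
    "sorting": solve_sorting,
}

def general_solver(task_type, task):
    # The "general" algorithm: route to the right specialist.
    # Its only extra cost over a single specialist is the routing step and the
    # memory to hold all of them, i.e. more resources, not worse ceilings.
    return SPECIALISTS[task_type](task)

print(general_solver("arithmetic", "2 + 2"))   # 4
print(general_solver("sorting", [3, 1, 2]))    # [1, 2, 3]
```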
See also:
https://arbital.com/p/nofreelunch_irrelevant/
Consider an algorithm specialized to chess in particular, as compared to a more general algorithm that plays games in general.
The chess specific algorithm can have a large advantage, for the same amount of compute, by containing precomputed data specific to chess, for example a table of good opening moves.
All that precomputation must have happened somewhere, possibly in human brains. So there exists a general algorithm that could be told the rules of chess, do a large amount of precomputation, and then will play equally good chess for the same resources as the specialized chess algorithm.
No algorithm can contain precomputed values for all possible games, as there are infinitely many / exponentially many of those.
I don’t see how this contradicts anything. As soon as you loosen some of the physical constraints, you can start piling up precomputation, memory, budget, volume, whatever. If you spend all of it on one task, then, well, you should get higher performance than any approach that doesn’t focus on that one thing. Or you can build an algorithm that outperforms anything you’ve made before, given enough of any unconstrained resource.
Precompute is just another resource
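A toy sketch of the chess example, just to show that the only difference is where the precomputation happens (all function names here are hypothetical):

```python
# Toy sketch of "precompute is just another resource" (all names hypothetical).

# The specialized player ships with a precomputed opening book.
CHESS_OPENING_BOOK = {"start": "e4"}  # imagine a large precomputed table here

def specialized_chess_move(position):
    # Fast at runtime because the work was done ahead of time, offline.
    return CHESS_OPENING_BOOK.get(position, "search from scratch")

# The general player is given the rules and a precompute budget,
# and builds the same kind of book before play begins.
def build_opening_book(rules, precompute_budget):
    # Stand-in for "a large amount of precomputation, possibly in human brains".
    return {"start": "e4"}

def general_game_move(position, book):
    return book.get(position, "search from scratch")

book = build_opening_book(rules="chess", precompute_budget=10**9)
assert specialized_chess_move("start") == general_game_move("start", book)
```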
Reading Wolpert’s paper (the no-free-lunch theorem for optimization; Sci-Hub is your friend), I got the impression that he might himself be in large part responsible for people misunderstanding the implications of his theorem, because he does not understand them himself:
Wolpert does not actually justify this assumption in his paper, but here are some of the interesting results he gets from it:
I think that in the real world you can use induction. I am going to stick my neck out and claim that Wolpert uses induction himself.
For an in-depth critique of the weird assumptions in the paper, see here.
I think it’s tangentially relevant in certain cases. Here’s something I wrote in another context, where I think it’s perhaps useful to understand what we really mean when we say “intelligence.”
Agree it’s mostly not relevant.
I think I disagree that generality is irrelevant; it only seems that way because NFL theorems use unreasonable priors. If your problem has any structure at all, i.e. your environment is not maximally random, then you can use Occam’s razor and make sense of your environment. No need to appeal to the “real world”. The paper on universal intelligence is great, by the way, if formalizing intelligence sounds interesting.
To spell out how this applies to your comment: if you use a reasonable prior, then intelligence is well-defined, and some things are smarter than others. For example, GPT-3 is smarter than GPT-2.
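For reference, the measure from that universal intelligence paper (Legg and Hutter), written from memory, which is exactly “performance weighted by a simplicity prior instead of a uniform one”:

```latex
% Legg--Hutter universal intelligence of an agent \pi (from memory, check the paper):
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V_\mu^\pi
% where E is the set of computable environments, K(\mu) is the Kolmogorov
% complexity of environment \mu (so 2^{-K(\mu)} is an Occam prior), and
% V_\mu^\pi is the expected reward of agent \pi in environment \mu.
% With this non-uniform prior the NFL averaging trick no longer applies,
% so "smarter than" is well-defined.
```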