As far as #1 goes, it’s worth pointing out that we do not use Lenat’s systems like Cyc. So one answer to “if it works pretty well, why don’t we do that everywhere” may simply be “but it doesn’t work pretty well”. (Like Rodney Brooks, Minsky was never a big fan of Bayesian or connectionist approaches, perhaps because both are so depressingly, infeasibly expensive to compute, and he was always looking for shortcuts or loopholes.)
A more serious answer is: sorting (ranking) is a pervasive trick throughout statistics and decision theory. Many nonparametric methods involve ranking, and the improper linear model literature and things like 0/1-discretized variables can work well and even outperform humans. I recall Chris Stucchio’s somewhat notorious “Why a pro/con list is 75% as good as your fancy machine learning algorithm” post in this vein too. Lenat’s list trick there sounds a lot like the “play-the-winner” adaptive trial/bandit algorithm, where if a treatment works, you use it again on the next iteration, and otherwise switch to a random other one; this is a great deal simpler than Thompson sampling or other adaptive algorithms. It can also be seen as a policy gradient with very large fixed-size updates.
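For concreteness, here is a minimal sketch of the play-the-winner rule as described above, framed as a Bernoulli bandit (the arm success rates and horizon are made up for illustration, not from anything Lenat did):

```python
import random

def play_the_winner(arms, n_rounds, seed=0):
    """Play-the-winner: stay on the current arm after a success,
    switch to a uniformly random *other* arm after a failure."""
    rng = random.Random(seed)
    current = rng.randrange(len(arms))
    total = 0
    for _ in range(n_rounds):
        success = rng.random() < arms[current]
        total += success
        if not success:  # loser: pick a random different option
            current = rng.choice([a for a in range(len(arms)) if a != current])
    return total

# Two hypothetical treatments with success rates 60% and 40%:
print(play_the_winner([0.6, 0.4], n_rounds=10_000))
```

Note there is no accumulating state at all: the rule remembers only which arm it is currently on, which is both its appeal and, as discussed next, its downfall.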
The drawback is, as Stucchio’s title indicates, that the methods may work, but you are potentially giving up a lot of performance. In order statistics or nonparametric statistics, ranking gets you robustness to outliers because you throw away all the information aside from the ordering; but that means that if you can model the data more sensibly, you can estimate everything more efficiently than the nonparametric rank-based approaches (you could look at estimates of the cost of a Mann-Whitney/Wilcoxon U-test versus a t-test on normally-distributed data to quantify this: the asymptotic relative efficiency there is 3⁄π ≈ 0.95, and estimating a median/mean like that is something of a best case for rank methods, with the comparisons getting worse from there). In Stucchio’s checklist example, 25% of the total is a lot of utility to leave on the table! Play-the-winner is simple, but that is its only virtue, and there are many analyses to the effect that it’s a pretty bad adaptive algorithm compared to Waldian sequential testing or Thompson sampling or best-arm finding. One way to put it is that from a bandit perspective, the regret is much worse than the usual log(t) regret that principled methods enjoy. (I think the regret would be linear? No matter how much data you accumulate, a single bad round might send the optimal choice back to the end of the list.) A policy gradient method which never decays its step sizes will never converge, and will just oscillate eternally. Then you enjoy additional problems like being unable to calculate Value of Information or do exploration, because you don’t have anything remotely like a standard error, much less a posterior… And how are you going to think about expected value and risk for planning if all you have is a sorted list of options?
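To illustrate the regret gap, here is a toy simulation of my own (not drawn from any of the analyses alluded to above; the 0.6/0.5 success rates and horizons are arbitrary) comparing play-the-winner against Thompson sampling with Beta(1,1) priors. Play-the-winner settles into a fixed mixture over the arms, so its regret should grow roughly linearly with the horizon, while Thompson sampling’s grows roughly logarithmically:

```python
import random

def expected_regret(policy, arms, n_rounds, seed=0):
    """Sum of per-round expected regret: best arm's mean minus pulled arm's mean."""
    rng = random.Random(seed)
    best = max(arms)
    return sum(best - arms[a] for a in policy(arms, n_rounds, rng))

def play_the_winner(arms, n_rounds, rng):
    current = rng.randrange(len(arms))
    for _ in range(n_rounds):
        yield current
        if rng.random() >= arms[current]:  # failure: switch to a random other arm
            current = rng.choice([a for a in range(len(arms)) if a != current])

def thompson(arms, n_rounds, rng):
    # Beta posterior per arm, starting from a flat Beta(1,1) prior.
    alpha = [1] * len(arms)
    beta = [1] * len(arms)
    for _ in range(n_rounds):
        draws = [rng.betavariate(alpha[a], beta[a]) for a in range(len(arms))]
        a = max(range(len(arms)), key=draws.__getitem__)
        yield a
        if rng.random() < arms[a]:
            alpha[a] += 1
        else:
            beta[a] += 1

for n in (1_000, 10_000, 100_000):
    ptw = expected_regret(play_the_winner, [0.6, 0.5], n)
    ts = expected_regret(thompson, [0.6, 0.5], n)
    print(f"n={n:>6}  play-the-winner regret={ptw:8.1f}  Thompson regret={ts:8.1f}")
```

Scaling the horizon by 10× roughly multiplies play-the-winner’s regret by 10× (it keeps bouncing off the good arm after every unlucky round), while Thompson sampling’s regret barely moves, since its posterior makes switching away from the good arm ever rarer.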
Cox is not mocked. If you don’t update in a Bayesian fashion, it’s going to cost you somehow.