I listened to a few Marvin Minsky lectures a few weeks ago. Now I’m trying to go back and find more information on two things he discussed in Lecture 3: Cognitive Architectures, given Fall 2011. Sorry to offload this on people here, but I have very little idea how to search for more information (I tried a few things on Google and Google Scholar without any luck).
Here’s quote #1 from 1:08:21:
“I think I mentioned Doug Lenat’s rule. Some people will assign probabilities to things, to behaviors, and then pick the way to react in proportion to the probability that that thing has worked in the past. And Doug Lenat thought of doing that, but instead, he just put the things in a list. And whenever a hypothesis worked better than another one, he would raise it, push it toward the front of the list. And then whenever there was a choice, he would pick—of all the rules that fit, he would pick the one at the top of the list. And if that didn’t work, it would get demoted. So that’s when I became an anti-probability person. That is, if just sorting the things on a list worked pretty well, are probabilities going to do much better?”
This sounds fascinating. It sounds much more computationally feasible than keeping track of probabilities and Bayesian updating. Does anyone have references for further work along these lines? I.e., specifically on keeping a small list of hypotheses, ranked, rather than computing probabilities over a large number of hypotheses? [By the way, Minsky did mention Lenat in Lecture 2, but it was very brief and didn’t contain any useful information.] [Trying to search through Douglas Lenat’s work is quite a headache because he’s got decades of publications, most on esoteric rule-based systems.]
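In case it helps clarify what I'm asking about, here's my rough reading of the mechanism as code. The one-step promote/demote sizes and the rule interface (a `fits` predicate) are my own guesses, not anything taken from Lenat:

```python
class RuleList:
    """Sketch of the list mechanism as I understand it from the quote:
    rules live in an ordered list; of all the rules that fit, pick the one
    nearest the top; promote it when it works, demote it when it doesn't."""

    def __init__(self, rules):
        self.rules = list(rules)  # index 0 = front of the list

    def choose(self, situation):
        # Pick the first (highest-ranked) rule that fits the situation.
        for i, rule in enumerate(self.rules):
            if rule.fits(situation):  # assumed interface: rule has a .fits() predicate
                return i, rule
        return None, None

    def feedback(self, i, worked):
        # Move the rule one slot toward the front on success, one slot back on failure.
        rule = self.rules.pop(i)
        new_i = max(0, i - 1) if worked else min(len(self.rules), i + 1)
        self.rules.insert(new_i, rule)
```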
Here’s quote #2 from 1:09:37: (lightly edited for readability)
“Ray Solomonoff discovered that if you have a set of probabilities that something will work, and you have no memory… I think I mentioned that the other day, but it’s worth emphasizing, because nobody in the world seems to know it. Suppose you have a list of things, p equals this, or that, or that. In other words, suppose there’s 100 boxes here, and one of them has a gold brick in it, and the others don’t. And so for each box, suppose the probability is 0.9 that this one has the gold brick, and this one has 0.01. And this has 0.01. Let’s see, how many of them—so there’s 10 of these. Now, what should you do? Suppose you’re allowed to keep choosing a box, and you want to get your gold brick as soon as possible. What’s the smart thing to do? … you have no memory. Maybe the gold brick is decreasing in value, I don’t care. So should you keep trying 0.9 if you have no memory? Of course not. Because if you don’t get it the first time, you’ll never get it. Whereas if you tried them at random each time, then you’d have 0.9 chance of getting it, so in two trials, you’d have—what am I saying? In 100 trials, you’re pretty sure to get it, but in [? e-hundred ?] trials, almost certain. So if you don’t have any memory, then probability matching is not a good idea. Certainly, picking the highest probability is not a good idea, because if you don’t get it the first trial, you’ll never get it. If you keep using the probabilities at—what am I saying? Anyway, what do you think is the best thing to do? It’s to take the square roots of those probabilities, and then divide them by the sum of the square roots so it adds up to 1. So a lot of psychologists design experiments until they get the rat to match the probability. And then they publish it. … but if the animal is optimal and doesn’t have much memory, then it shouldn’t match the probability of the unknown. It should—end of story. Every now and then, I search every few years to see if anybody has noticed this thing—and I’ve never found it on the web.”
It’s definitely not obvious to me why the square root of the probabilities is the optimal choice here. I’m curious to know more about what Solomonoff was up to here. I tried searching for it, and can’t find it.
[Question] Does anyone know what Marvin Minsky is talking about here?
As far as #1 goes, it’s worth pointing out that we do not use Lenat’s systems like Cyc. So one answer to “if it works pretty well, why don’t we do that everywhere” may simply be “but it doesn’t work pretty well”. (Like Rodney Brooks, Minsky was never a big fan of Bayesian or connectionist approaches—perhaps because both are so depressingly, infeasibly computationally expensive—and he was always looking for shortcuts or loopholes.)
A more serious answer is: sorting (ranking) is a pervasive trick throughout statistics and decision theory. Many nonparametric methods involve ranking, and the improper-linear-model literature and things like 0/1-discretized variables can work well and even outperform humans. I recall Chris Stucchio’s somewhat notorious “Why a pro/con list is 75% as good as your fancy machine learning algorithm” post in this vein too. Lenat’s list trick there sounds a lot like the “play-the-winner” adaptive trial/bandit algorithm, where if a treatment works, you use it on the next iteration, else switch to a random other one, which is a great deal simpler than Thompson sampling or other adaptive algorithms. It can also be seen as a policy gradient with very large fixed-size updates.
The drawback is, as Stucchio’s title indicates, that the methods may work, but you are potentially giving up a lot of performance. In order statistics or nonparametric statistics, ranking buys you robustness to outliers because you throw away all the information aside from the ordering; but that means that if you can model the data more sensibly, you can estimate everything more efficiently than the nonparametric rank-based approaches (you could look at estimates of the efficiency cost of a Wilcoxon–Mann–Whitney U-test vs a t-test on normally-distributed data to quantify this, as estimating the median/mean is something of a best case, and the comparisons get worse from there). In Stucchio’s checklist example, 25% of the total is a lot of utility to leave on the table! Play-the-winner is simple, but that is its only virtue, and there are a lot of analyses to the effect that it’s a pretty bad adaptive algorithm compared to Waldian sequential testing or Thompson sampling or best-arm finding. One way to put it is that, from a bandit perspective, the regret is much worse than the usual log(t) regret that principled methods enjoy. (I think the regret would be linear?) No matter how much data you accumulate, a bad round might send the optimal choice back to the end of the list. A policy gradient method which never decays its step sizes will never converge, and will just oscillate eternally. Then you enjoy additional problems like being unable to calculate Value of Information or do exploration, because you don’t have anything remotely like a standard error, much less a posterior… And how are you going to think about expected value and risk for planning if all you have is a sorted list of options?
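To make the regret comparison concrete, here is a quick Bernoulli-bandit simulation under the simplest reading of play-the-winner; the arm payoff rates, horizon, and Beta(1,1) priors are all made up for illustration, and I haven't tried to tune anything:

```python
import random

def play_the_winner(probs, horizon, rng):
    """Stay on an arm while it pays off; on a failure, switch to a random other arm."""
    k = len(probs)
    arm = rng.randrange(k)
    total = 0.0
    for _ in range(horizon):
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        total += reward
        if reward == 0.0:
            arm = rng.choice([a for a in range(k) if a != arm])
    return total

def thompson(probs, horizon, rng):
    """Beta-Bernoulli Thompson sampling: sample each arm's posterior, play the argmax."""
    k = len(probs)
    wins, losses = [1] * k, [1] * k  # Beta(1,1) priors
    total = 0.0
    for _ in range(horizon):
        samples = [rng.betavariate(wins[a], losses[a]) for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        total += reward
        if reward:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return total

if __name__ == "__main__":
    probs = [0.6, 0.5, 0.4]          # made-up arm payoff rates
    horizon, runs = 5000, 20
    rng = random.Random(0)
    best = max(probs) * horizon
    for name, algo in [("play-the-winner", play_the_winner), ("thompson", thompson)]:
        regret = sum(best - algo(probs, horizon, rng) for _ in range(runs)) / runs
        print(f"{name}: mean regret ~ {regret:.0f}")
```

Play-the-winner keeps pulling the worse arms a constant fraction of the time no matter how long it runs, which is where the linear-in-t regret comes from; Thompson sampling concentrates on the best arm as evidence accumulates.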
Cox is not mocked. If you don’t update in a Bayesian fashion, it’s going to cost you somehow.
For your second question, this paper describes the square-root law, though in a somewhat different setting: “Strong profiling is not mathematically optimal for discovering rare malfeasors” (PNAS). (Incidentally, a friend of mine used this once in an argument against stop-and-frisk.)
It doesn’t give a complete proof, though it describes it as a “straightforward minimization with a Lagrange multiplier”.
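If it helps, here is my guess at how that minimization goes (my notation, not the paper’s). Treat it as memoryless sampling: box i holds the brick with probability p_i, and on every trial you open box i with probability q_i.

```latex
% Trials until you first open box i are geometric with mean 1/q_i, so
%   E[T] = \sum_i p_i / q_i   subject to   \sum_i q_i = 1.
\mathcal{L}(q,\lambda) = \sum_i \frac{p_i}{q_i} + \lambda\Big(\sum_i q_i - 1\Big),
\qquad
\frac{\partial \mathcal{L}}{\partial q_i} = -\frac{p_i}{q_i^{2}} + \lambda = 0
\;\Longrightarrow\;
q_i \propto \sqrt{p_i}
\;\Longrightarrow\;
q_i = \frac{\sqrt{p_i}}{\sum_j \sqrt{p_j}},
\qquad
E[T]_{\min} = \Big(\sum_i \sqrt{p_i}\Big)^{2}.
```

That matches the rule in the quote: sample each box in proportion to the square root of its probability. (By Cauchy–Schwarz, (Σ_i √p_i)² ≤ n, so this never does worse than uniform or probability-matched sampling.)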
Very cool, will take a look. This basically solves question #2. It seems the original Solomonoff work isn’t published anywhere. By the way, the author, William H. Press, is a real polymath! I am curious if there is any extension of this work to agents with finite memory. As an example: the same situation where you’re screening a large number of people, but now you have a memory where you can store N results of prior screenings for reference. I’m going to look into it.
Seems like a memory version would be identical, just with a smaller n after subtracting the individuals you screen. Once you fill up your memory with cleared individuals, why would you ever want to ‘forget’ them? By stipulation, you learn nothing about other individuals or the population, only about the ones you look at. If you forget them to replace them with a new memory, that de facto makes the n bigger and worsens your odds: you’ve flushed back into the pool the only individuals you knew for sure you never want to sample again (because they are clear), so now you may waste a sample testing them again while gaining nothing. And once you remove them from the population via your memory, you’re back to the solved memoryless problem and have to square-root it.
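Concretely, a sketch of that argument (the function name and interface are mine): strike the boxes you remember clearing, condition on them being empty, and square-root what’s left.

```python
import math

def memory_square_root_sampler(p, cleared=frozenset()):
    """With memory, drop the boxes already cleared, then play the memoryless
    square-root rule on the rest. Conditioning on the cleared boxes being
    empty only renormalizes the remaining p_i, and sqrt(c * p_i) is
    proportional to sqrt(p_i), so the constant drops out."""
    remaining = [i for i in range(len(p)) if i not in cleared]
    roots = {i: math.sqrt(p[i]) for i in remaining}
    z = sum(roots.values())
    return {i: r / z for i, r in roots.items()}

# e.g. memory_square_root_sampler([0.9] + [0.01] * 10, cleared={0})
# spreads the per-trial sampling evenly over the ten remaining boxes.
```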
With respect to the second question, the answer will depend on the discount rate. I expect Solomonoff is assuming that we are in the limit of a low discount rate, where exponential decay will look linear, so essentially you are minimizing the expected total number of attempts.
I haven’t done the math to confirm Solomonoff’s answer, but if you were to go to each box with probability equal to its probability of being correct, then your expected number of attempts would be equal to the number of boxes, since each box would have an expected number of attempts, conditional on it being the right box, equal to the inverse of its probability. So this is no better than choosing randomly. With this in mind it seems intuitive that some intermediate strategy, such as square roots, would then be better.
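A quick numerical check of that (and of the square-root rule) on the numbers from the lecture (one 0.9 box and ten 0.01 boxes), using the expected-trials formula E[T] = Σ_i p_i/q_i for a memoryless sampler q:

```python
import math

p = [0.9] + [0.01] * 10              # Minsky's example: one 0.9 box, ten 0.01 boxes

def expected_trials(p, q):
    """E[T] = sum_i p_i / q_i for memoryless sampling with per-trial distribution q."""
    return sum(pi / qi for pi, qi in zip(p, q))

uniform   = [1 / len(p)] * len(p)
matching  = p[:]                                # probability matching
roots     = [math.sqrt(pi) for pi in p]
sqrt_rule = [r / sum(roots) for r in roots]     # square-root rule

for name, q in [("uniform", uniform), ("matching", matching), ("sqrt", sqrt_rule)]:
    print(f"{name:9s} E[T] = {expected_trials(p, q):.2f}")

# Uniform and probability matching both give 11 expected trials here;
# the square-root rule gives (sum_i sqrt(p_i))^2 ≈ 3.8.
```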
https://www.reddit.com/r/compsci/comments/7yd765/reference_request_on_a_result_by_ray_solomonoff/