The Weighted Majority Algorithm
Followup to: Worse Than Random, Trust In Bayes
In the wider field of Artificial Intelligence, it is not universally agreed and acknowledged that noise hath no power. Indeed, the conventional view in machine learning is that randomized algorithms sometimes perform better than unrandomized counterparts and there is nothing peculiar about this. Thus, reading an ordinary paper in the AI literature, you may suddenly run across a remark: “There is also an improved version of this algorithm, which takes advantage of randomization...”
Now for myself I will be instantly suspicious; I shall inspect the math for reasons why the unrandomized algorithm is being somewhere stupid, or why the randomized algorithm has a hidden disadvantage. I will look for something peculiar enough to explain the peculiar circumstance of a randomized algorithm somehow doing better.
I am not completely alone in this view. E. T. Jaynes, I found, was of the same mind: “It appears to be a quite general principle that, whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought.” Apparently there’s now a small cottage industry in derandomizing algorithms. But so far as I know, it is not yet the majority, mainstream view that “we can improve this algorithm by randomizing it” is an extremely suspicious thing to say.
Let us now consider a specific example—a mainstream AI algorithm where there is, apparently, a mathematical proof that the randomized version performs better. By showing how subtle the gotcha can be, I hope to convince you that, even if you run across a case where the randomized algorithm is widely believed to perform better, and you can’t find the gotcha yourself, you should nonetheless trust that there’s a gotcha to be found.
For our particular example we shall examine the weighted majority algorithm, first introduced by Littlestone and Warmuth, but as a substantially more readable online reference I will use section 2 of Blum’s On-Line Algorithms in Machine Learning. (Blum has an easier-to-read, more concrete version of the proofs; and the unrandomized and randomized algorithms are presented side-by-side rather than far apart.)
The weighted majority algorithm is an ensemble method—a way to combine the advice from several other algorithms or hypotheses, called “experts”. Experts strive to classify a set of environmental instances into “positive” and “negative” categories; each expert produces a prediction for each instance. For example, each expert might look at a photo of a forest, and try to determine whether or not there is a camouflaged tank in the forest. Expert predictions are binary, either “positive” or “negative”, with no probability attached. We do not assume that any of our experts is perfect, and we do not know which of our experts is the best. We also do not assume anything about the samples (they may not be independent and identically distributed).
The weighted majority algorithm initially assigns an equal weight of 1 to all experts. On each round, we ask all the experts for their predictions, and sum up the weights for each of the two possible predictions, “positive” or “negative”. We output the prediction that has the higher weight. Then, when we see the actual result, we multiply by 1⁄2 the weight of every expert that got the prediction wrong.
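For concreteness, here is a minimal Python sketch of the procedure just described (the function and variable names are mine, not Blum’s or Littlestone and Warmuth’s):

```python
# Minimal sketch of the deterministic weighted majority algorithm described above.
def weighted_majority(expert_predictions, outcomes):
    """expert_predictions: list of rounds, each a list of n binary predictions (0/1).
    outcomes: the actual label (0/1) for each round.
    Returns the number of mistakes made by the combined predictor."""
    n = len(expert_predictions[0])
    weights = [1.0] * n          # every expert starts with weight 1
    mistakes = 0
    for preds, actual in zip(expert_predictions, outcomes):
        # Sum the weight behind each of the two possible predictions.
        weight_for_1 = sum(w for w, p in zip(weights, preds) if p == 1)
        weight_for_0 = sum(weights) - weight_for_1
        guess = 1 if weight_for_1 > weight_for_0 else 0
        if guess != actual:
            mistakes += 1
        # Halve the weight of every expert that got it wrong, regardless of our guess.
        weights = [w * 0.5 if p != actual else w for w, p in zip(weights, preds)]
    return mistakes
```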
Suppose the total number of experts is n, and the best expert makes no more than m mistakes over some given sampling sequence. Then we can prove that the weighted majority algorithm makes a total number of mistakes M that is bounded by 2.41*(m + log2(n)). In other words, the number of mistakes the weighted majority algorithm makes is at most the best expert’s mistakes plus log2(n), all times a constant factor of 2.41.
Proof (taken from Blum 1996; a similar proof appears in Littlestone and Warmuth 1989): The combined weight of all the experts at the start of the problem is W = n. If the weighted majority algorithm makes a mistake, at least half the total weight of experts predicted incorrectly, so the total weight is reduced by at least 1⁄4 (that is, multiplied by at most 3/4). Thus, after the weighted majority algorithm makes M mistakes, its total weight W has been reduced to at most (3/4)^M of its starting value. In other words:
W <= n*( (3/4)^M )
But if the best expert has made at most m mistakes, its weight is at least (1/2)^m. And since W includes the weight of all experts,
W >= (1/2)^m
Therefore:
(1/2)^m <= W <= n*( (3/4)^M )
(1/2)^m <= n*( (3/4)^M )
-m <= log2(n) + M*log2(3/4)
-m <= log2(n) + M*(-log2(4/3))
M*log2(4/3) <= m + log2(n)
M <= (1/log2(4/3))*(m + log2(n))
M <= 2.41*(m + log2(n))
Blum then says that “we can achieve a better bound than that described above”, by randomizing the algorithm to predict “positive” or “negative” with probability proportional to the weight assigned each prediction. Thus, if 3⁄4 of the expert weight went to “positive”, we would predict “positive” with probability 75%, and “negative” with probability 25%.
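In code, only the prediction step changes; the weight updates are identical. A sketch along the same lines as the one above:

```python
import random

# Sketch of the randomized variant: the weight updates are identical to the
# deterministic version; only the prediction rule changes.
def randomized_weighted_majority(expert_predictions, outcomes, rng=None):
    rng = rng or random.Random()
    n = len(expert_predictions[0])
    weights = [1.0] * n
    mistakes = 0
    for preds, actual in zip(expert_predictions, outcomes):
        weight_for_1 = sum(w for w, p in zip(weights, preds) if p == 1)
        # Predict 1 with probability proportional to the weight voting for 1.
        guess = 1 if rng.random() < weight_for_1 / sum(weights) else 0
        if guess != actual:
            mistakes += 1
        weights = [w * 0.5 if p != actual else w for w, p in zip(weights, preds)]
    return mistakes
```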
An essentially similar proof, summing over the expected probability of a mistake on each round, will show that in this case:
M <= 1.39m + 2 ln(n) (note: M is now an expectation)
Since the penalty applied to particular experts does not depend on the global prediction but only on the actual outcome, most of the proof proceeds as before. We have
W >= (1/2)^m
where again m is the number of mistakes made by the best expert and W is the total weight of all experts.
We also have that W is the starting weight n times the product of (1 − 1⁄2 F_i), where F_i is the fraction of mistaken expert weight on round i:
W = n * Prod_i (1 − 1⁄2 F_i)
And since we predict with probability proportional to the expert weight, the expected number of mistakes is just the sum over F_i:
M = Sum_i F_i
So:
(1/2)^m <= n * Prod_i (1 − 1⁄2 F_i)
-m*ln(2) <= ln(n) + Sum_i ln(1 − 1⁄2 F_i)
Sum_i -ln(1 − 1⁄2 F_i) <= ln(n) + m*ln(2)
Sum_i (1/2 F_i) <= ln(n) + m*ln(2) (because x <= -ln(1 − x) )
Sum_i F_i <= 2*(ln(n) + m*ln(2))
M <= 1.39m + 2 ln(n)
Behold, we have done better by randomizing!
We should be especially suspicious that the randomized algorithm guesses with probability proportional to the expert weight assigned. This seems strongly reminiscent of betting with 70% probability on blue, when the environment is a random mix of 70% blue and 30% red cards. We know the best bet—and yet we only sometimes make this best bet, at other times betting on a condition we believe to be less probable.
Yet we thereby prove a smaller upper bound on the expected error. Is there an algebraic error in the second proof? Are we extracting useful work from a noise source? Is our knowledge harming us so much that we can do better through ignorance?
Maybe the unrandomized algorithm fails to take into account the Bayesian value of information, and hence fails to exhibit exploratory behavior? Maybe it doesn’t test unusual cases to see if they might have surprising results?
But, examining the assumptions, we see that the feedback we receive is fixed, regardless of the prediction we output. Nor does the updating of the expert weights depend on the predictions we output. It doesn’t matter whether we substitute a completely random string of 1s and 0s as our actual output. We get back exactly the same data from the environment, and the weighted majority algorithm updates the expert weights in exactly the same way. So that can’t be the explanation.
Are the component experts doing worse than random, so that by randomizing our predictions, we can creep back up toward maximum entropy? But we didn’t assume anything about how often the component experts were right, or use the fact in the proofs. Therefore the proofs would carry even if we specified that all experts were right at least 60% of the time. It’s hard to see how randomizing our predictions could help, in that case—but the proofs still go through, unchanged.
So where’s the gotcha?
Maybe I’m about to tell you that I looked, and I couldn’t find the gotcha either. What would you believe, in that case? Would you think that the whole thesis on the futility of randomness was probably just wrong—a reasonable-sounding philosophical argument that simply wasn’t correct in practice? Would you despair of being able to follow the math, and give up and shrug, unable to decide who might be right?
We don’t always see the flaws right away, and that’s something to always remember.
In any case, I’m about to start explaining the gotcha. If you want to go back, read the paper, and look yourself, you should stop reading now...
There does exist a rare class of occasions where we want a source of “true” randomness, such as a quantum measurement device. For example, you are playing rock-paper-scissors against an opponent who is smarter than you are, and who knows exactly how you will be making your choices. In this condition it is wise to choose randomly, because any method your opponent can predict will do worse-than-average. Assuming that the enemy knows your source code has strange consequences: the action sequence 1, 2, 1, 3 is good when derived from a quantum noise generator, but bad when derived from any deterministic algorithm, even though it is the same action sequence. Still it is not a totally unrealistic situation. In real life, it is, in fact, a bad idea to play rock-paper-scissors using an algorithm your opponent can predict. So are we, in this situation, deriving optimization from noise?
The random rock-paper-scissors player does not play cleverly, racking up lots of points. It does not win more than 1⁄3 of the time, or lose less than 1⁄3 of the time (on average). The randomized player does better because its alternatives perform poorly, not from being smart itself. Moreover, by assumption, the opponent is an intelligence whom we cannot outsmart and who always knows everything about any method we use. There is no move we can make that does not have a possible countermove. By assumption, our own intelligence is entirely useless. The best we can do is to avoid all regularity that the enemy can exploit. In other words, we do best by minimizing the effectiveness of intelligence within the system-as-a-whole, because only the enemy’s intelligence is allowed to be effective. If we can’t be clever ourselves, we might as well think of ourselves as the environment and the enemy as the sole intelligence within that environment. By becoming the maximum-entropy environment for rock-paper-scissors, we render all intelligence useless, and do (by assumption) the best we can do.
When the environment is adversarial, smarter than you are, and informed about your methods, then in a theoretical sense it may be wise to have a quantum noise source handy. (In a practical sense you’re just screwed.) Again, this is not because we’re extracting useful work from a noise source; it’s because the most powerful intelligence in the system is adversarial, and we’re minimizing the power that intelligence can exert in the system-as-a-whole. We don’t do better-than-average, we merely minimize the extent to which the adversarial intelligence produces an outcome that is worse-than-average (from our perspective).
Similarly, cryptographers have a legitimate interest in strong randomness generators because cryptographers are trying to minimize the effectiveness of an intelligent adversary. Certainly entropy can act as an antidote to intelligence.
Now back to the weighted majority algorithm. Blum (1996) remarks:
“Intuitively, the advantage of the randomized approach is that it dilutes the worst case. Previously, the worst case was that slightly more than half of the total weight predicted incorrectly, causing the algorithm to make a mistake and yet only reduce the total weight by 1⁄4. Now there is roughly a 50⁄50 chance that the [randomized] algorithm will predict correctly in this case, and more generally, the probability that the algorithm makes a mistake is tied to the amount that the weight is reduced.”
From the start, we did our analysis for an upper bound on the number of mistakes made. A global upper bound is no better than the worst individual case; thus, to set a global upper bound we must bound the worst individual case. In the worst case, our environment behaves like an adversarial superintelligence.
Indeed randomness can improve the worst-case scenario, if the worst-case environment is allowed to exploit “deterministic” moves but not “random” ones. It is like an environment that can decide to produce a red card whenever you bet on blue—unless you make the same bet using a “random” number generator instead of your creative intelligence.
Suppose we use a quantum measurement device to produce a string of random ones and zeroes; make two copies of this string; use one copy for the weighted majority algorithm’s random number generator; and give another copy to an intelligent adversary who picks our samples. In other words, we let the weighted majority algorithm make exactly the same randomized predictions, produced by exactly the same generator, but we also let the “worst case” environment know what these randomized predictions will be. Then the improved upper bound of the randomized version is mathematically invalidated.
This shows that the improved upper bound proven for the randomized algorithm did not come from the randomized algorithm making systematically better predictions—doing superior cognitive work, being more intelligent—but because we arbitrarily declared that an intelligent adversary could read our mind in one case but not the other.
This is not just a minor quibble. It leads to the testable prediction that on real-world problems, where the environment is usually not an adversarial telepath, the unrandomized weighted-majority algorithm should do better than the randomized version. (Provided that some component experts outperform maximum entropy—are right more than 50% of the time on binary problems.)
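This prediction is easy to check in simulation. Below is a rough sketch (it assumes the weighted_majority and randomized_weighted_majority sketches above; the expert count, accuracy, and trial numbers are illustrative choices of mine):

```python
import random

# Rough simulation of the testable prediction above: independent experts,
# each right with probability `accuracy`, on a non-adversarial stream.
def compare(n_experts=10, n_rounds=1000, accuracy=0.6, trials=100, seed=1):
    rng = random.Random(seed)
    det = rnd = 0
    for t in range(trials):
        outcomes = [rng.randint(0, 1) for _ in range(n_rounds)]
        preds = [[o if rng.random() < accuracy else 1 - o
                  for _ in range(n_experts)] for o in outcomes]
        det += weighted_majority(preds, outcomes)
        rnd += randomized_weighted_majority(preds, outcomes,
                                            rng=random.Random(seed + t))
    print("deterministic avg mistakes:", det / trials)
    print("randomized avg mistakes:   ", rnd / trials)
```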
Analyzing the worst-case scenario is standard practice in computational learning theory, but it makes the math do strange things. Moreover, the assumption of the worst case can be subtle; it is not always explicitly labeled. Consider the upper bound for the unrandomized weighted-majority algorithm. I did not use the term “worst case”—I merely wrote down some inequalities. In fact, Blum (1996), when initially introducing this bound, does not at first use the phrase “worst case”. The proof’s conclusion says only:
“Theorem 1. The number of mistakes M made by the Weighted Majority algorithm described above is never more than 2.41(m+lg n), where m is the number of mistakes made by the best expert so far.”
Key word: Never.
The worst-case assumption for the unrandomized algorithm was implicit in calculating the right-hand side of the inequality by assuming that, on each mistake, the total weight went down by only a factor of 1⁄4 (the minimum possible reduction), and that the total weight never decreased after any successful prediction. This is the absolute worst possible performance the weighted-majority algorithm can give.
The assumption implicit in those innocent-looking equations is that the environment carefully maximizes the anguish of the predictor: The environment so cleverly chooses its data samples, that on each case where the weighted-majority algorithm is allowed to be successful, it shall receive not a single iota of useful feedback—not a single expert shall be wrong. And the weighted-majority algorithm will be made to err on sensory cases that produce the minimum possible useful feedback, maliciously fine-tuned to the exact current weighting of experts. We assume that the environment can predict every aspect of the predictor and exploit it—unless the predictor uses a “random” number generator which we arbitrarily declare to be unknowable to the adversary.
What strange assumptions are buried in that innocent little inequality,
<=
Moreover, the entire argument in favor of the randomized algorithm was theoretically suspect from the beginning, because it rested on non-transitive inequalities. If I prove an upper bound on the errors of algorithm X, and then prove a smaller upper bound on the errors of algorithm Y, it does not prove that in the real world Y will perform better than X. For example, I prove that X cannot do worse than 1000 errors, and that Y cannot do worse than 100 errors. Is Y a better algorithm? Not necessarily. Tomorrow I could find an improved proof which shows that X cannot do worse than 90 errors. And then in real life, both X and Y could make exactly 8 errors.
4 <= 1,000,000,000
9 <= 10
But that doesn’t mean that 4 > 9.
So the next time you see an author remark, “We can further improve this algorithm using randomization...” you may not know exactly where to find the oddity. If you’d run across the above-referenced example (or any number of others in the machine-learning literature), you might not have known how to deconstruct the randomized weighted-majority algorithm. If such a thing should happen to you, then I hope that I have given you grounds to suspect that an oddity exists somewhere, even if you cannot find it—just as you would suspect a machine that proposed to extract work from a heat bath, even if you couldn’t keep track of all the gears and pipes.
Nominull put it very compactly when he said that, barring an environment which changes based on the form of your algorithm apart from its output, “By adding randomness to your algorithm, you spread its behaviors out over a particular distribution, and there must be at least one point in that distribution whose expected value is at least as high as the average expected value of the distribution.”
As I remarked in Perpetual Motion Beliefs:
I once knew a fellow who was convinced that his system of wheels and gears would produce reactionless thrust, and he had an Excel spreadsheet that would prove this—which of course he couldn’t show us because he was still developing the system. In classical mechanics, violating Conservation of Momentum is provably impossible. So any Excel spreadsheet calculated according to the rules of classical mechanics must necessarily show that no reactionless thrust exists—unless your machine is complicated enough that you have made a mistake in the calculations.
If you ever find yourself mathematically proving that you can do better by randomizing, I suggest that you suspect your calculations, or suspect your interpretation of your assumptions, before you celebrate your extraction of work from a heat bath.
What about Monte Carlo methods? There are many problems for which Monte Carlo integration is the most efficient method available.
(you are of course free to suggest and to suspect anything you like; I will, however, point out that suspicion is no substitute for building something that actually works...)
(Emphasis mine.) I know I’m late to the party, but I just noticed this. While what you say is true, “available” in this case means “that we know of”, not “that is possible”. I’m not an expert, but IIRC the point of MCI is that the functions are hard to analyze. You could integrate analogously without randomization (e.g., sample the integration domain on a regular grid), and it very well might work. The problem is that the function you integrate may happen to behave atypically on the regular grid with respect to its global behavior (e.g., if the function has some regular feature with a frequency that accidentally matches that of your grid, you might sample only local maxima, and severely overestimate the integral).
But for an individual function there certainly exists (for some values of “certainly”) an ideal sampling that yields an excellent integration, or some even better method that doesn’t need sampling. But, you need to understand the function very well, and search for that perfect evaluation strategy, and do this for every individual function you want to integrate. (Not a math expert here, but I suspect in the general case that grid would be impossible to find, even if it exists, and it’s probably very hard even for those where the solution can be found.)
So what MCI does is exactly what Eliezer mentions above: you randomize the sampling as well as you can, to avoid as much as possible tripping an unexpected “resonance” between the grid and the function’s behavior. In effect, you’re working against an environment you can’t overwhelm by being smart, so you brute force it and do your best to reduce the ways the environment can overwhelm you, i.e. you try to minimize the worst-case consequences.
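To make the “resonance” failure mode concrete, here is a toy illustration of my own (not from the thread): a regular grid whose spacing happens to equal the period of an oscillating integrand samples only the peaks, while plain Monte Carlo with the same number of points lands near the true value.

```python
import math, random

def f(x):
    # Oscillates with period 1/100; its true integral over [0, 1] is 0.5.
    return 0.5 + 0.5 * math.cos(2 * math.pi * 100 * x)

n = 100  # grid spacing accidentally equals the integrand's period
grid_estimate = sum(f(i / n) for i in range(n)) / n           # samples only peaks: ~1.0
mc_estimate = sum(f(random.random()) for _ in range(n)) / n   # ~0.5, give or take a few percent
print(grid_estimate, mc_estimate)
```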
Note that for functions we really understand, MCI is far from the best method. If you want to integrate a polynomial, you just compute the antiderivative, evaluate it at two points, subtract, and you’re done. MCI is nice because in general we don’t know how to find the antiderivative. In analogy with tic-tac-toe, integration would be the game, and each function is a different opponent; a polynomial is an adversary we can anticipate as well as Omega can model a human—in fact, we don’t even need to evaluate the function, we can deduce directly what the result of “its actions” will be; but most functions (probably almost all) are such difficult adversaries that we can’t do better than try to limit how badly they can screw us.
(I have a vague recollection of an anecdote about MCI: Someone somewhere was using it to do a complicated multiple integral, and got very silly results. Turns out, while the pseudo-random generator looked OK by itself, when its output was distributed throughout the complicated multi-dimensional space that was sampled, a non-obvious regularity in the generated numbers happened to match something the function did. So, in effect, the function beat them. In other words (ignoring that this is apocryphal), you don’t even need a very smart opponent to be screwed, a hard function is enough… That quantum noise generator can be very handy.)
Modern PRNGs tend to be less bad than that.
That’s not at all my domain of expertise, so I’ll take your word for it (though I guessed as much).
That said, I meant that as half funny anecdote, half “it’s not just theory, it’s real life, and you can get screwed hard by something that guesses your decision algorithm, even if it’s just a non-sentient function”, not as an argument for/against either MC or quantum noise. (Though, now that I think of it, these days there probably are quantum noise generators available somewhere.)
AFAIK, there are but they’re way too slow for use in MC, so decent-but-fast PRNGs are still preferred.
Also, sometimes you need to run several tests with the same random sequence for debugging purposes, which with true random numbers would require you to store all of them, whereas with pseudo-random numbers you just need to use the same seed.
Too bad that the default PRNG in many systems is still an ancient one.
Yes.
However, programming environments oriented towards numerical computing tend to use the Mersenne Twister as their default PRNG.
Mersenne Twister is the most commonly used algorithm. It is fast, and generates good results. It is not adversarial secure, but in that situation you should be using a cryptographically secure RNG.
Age of the PRNG algorithm has no bearing on this discussion. (If rand() uses something else, it’s for standards-compatibility reasons; nobody writing Monte Carlo algorithms would be using such PRNGs.)
Actually the C and POSIX standard impose very few requirements on rand(); it would be entirely possible in principle for it to be based on the Mersenne Twister while still complying to the standards.
It is getting late, so this may be way off, and I have not had time to read the paper.
This is also assuming finite trials right? Because over infinite trials if you have a non-zero probability of siding with the wrong group of classifiers, you will make infinite mistakes. No matter how small the probabilities go.
It seems it is trading a better expectation for a worse real worst case.
Also are you thinking of formalising an alternative to the infinite SI worst case you describe?
Um, maybe I misunderstood what you said or the proofs, but isn’t there a much simpler way to reject the proof of the superiority of the WMA, namely that the first proof is, as you say, a worst-case analysis… but the second proof is NOT? It uses expected value rather than worst case, right? So it’s not so much “worst case analysis does really weird things” but “not only that, the two proofs aren’t even validly comparable, because one is worst-case analysis and one involves expected-value analysis”.
Or did I horribly misunderstand either the second proof or your argument against it? (if so, I blame lack of sleep)
Psy, the “worst case expected value” is a bound directly comparable to the “worst case value” because the actual trajectory of expert weights is the same between the two algorithms.
Pearson, as an infinite set atheist I see no reason to drag infinities into this case, and I don’t recall any of the proofs talked about them either.
Has anyone proved a theorem on the uselessness of randomness?
Nominull’s suggestion:
“By adding randomness to your algorithm, you spread its behaviors out over a particular distribution, and there must be at least one point in that distribution whose expected value is at least as high as the average expected value of the distribution.”
Might be a starting point for such a proof. However it does rely on being able to find that “one point” whose expected value is >= the average expected value of the distribution. Perhaps it is easy to prove that good points lie in a certain range, but all obvious specific choices in that range are bad.
So, the randomized algorithm isn’t really better than the unrandomized one, because getting a bad result from the unrandomized one is only going to happen when your environment maliciously hands you a problem whose features match up just wrong with the non-random choices you make. So all you need to do is make those choices in a way that’s tremendously unlikely to match up just wrong with anything the environment hands you, because it doesn’t have the same sorts of patterns in it that the environment might inflict on you.
Except that the definition of “random”, in practice, is something very like “generally lacking the sorts of patterns that the environment might inflict on you”. When people implement “randomized” algorithms, they don’t generally do it by introducing some quantum noise source into their system (unless there’s a real adversary, as in cryptography), they do it with a pseudorandom number generator, which precisely is a deterministic thing designed to produce output that lacks the kinds of patterns we find in the environment.
So it doesn’t seem to me that you’ve offered much argument here against “randomizing” algorithms as generally practised; that is, having them make choices in a way that we confidently expect not to match up pessimally with what the environment throws at us.
Or, less verbosely:
Indeed randomness can improve the worst-case scenario, if the worst-case environment is allowed to exploit “deterministic” moves but not “random” ones. What “random” means, in practice, is: the sort of thing that typical environments are not able to exploit. This is not cheating.
No, it means “the sort of thing which is uncorrelated with the environment”. What you want is for your responses to be correlated with the environment in a way that benefits you.
I think this series of posts should contain a disclaimer at some point stating the usability of randomized algorithms in practice (even excluding situations like distributed computing and cryptography). In a broad sense, though randomness offers us no advantage information theoretically (this is what you seem to be saying), randomness does offer a lot of advantages computationally. This fact is not at odds with what you are saying, but deserves to be stated more explicitly.
There are many algorithmic tasks which become feasible via randomized algorithms, for which no feasible deterministic alternatives are known—until recently primality testing was one such problem (even now, the deterministic primality test is unusable in practice). Another crucial advantage of randomized algorithms is that in many cases it is much easier to come up with a randomized algorithm and later, if needed, derandomize the algorithm. Randomized algorithms are extremely useful in tackling algorithmic problems and I think any AI that has to design algorithms would have to be able to reason in this paradigm (cf. Randomized algorithms, Motwani and Raghavan or Probability and Computing, Mitzenmacher and Upfal).
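As one concrete illustration (my own sketch, not part of the original comment), the Miller-Rabin primality test is a standard example of such a randomized algorithm: each round either proves compositeness or leaves at most a 1/4 chance of error, so a handful of rounds suffices in practice.

```python
import random

def is_probably_prime(n, rounds=20):
    """Miller-Rabin primality test. Each round that fails to find a witness of
    compositeness leaves at most a 1/4 chance of error, so `rounds` rounds give
    an error probability of at most 4**-rounds."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2^r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True
```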
To put my point another way...
Say your experts, above, predict instance frequencies in populations. Huge populations, so you can only test expert accuracy by checking subsets of those populations.
So, you write an algorithm to choose which individuals to sample from populations when testing the experts’ beliefs. Aren’t there situations where the best algorithm, the one least likely to bias for any particular expert algorithm, would contain some random element?
I fear randomness is occasionally necessary just to fairly check the accuracy of beliefs about the world and overcome bias.
Eliezer: I recognize that the two algorithms have the exact same trajectory for the weights, that the only thing that’s different between them is the final output, not the weights. However, maybe I’m being clueless here, but I still don’t see that as justifying considering the two different proof results really comparable.
The real worst case analysis for the second algorithm would be more like “the random source just happened to (extremely unlikely, but...) always come up with the wrong answer.”
Obviously this sort of worst case analysis isn’t really fair to randomized algorithms, obviously you’d have to do something like “what’s the worst case probability of getting it wrong, blah blah blah” or some such. But then, clearly (to me, at least) it’s kinda cheating to then turn around and say “okay then, now let’s compare that to the actual worst case analysis of the deterministic algorithm and conclude the random algorithm is superior.”
On the other hand, I guess you could say that what’s being done is “let’s compare the expected value of the number of mistakes given the assumption of the most malicious possible data. In the deterministic version, that happens to be the actual actual worst case, and in the probabilistic version, well, obviously there’s other possibilities so it’s only the expected value for the worst case given the inputs.”
So I guess they are comparable after all.
But that only works given that one is creative about what we allow to be considered the “worst case”… that is, if we consider the worst case for the input data, but not the worst case from the random source of bits. Which… I guess was your point all along, that it’s due to the peculiar assumptions implicit in the notion of “worst case analysis”
(chuckles) Well, I guess this is a case in point with regard to the whole “disagreement debate”… it seems that when I’m in disagreement with you, I often end up, in the process of arguing my position, finding that I was the one in error.
Sure, randomness is the lazy way out. But if computational resources are not infinite, laziness is sometimes the smart way out. And I said “infinite”—even transhuman quantum computers are finite.
I agree with Psy-Kosh, I don’t think the proofs are comparable.
The difference between the two derivations is that #1 uses a weak upper bound for W, while #2 uses an equality. This means that the final bound is much weaker in #1 than in #2.
It’s possible to redo the derivation for the nonrandomized version and get the exact same bound as for the randomized case. The trick is to write M as the number of samples for which F_i > 1⁄2. Under weak assumptions on the performance of the experts you can show that this is less than Sum_i F_i.
Eliezer: thanks for the AI brain-teaser, I hope to see more posts like this in the future.
I agree with Psy-Kosh too. The key is, as Eliezer originally wrote, never. That word appears in Theorem 1 (about the deterministic algorithm), but it does not appear in Theorem 2 (the bound on the randomized algorithm).
Basically, this is the same insight Eliezer suggests, that the environment is being allowed to be a superintelligent entity with complete knowledge in the proof for the deterministic bound, but the environment is not allowed the same powers in the proof for the randomized one.
In other words, Eliezer’s conclusion is correct, but I don’t think the puzzle is as difficult as he suggests. I think Psy-Kosh (And Daniel) are right, that the mistake is in believing that the two theorems are actually about the same, identical, topic. They aren’t.
The question whether randomized algorithms can have an advantage over deterministic algorithms is called the P vs BPP problem. As far as I know, most people believe that P=BPP, that is, randomization doesn’t actually add power. However, nobody knows how to prove this. Furthermore, there exists an oracle relative to which P is not equal to BPP. Therefore any proof that P=BPP has to be nonrelativizing. This rules out a lot of the more obvious methods of proof.
see: complexity zoo
Nicely done. Quite a number of points raised that escape many people and which are rarely raised in practice.
Someone please tell me if I understand this post correctly. Here is my attempt to summarize it:
“The two textbook results are results specifically about the worst case. But you only encounter the worst case when the environment can extract the maximum amount of knowledge it can about your ‘experts’, and exploits this knowledge to worsen your results. For this case (and nearby similar ones) only, randomizing your algorithm helps, but only because it destroys the ability of this ‘adversary’ to learn about your experts. If you instead average over all cases, the non-random algorithm works better.”
Is that the argument?
I am interested in what Scott Aaronson says to this.
I am unconvinced, and I agree with both the commenters g and R above. I would say Eliezer is underestimating the number of problems where the environment gives you correlated data and where the correlation is essentially a distraction. Hash functions are, e.g., widely used in programming tasks and not just by cryptographers. Randomized algorithms often are based on non-trivial insights into the problem at hand. For example, the insight in hashing and related approaches is that “two (different) objects are highly unlikely to give the exact same result when (the same) random function (from a certain class of functions) is applied to both of them, and hence the result of this function can be used to distinguish the two objects.”
@ comingstorm: Quasi Monte Carlo often outperforms Monte Carlo integration for problems of small dimensionality involving “smooth” integrands. This is, however, not yet rigorously understood. (The proven bounds on performance for medium dimensionality seem to be extremely loose.)
Besides, MC doesn’t require randomness in the “Kolmogorov complexity == length” sense, but in the “passes statistical randomness tests” sense. Eliezer has, as far as I can see, not talked about the various definitions of randomness.
Fascinating post. I especially like the observation that the purpose of randomness in all these examples is to defeat intelligence. It seems that this is indeed the purpose for nearly every use of randomness that I can think of. We randomise presidents’ routes to defeat the intelligence of an assassin; a casino randomises its roulette wheels to defeat the intelligence of the gambler. Even in sampling, we randomise our sample to defeat our own intelligence, which we need to treat as hostile for the purpose of the experiment.
What about in compressed sensing where we want an incoherent basis? I was under the impression that random linear projections was the best you could do here.
kyb, see the discussion of quicksort on the other thread. Randomness is used to protect against worst-case behavior, but it’s not because we’re afraid of intelligent adversaries. It’s because worst-case behavior for quicksort happens a lot. If we had a good description of naturally occurring lists, we could design a deterministic pivot algorithm, but we don’t. We only have the observation that simple guess-the-median algorithms perform badly on real data. It’s not terribly surprising that human-built lists resonate with human-designed pivot algorithms; but the opposite scenario, where the simplex method works well in practice, is not surprising either.
Well, Eliezer seems to be stuck in the mud here: the only solutions to a problem that he can accept are ones that fit into Bayesian statistics and a logical syllogism. But he seems to be blithely unaware of the vast scope of possible approaches to solving any given problem. Not to mention the post seems to be a straw man, since I would find it hard to imagine that any real scientist or engineer would ever claim that randomness is categorically better than an engineered solution. But I guess that’s because I associate and work with real engineers and scientists, so what would I know. Not to mention the statement that a randomized algorithm can perform better is true only to a limited extent. The statement about the randomized algorithm being better is a specific statement about that algorithm, and I would imagine the people making it would agree that one could engineer a better solution. I can’t imagine any real engineer making the claim that the randomized algorithm will always perform better than the engineered one.
I agree with Psy, the bounds are not comparable.
In fact, the bound for #2 should be the same as the one for #1. I mean, in the true worst case, the randomized algorithm makes exactly the same predictions as the deterministic one. I advise blaming and demolishing your quantum source of random numbers in this case.
Tentative, but what about cases where you don’t trust your own judgement and suspect that you need to be shaken out of your habits?
What about Monte Carlo methods? There are many problems for which Monte Carlo integration is the most efficient method available.
Monte Carlo methods can’t buy you any correctness. They are useful because they allow you to sacrifice an unnecessary bit of correctness in order to give you a result in a much shorter time on an otherwise intractable problem. They are also useful to simulate the effects of real-world randomness (or at least behavior you have no idea how to systematically predict).
So, for example, I used a Monte Carlo script to determine expected scale economies for print order flow in my business. Why? Because it’s simple and the behavior I am modeling is effectively random to me. I could get enough information to make a simulation that gives me 95% accuracy with a few hours of research and another few hours of programming time. Of course there is somewhere out there a non-randomized algorithm that could do a more accurate job with a faster run time, but the cost of discovering it and coding it would be far more than a day’s work, and 95% accuracy on a few dozen simulations was good enough for me to estimate more accurately than most of my competition, which is all that mattered. But Eliezer’s point stands. Randomness didn’t buy me any accuracy, it was a way of trading accuracy for development time.
Congratulations Eliezer, this subject is very interesting.
Nonetheless useful, i.e. Monte Carlo methods with pseudo-random numbers ARE actually deterministic, but we discard the details (very complex and not useful for the application) and only care about some statistical properties. This is a state of subjective partial information, the business of Bayesian probability, only this partial information is deliberate! (Or better, a pragmatic compromise due to limitations in computational power. Otherwise we’d just use classical numerical integration happily in tens of dimensions.)
This Bayesian perspective on “random” algorithms is not something usually found. I think the frequentist interpretation is still permeating (and dragging) most research. Indeed, more Jaynes is needed.
As stated, the statement “Randomness didn’t buy me any accuracy, it was a way of trading accuracy for development time” is a complete non sequitur, since nobody actually believes that randomness is a means of increased accuracy; in the long term it is used as a performance booster. It merely means you can get the answer faster with less knowledge, which we do all the time, since there are a plethora of systems that we can’t understand.
Clearly you don’t recognize the significance of Eliezer’s work. He cannot be bound by such trivialities as ‘proof’ or ‘demonstration’. They’re not part of the New Rationality.
Eliezer, can you mathematically characterize those environments where randomization can help. Presumably these are the “much more intelligent than you, and out to get you” environments, but I’d like to see that characterized a bit better.
A slight tangent here, but Blum loses me even before the considerations that Eliezer has so clearly pointed out. Comparing an upper bound for the worst case of one algorithm with an upper bound for the average case of the other is meaningless. Even were like things being compared, mere upper bounds of the true values are only weak evidence of the quality of the algorithms.
FWIW, I ran a few simulations (for the case of complete independence of all experts from themselves and each other), and unsurprisingly, the randomised algorithm performed on average worse than the unrandomised one. In addition, the weighting algorithm converges to placing all of the weight on the best expert. If the experts’ errors are at all independent, this is suboptimal.
Actual performance of both algorithms was far better than the upper bounds, and rather worse than one using optimal weights. I cannot see these upper bounds as anything more than a theoretical irrelevance.
Anyone who evaluates the performance of an algorithm by testing it with random data (e.g., simulating these expert-combining algorithms with randomly-erring “experts”) is ipso facto executing a randomized algorithm...
g: Indeed I was. Easier than performing an analytical calculation and sufficiently rigorous for the present purpose. (The optimal weights, on the other hand, are log(p/(1-p)), by an argument that is both easier and more rigorous than a simulation.)
Richard, I wasn’t suggesting that there’s anything wrong with your running a simulation, I just thought it was amusing in this particular context.
Silas—yeah, that’s about the size of it.
Eliezer, when you come to edit this for popular publication, lose the maths, or at least put it in an endnote. You’re good enough at explaining concepts that if someone’s going to get it, they’re going to get it without the equations. However, a number of those people will switch off there and then. I skipped it and did fine, but algebra is a stop sign for a number of very intelligent people I know.
I am interested in what Scott Aaronson says to this.
I fear Eliezer will get annoyed with me again :), but R and Stephen basically nailed it. Randomness provably never helps in average-case complexity (i.e., where you fix the probability distribution over inputs) -- since given any ensemble of strategies, by convexity there must be at least one deterministic strategy in the ensemble that does at least as well as the average.
On the other hand, if you care about the worst-case running time, then there are settings (such as query complexity) where randomness provably does help. For example, suppose you’re given n bits, you’re promised that either n/3 or 2n/3 of the bits are 1’s, and your task is to decide which. Any deterministic strategy to solve this problem clearly requires looking at 2n/3 + 1 of the bits. On the other hand, a randomized sampling strategy only has to look at O(1) bits to succeed with high probability.
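A minimal sketch of the kind of sampling strategy Scott describes (the sample size of 48 is my illustrative choice, not his): inspect a constant number of random positions and guess from the observed fraction; the number of samples needed for any fixed error probability does not grow with n.

```python
import random

def guess_fraction(bits, samples=48):
    """bits is promised to contain either n/3 or 2n/3 ones; guess which,
    by inspecting a constant number of random positions (independent of n)."""
    ones = sum(bits[random.randrange(len(bits))] for _ in range(samples))
    return "2n/3" if ones > samples / 2 else "n/3"
```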
Whether randomness ever helps in worst-case polynomial-time computation is the P versus BPP question, which is in the same league as P versus NP. It’s conjectured that P=BPP (i.e., randomness never saves more than a polynomial). This is known to be true if really good pseudorandom generators exist, and such PRG’s can be constructed if certain problems that seem to require exponentially large circuits, really do require them (see this paper by Impagliazzo and Wigderson). But we don’t seem close to proving P=BPP unconditionally.
Just because you refuse to research probabilistic/random algorithms doesn’t mean they don’t work.
Do some background work for once.
@ Scott Aaronson. Re: your n-bits problem. You’re moving the goalposts. Your deterministic algorithm determines with 100% accuracy which situation is true. Your randomized algorithm only determines with “high probability” which situation is true. These are not the same outputs.
You need to establish goal with a fixed level of probability for the answer, and then compare a randomized algorithm to a deterministic algorithm that only answers to that same level of confidence.
That’s the same mistake that everyone always makes, when they say that “randomness provably does help.” It’s a cheaper way to solve a different goal. Hence, not comparable.
Scott, I don’t dispute what you say. I just suggest that the confusing term “in the worst case” be replaced by the more accurate phrase “supposing that the environment is an adversarial superintelligence who can perfectly read all of your mind except bits designated ‘random’”.
Don: When you fix the goalposts, make sure someone can’t kick the ball straight in! :-) Suppose you’re given an n-bit string, and you’re promised that exactly n/4 of the bits are 1, and they’re either all in the left half of the string or all in the right half. The problem is to decide which. It’s clear that any deterministic algorithm needs to examine at least n/4 + 1 of the bits to solve this problem. On the other hand, a randomized sampling algorithm can solve the problem with certainty after looking at only O(1) bits on average.
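For concreteness, a sketch of both strategies for this promise problem (illustrative code, not Scott’s): the deterministic scan can be forced to spend n/4 + 1 queries, while random probing answers with certainty after about four queries in expectation (each probe hits a 1 with probability 1/4, as worked out further down the thread), independent of n.

```python
import random

def deterministic_scan(bits):
    """Scan the first n/4 + 1 positions of the left half. If the ones are in the
    left half, at least one must appear among them; if all are 0, the ones must
    be on the right. Worst case: n/4 + 1 queries."""
    n = len(bits)
    queries = 0
    for i in range(n // 4 + 1):
        queries += 1
        if bits[i] == 1:
            return "left", queries
    return "right", queries

def random_probe(bits, rng=random.Random()):
    """Probe uniformly random positions; each probe hits a 1 with probability
    1/4, so the expected number of queries is 4, independent of n. The answer
    is certain as soon as a 1 is found."""
    n = len(bits)
    queries = 0
    while True:
        queries += 1
        i = rng.randrange(n)
        if bits[i] == 1:
            return ("left" if i < n // 2 else "right"), queries
```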
Eliezer: I often tell people that theoretical computer science is basically mathematicized paranoia, and that this is the reason why Israelis so dominate the field. You’re absolutely right: we do typically assume the environment is an adversarial superintelligence. But that’s not because we literally think it is one, it’s because we don’t presume to know which distribution over inputs the environment is going to throw at us. (That is, we lack the self-confidence to impose any particular prior on the inputs.) We do often assume that, if we generate random bits ourselves, then the environment isn’t going to magically take those bits into account when deciding which input to throw at us. (Indeed, if we like, we can easily generate the random bits after seeing the input—not that it should make a difference.)
Average-case analysis is also well-established and used a great deal. But in those cases where you can solve a problem without having to assume a particular distribution over inputs, why complicate things unnecessarily by making such an assumption? Who needs the risk?
EDIT: OK, I found the explanation further down, seems I was on the right path.
Sorry Scott, I know I’m late to the party, but this got me curious. Can you explain this, or point me to an explanation? I’m not sure I got the correct numbers.
As far as I can tell, a reasonable deterministic algorithm is to examine, one by one, at most the (n/4 + 1) bits (e.g., the first); but if you find a 1 early, you stop. The nondeterministic one would be to just test bits in random order until you see a 1.
For the deterministic version, in half the cases you’re going to evaluate all those bits when they’re in the “end” half.
If the bits are in the “front” half, or if the algorithm is random, I can’t figure out exactly how to solve the average number of bits to test, but it seems like the deterministic half should do on average about as well* as the nondeterministic one on a problem half the size, and in the worst case scenario it still does (n/4 + 1) bits, while the nondeterministic algorithm has to do (n/2 + 1) tests in the worst-case scenario unless I’m mistaken.
I mean, it does seem like the probability of hitting a bad case is low on average, but the worst case seems twice as bad, even though harder to hit.
I’m not sure how you got O(1); I got to a sum of reciprocals of squares up to k for the probability of the first 1 bit being the k-th tested, which sounds about O(1) on average, but my probabilities are rusty, is that the way?
(* of course, it’s easy to construct the bad case for the deterministic algorithm if you know which bits it tests first.)
the environment is an adversarial superintelligence
Oh, Y’all are trying to outwit Murphy? No wonder you’re screwed.
Could Scott_Aaronson or anyone who knows what he’s talking about, please tell me the name of the n/4 left/right bits problem he’s referring to, or otherwise give me a reference for it? His explanation doesn’t seem to make sense: the deterministic algorithm needs to examine 1+n/4 bits only in the worst case, so you can’t compare that to the average output of the random algorithm. (Average case for the deterministic would, it seems, be n/8 + 1.) Furthermore, I don’t understand how the random method could average out to a size-independent constant.
Is the randomized algorithm one that uses a quantum computer or something?
Silas: “Solve” = for a worst-case string (in both the deterministic and randomized cases). In the randomized case, just keep picking random bits and querying them. After O(1) queries, with high probability you’ll have queried either a 1 in the left half or a 1 in the right half, at which point you’re done.
As far as I know this problem doesn’t have a name. But it’s how (for example) you construct an oracle separating P from ZPP.
@Scott_Aaronson: Previously, you had said the problem is solved with certainty after O(1) queries (which you had to, to satisfy the objection). Now, you’re saying that after O(1) queries, it’s merely a “high probability”. Did you not change which claim you were defending?
Second, how can the required number of queries not depend on the problem size?
Finally, isn’t your example a special case of exactly the situation Eliezer_Yudkowsky describes in this post? In it, he pointed out that the “worst case” corresponds to an adversary who knows your algorithm. But if you specifically exclude that possibility, then a deterministic algorithm is just as good as the random one, because it would have the same correlation with a randomly chosen string. (It’s just like the case in the lockpicking problem: guessing all the sequences in order has no advantage over randomly picking and crossing off your list.) The apparent success of randomness is again due to, “acting so crazy that a superintelligent opponent can’t predict you”.
Which is why I summarize Eliezer_Yudkowsky’s position as: “Randomness is like poison. Yes, it can benefit you, but only if you use it on others.”
Silas: Look, as soon you find a 1 bit, you’ve solved the problem with certainty. You have no remaining doubt about the answer. And the expected number of queries until you find a 1 bit is O(1) (or 2 to be exact). Why? Because with each query, your probability of finding a 1 bit is 1⁄2. Therefore, the probability that you need t or more queries is (1/2)^(t-1). So you get a geometric series that sums to 2.
(Note: I was careful to say you succeed with certainty after an expected number of queries independent of n—not that there’s a constant c such that you succeed with certainty after c queries, which would be a different claim!)
And yes, I absolutely assume that the adversary knows the algorithm, because your algorithm is a fixed piece of information! Once again, the point is not that the universe is out to get us—it’s that we’d like to succeed even if it is out to get us. In particular, we don’t want to assume that we know the probability distribution over inputs, and whether it might happen to be especially bad for our algorithm. Randomness is useful precisely for “smearing out” over the set of possible environments, when you’re completely ignorant about the environment.
If you’re not to think this way, the price you pay for Bayesian purity is to give up whole theory of randomized algorithms, with all the advances it’s led to even in deterministic algorithms; and to lack the conceptual tools even to ask basic questions like P versus BPP.
It feels strange to be explaining CS101 as if I were defending some embattled ideological claim—but it’s good to be reminded why it took people a long time to get these things right.
Quick check: take algorithm that tests bits 1, n/2+1, 3, n/2+3, 5, n/2+5 … then continues alternating halves on even indices. Of course, it stops at the first 1. And it’s obviously deterministic.
But unless I’m mistaken, averaged over all possible instances of the problem, this algorithm has exactly the same average, best case and worst case as the nondeterministic one. (In fact, I think this holds for any fixed sequence of test indices, as long as it alternates between halves, such as 1, n, 2, n-1, 3, n-2… Basically, my reasoning is that an arbitrary index sequence should have (on average) the same probability of finding the answer in k tests on a random problem that a random index sequence does.)
I know you said it’s adversarial, and it’s clear that (assuming you know the sequence of tested indices) you can construct problem instances that take n/2+1 (worst-case) tests, in which case the problem isn’t random. I just want to test if I reasoned correctly.
If I’m right, then I think what you’re saying is exactly what Eliezer said: If you know the adversary can predict you, you’ve got no better option than to act randomly. (You just don’t do as bad in this case as in Eliezer’s example.) Conversely, if you can predict the adversary, then the random algorithm completely wastes this knowledge. (I.e., if you could guess more about the sequence, you could probably do better than n/4+1 worst case, and if you can predict the adversary then you don’t even need to test the bits, which is basically O(0).)
Isn’t the probability of finding a 1 bit 1⁄4 in each sample? Because you have to sample both sides. If you randomly sample without repetitions you can do even better than that. You can even bound the worst case for the random algorithm at n/4+1 as well by seeing if you ever pick n/4+1 0s from any side. So there is no downside to doing it randomly as such.
However if it is an important problem and you think you might be able to find some regularities, the best bet would to be to do bayesian updates on the most likely positions to be ones and preferentially choose those. You might be bitten by a super intelligence predicting what you think most probable, and doing the opposite, but as you have made the system more complex the world has to be more complex to outsmart you, which may be less likely. You can even check this by seeing whether you are having to do n/4+1 samples or close to it on average.
Will: Yeah, it’s 1⁄4, thanks. I somehow have a blind spot when it comes to constants. ;-)
Let’s say the solver has two input tapes: a “problem instance” input and a “noise source” input.
When Eli says “worst-case time”, he means the maximum time, over all problem instances and all noise sources, taken to solve the problem.
But when Scott says “worst-case time,” he means taking the maximum (over all problem instances X) of the average time (over all noise sources) to solve X.
Scott’s criterion seems relevant to real-world situations where one’s input is being generated by an adversarial human who knows which open-source software you’re using but doesn’t know your random seed (which was perhaps generated by moving your mouse around). Here I don’t think Eli and Scott have any real disagreement, since Eli agrees that randomness can be useful in defeating an adversarial intelligence.
BTW, I think I have a situation where randomness is useful without postulating an adversary (you could argue that the adversary is still present though).
Say we’re quicksorting a very long list of length N; the worst case is pretty bad, and the best case is not much better than average. You have a Kolmogorov (complexity-based) distribution for your input, and you have a long string which you believe to be maximally compressed (i.e., it’s really, really random). If your quicksort code is short enough compared to N, then the worst-case input will have complexity << N, since it can be concisely described as the worst-case input for your sorting algorithm. If you use your random string to permute the list, then it will (probably) no longer have an easy-to-specify worst-case input, so your average case may improve.
It sounds to me like Peter is suggesting to randomly permute all inputs to the sorting algorithm, in hopes of improving the average-case runtime. The justification appears to be that the worst case has low complexity because it’s easy to describe: you can describe it by saying “it’s the worst-case input for this particular sorting algorithm”.
For example, the worst-case input for Quicksort (at least for the naive version that always picks the first element as pivot) is an already-sorted list. You could, indeed, get better average performance out of quicksort if you could randomly permute its inputs in zero time, because code is more likely to produce already-sorted lists than lists sorted, say, by { x mod 17 }.
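A toy sketch of the permute-then-sort idea under discussion (my code; the function names are made up): a quicksort that always picks the first element as pivot degrades to quadratic time, and deep recursion, on an already-sorted list, while shuffling the input first makes that particular input behave like an average one.

```
import random

def naive_quicksort(xs):
    """Quicksort with the first element as pivot: quadratic (and deeply
    recursive) on an already-sorted input."""
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    return (naive_quicksort([x for x in rest if x < pivot])
            + [pivot]
            + naive_quicksort([x for x in rest if x >= pivot]))

def randomized_quicksort(xs, rng=random):
    """Peter's trick: permute the input first, so no fixed input (such as an
    already-sorted list) is reliably the bad case for the fixed pivot rule."""
    xs = list(xs)
    rng.shuffle(xs)
    return naive_quicksort(xs)

# Calling naive_quicksort on this directly would recurse ~1000 levels deep;
# the shuffled version stays at roughly logarithmic depth.
already_sorted = list(range(1000))
assert randomized_quicksort(already_sorted) == already_sorted
```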
My initial reaction was that the implicit description “that input which is worst case for this algorithm” might specify an input which was not unusually likely to occur in practice. Peter is saying that it is more likely to occur in practice, even if it’s an implicit description—even if we don’t know what the worst-case input is—because it has lower complexity in an absolute sense, and so more processes in the real world will generate that input than one with a higher complexity.
Is that right?
And no fast deterministic permutation can be used instead, because the deterministic permutation could be specified concisely.
In practice, our pseudo-random number generators use a seed of 16 to 32 bits, so they introduce almost no complexity at all by this criterion. No such generator can be useful for this purpose, for the same reason no deterministic permutation can be useful.
Still, I think this is a conclusive proof that randomness can be useful.
That last paragraph is pretty unsettling—you get penalized for not throwing a messy obfuscatory step into yourself? A Kolmogorov distribution indeed does seem to be a “present adversary”!
@michael e sullivan: re “Monte Carlo methods can’t buy you any correctness”: actually, they can. If you have an exact closed-form solution (or a rapidly converging series, or whatever) for your numbers, you really want to use it. However, not all problems have such a thing; generally, you either simplify (getting a precise but incorrect number that is readily computable and hopefully close), or you do a numerical evaluation, which can approach the correct solution arbitrarily closely depending on how much computation you devote to it.
Quadrature (the straightforward way to do numerical integration using regularly-spaced samples) is a numeric evaluation method which is efficient for smooth, low-dimensional problems. However, for higher-dimensional problems, the number of samples becomes impractical. For such difficult problems, Monte Carlo integration actually converges faster, and can sometimes be the only feasible method.
Somewhat ironically, one field where Monte Carlo buys you correctness is numeric evaluation of Bayesian statistical models!
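As a purely illustrative sketch of the dimensionality point (my choice of integrand, nothing canonical): a regular grid with even 4 points per axis in 10 dimensions already needs 4^10 ≈ 10^6 evaluations, while the Monte Carlo error shrinks like 1/sqrt(samples) regardless of dimension.

```
import math
import random

def f(x):
    # An arbitrary smooth integrand over the unit cube (my choice, purely illustrative).
    return math.exp(-sum(xi * xi for xi in x))

def monte_carlo_mean(f, dim, num_samples, rng):
    """Estimate the mean of f over [0,1]^dim with uniform random samples.
    The standard error scales like 1/sqrt(num_samples), independent of dim."""
    total = 0.0
    for _ in range(num_samples):
        total += f([rng.random() for _ in range(dim)])
    return total / num_samples

rng = random.Random(0)
print(monte_carlo_mean(f, dim=10, num_samples=100_000, rng=rng))
```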
Silas is right; Scott keeps changing the definition in the middle, which was exactly my original complaint.
For example, Scott says: “In the randomized case, just keep picking random bits and querying them. After O(1) queries, with high probability you’ll have queried either a 1 in the left half or a 1 in the right half, at which point you’re done.”
And yet this is no different from a deterministic algorithm. It can also query O(1) bits, and “with high probability” have a certain answer.
I’m really astonished that Scott can’t see the sleight-of-hand in his own statement. Here’s how he expresses the challenge: “It’s clear that any deterministic algorithm needs to examine at least n/4 + 1 of the bits to solve this problem. On the other hand, a randomized sampling algorithm can solve the problem with certainty after looking at only O(1) bits on average.”
Notice that tricky final phrase, on average? That’s vastly weaker than what he is forcing the deterministic algorithm to do. The “proof” that a deterministic algorithm requires n/4+1 queries requires a goal much more difficult than getting a quick answer “on average”.
If you’re willing to consider a goal of answering quickly only “on average”, then a deterministic algorithm can also do it just as fast (on average) as your randomized algorithm.
Don, how well a deterministic algorithm does depends upon what the deterministic algorithm is and what the input is; it cannot always get away with O(1) queries.
E.g. in a 20 bit case
If the deterministic algorithm starts its queries from the beginning, querying each bit in turn, and the input pattern is always 00000111110000000000
It will always take n/4+1 queries. Only when the input is random will it on average take O(1) queries.
The random one will take O(1) on average on every type of input.
Don: the “sleight of hand”, as you put it, is that (as I think has been said before) we’re examining the worst case. We are explicitly allowing the string of 1s and 0s to be picked by an adversarial superintelligence who knows our strategy. Scott’s algorithm only needs to sample 4 bits on average to answer the question even when the universe is out to get him; the deterministic version will need to sample exactly n/4+1.
Basically, Eliezer seems to be claiming that BPP = P (or possibly something stronger), which most people think is probably true, but, as has been said, no-one can prove. I for one accept his intuitive arguments as, I think, do most people who’ve thought about it, but proving this intuition rigorously is a major outstanding problem in computational complexity.
My supervisor’s intuitive argument for why you can never get anything from randomness is that in any particular case where randomness appears to help, you can just pick a pseudo-random source which is “random enough” (presumably a formalisation of this intuition is what Scott’s talking about when he says that BPP = P if “really good pseudorandom generators exist”).
To generalize Peter’s example, a typical deterministic algorithm has low Kolmogorov complexity, and therefore its worst-case input also has low Kolmogorov complexity and therefore a non-negligible probability under complexity-based priors. The only possible solutions to this problem I can see are:
1. add randomization
2. redesign the deterministic algorithm so that it has no worst-case input
3. do a cost-benefit analysis to show that the cost of doing either 1 or 2 is not justified by the expected utility of avoiding the worst-case performance of the original algorithm, then continue to use the original deterministic algorithm
The main argument in favor of 1 is that its cost is typically very low, so why bother with 2 or 3? I think Eliezer’s counterargument is that 1 only works if we assume that in addition to the input string, the algorithm has access to a truly random string with a uniform distribution, but in reality we only have access to one input, i.e., sensory input from the environment, and the so called random bits are just bits from the environment that seem to be random.
My counter-counterargument is to consider randomization as a form of division of labor. We use one very complex and sophisticated algorithm to put a lower bound on the Kolmogorov complexity of a source of randomness in the environment, then after that, this source of randomness can be used by many other simpler algorithms to let them cheaply and dramatically reduce the probability of hitting a worst-case input.
Or to put it another way, before randomization, the environment does not need to be a malicious superintelligence for our algorithms to hit worst-case inputs. After randomization, it does.
(I agree that this is one among many valid cases of when we might want to just throw in randomization to save thinking time, as opposed to doing a detailed derandomized analysis.)
Quantum branching is “truly random” in the sense that branching the universe both ways is an irreducible source of new indexical ignorance. But the more important point is that unless the environment really is out to get you, you might be wiser to exploit the ways that you think the environment might depart from true randomness, rather than using a noise source to throw away this knowledge.
I’m not making any nonstandard claims about computational complexity, only siding with the minority “derandomize things” crowd.
Note that I also enthusiastically belong to a “derandomize things” crowd! The difference is, I think derandomizing is hard work (sometimes possible and sometimes not), since I’m unwilling to treat the randomness of the problems the world throws at me on the same footing as randomness I generate myself in the course of solving those problems. (For those watching at home tonight, I hope the differences are now reasonably clear...)
I certainly don’t say “it’s not hard work”, and the environmental probability distribution should not look like the probability distribution you have over your random numbers—it should contain correlations and structure. But once you know what your probability distribution is, then you should do your work relative to that, rather than assuming “worst case”. Optimizing for the worst case in environments that aren’t actually adversarial, makes even less sense than assuming the environment is as random and unstructured as thermal noise.
I would defend the following sort of statement: While often it’s not worth the computing power to take advantage of all the believed-in regularity of your probability distribution over the environment, any environment that you can’t get away with treating as effectively random, probably has enough structure to be worth exploiting instead of randomizing.
(This isn’t based on career experience, it’s how I would state my expectation given my prior theory.)
once you know what your probability distribution is...
I’d merely stress that that’s an enormous “once.” When you’re writing a program (which, yes, I used to do), normally you have only the foggiest idea of what a typical input is going to be, yet you want the program to work anyway. This is not just a hypothetical worry, or something limited to cryptography: people have actually run into strange problems using pseudorandom generators for Monte Carlo simulations and hashing (see here for example, or Knuth vol 2).
Even so, intuition suggests it should be possible to design PRG’s that defeat anything the world is likely to throw at them. I share that intuition; it’s the basis for the (yet-unproved) P=BPP conjecture.
“Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin.”—von Neumann
Even if P=BPP, that just means that giving up randomness causes “only” a polynomial slowdown instead of an exponential one, and in practice we’ll still need to use pseudorandom generators to simulate randomness.
It seems clear to me that noise (in the sense of randomized algorithms) does have power, but perhaps we need to develop better intuitions as to why that is the case.
Eliezer: There are a few classes of situations I can think of in which it seems like the correct solution (given certain restrictions) requires a source of randomness:
One example would be Hofstadter’s Luring Lottery. Assuming all the agents really did have common knowledge of each other’s rationality, and assuming no form of communication is allowed between them in any way, isn’t it true that no deterministic algorithm to decide how many entries to send in, if any at all, has a higher expected return than a randomized one? (that is, the solutions which are better are randomized ones, rather than randomized ones automatically being better solutions)
The reason being that you’ve got a prisoner’s dilemma type situation there, with all the agents using the same decision algorithm. So any deterministic algorithm means they all do the same thing, which means many entries are sent in, which means the pot shrinks, while not at all boosting any individual agent’s expected winnings.
My shot at improving on randomness for things like the Luring Lottery: http://lesswrong.com/lw/vp/worse_than_random/18yi
The leaders in the NetflixPrize competition for the last couple years have utilized ensembles of large numbers of models with a fairly straightforward integration procedure. You can only get so far with a given model, but if you randomly scramble its hyperparameters or training procedure, and then average multiple runs together, you will improve your performance. The logical path forward is to derandomize this procedure, and figure out how to predict, a priori, which model probabilities become more accurate and which don’t. But of course until you figure out how to do that, random is better than nothing.
As a process methodology, it seems useful to try random variations, find the ones that outperform and THEN seek to explain it.
@ Will: You happen to have named a particular deterministic algorithm; that doesn’t say much about every deterministic algorithm. Moreover, none of you folks seem to have noticed that pseudo-random algorithms are actually deterministic too...
Finally, you say: “Only when the input is random will [the deterministic algorithm] on average take O(1) queries. The random one will take O(1) on average on every type of input.”
I can’t tell you how frustrating it is to see you just continue to toss in the “on average” phrase, as though it doesn’t matter, when in fact it’s the critical part.
To be blunt: you have no proof that all deterministic algorithms require n/4+1 steps on average. You only have a proof that deterministic algorithms require n/4+1 steps in the worst case, full stop. Not “in the worst case, on average”.
DonGeddis is missing the point. Randomness does buy power. The simplest example is sampling in statistics: to estimate the fraction of red-headed people in the world to within 5% precision and with confidence 99%, it is enough to draw a certain number of random samples and compute the fraction of red-headed people in the sample. The number of samples required is independent of the population size. No deterministic algorithm can do this, not even if you try to simulate samples by using good PRNGs, because there are “orderings” of the world’s population that would fool your algorithm. But if you use random bits that are independent of the data, you cannot be fooled. (By the way, whether ‘true randomness’ is really possible in the real world is a totally different issue.)
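A sketch of that sampling claim, using the standard Hoeffding bound (my choice of tool, not the commenter’s): the sample size needed for ±5% precision at 99% confidence is about a thousand, whether the population is a million or a billion.

```
import math
import random

def hoeffding_sample_size(epsilon, delta):
    """Samples needed so that P(|estimate - true fraction| > epsilon) <= delta,
    via the Hoeffding bound 2*exp(-2*n*epsilon^2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

def estimate_fraction(population, epsilon=0.05, delta=0.01, rng=random):
    """Estimate the fraction of True entries by sampling with replacement;
    the sample size does not depend on len(population)."""
    n = hoeffding_sample_size(epsilon, delta)
    hits = sum(rng.choice(population) for _ in range(n))
    return hits / n

print(hoeffding_sample_size(0.05, 0.01))  # ~1060, whether there are a million or a billion people

pop_rng = random.Random(7)
population = [pop_rng.random() < 0.02 for _ in range(1_000_000)]  # ~2% "red-headed"
print(estimate_fraction(population))      # within 0.05 of the true fraction, with probability >= 0.99
```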
Note that, for decision problems, randomness is not believed to give you a more-than-polynomial speedup (this is the P vs BPP question). But neither sampling nor the promise problem Scott mentioned are decision problems (defined on all inputs).
Regarding your claim that you cannot compare ‘worst-case time’ for deterministic algorithms with ‘worst-case expected time’ for randomized ones: this is totally pointless and shows a total lack of understanding of randomization. No one claims that you can have a randomized algorithm with success probability one that outperforms the deterministic algorithm. So either we use the expected running time of the randomized algorithm, or we allow a small error probability.
We can choose either one and use the same definition of the problem for both deterministic and randomized. Take Scott’s example and suppose we want algorithms that succeed with probability at least 2⁄3 on every input. The goal is the same in both cases, so they are comparable. In the case of deterministic algorithms, this implies always getting the correct answer. DonGeddis seems not to realize this, and contrary to what he says, any deterministic algorithm, no matter how clever, needs to look at n/4+1 elements in the worst case. (By the way, maybe DonGeddis’s problem is that he doesn’t see the proof of this fact, but it is easily provable with standard techniques.) However, a randomized algorithm that makes a worst-case constant number of queries will succeed with probability 2⁄3. If we want a higher success probability, we can just repeat a constant number of times. Note that I’m not talking about the expected complexity but about worst-case complexity, if the former bothers you; this comes at the expense of a small error probability.
As for Blum’s paper, it shows that the best bounds we can prove for randomized algorithms in that particular problem outperform the bounds we can prove for deterministic algorithms. Sure, that does not imply the former are better, but it’s the best bound we can prove. To prove that the randomized algorithm in this case has a better constant, one needs to find a lower bound for the deterministic algorithm.
Wait a minute. There are also orderings of the world population that will randomly fool the random algorithm. (That’s where the 99% comes from.) Or, more precisely, there are samplings generated by your random algorithm that get fooled by the actual ordering of the world population.
If you write a deterministic algorithm that samples the exact number of samples as the random one does, but instead of sampling randomly you sample deterministically, spreading the samples in a way you believe to be appropriately uniform, then, assuming the human population isn’t conspiring to trick you, you should get exactly the same 99% confidence and 5% precision.
(Of course, if you screw up spreading the samples, it means you did worse than random, so there’s something wrong with your model of the world. And if the entire world conspires to mess up your estimate of red-hair frequency, and succeeds in doing so without you noticing and altering your sampling, you’ve got bigger problems.)
For example, once n is found (the number of tests needed), you list all humans in order of some suitable tuple of properties (e.g., full name, date and time of birth, population of home city/town, etc.), and then pick n people equally spaced in this list. You’re still going to find the answer, with the same precision and confidence, unless the world population conspired to distribute itself over the world in exactly the right way to mess up your selection. There are all sorts of other selection procedures that are deterministic (they depend only on the population), but are ridiculously unlikely to have any correlation with hair color.
Another example: order all cities/towns/etc. by GDP; group them into groups of ~N/n people; pick from each group the youngest person. (I don’t think n is high enough for this to matter, but if it happens that a group contains a single city with more than N/n population, pick the two, three, etc. youngest persons.) If you’re concerned that people all over the world recently started planning their babies just to mess with you, then pick the median-by-age person instead.
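A rough sketch of that sort of deterministic, equally-spaced selection (my code, run on a synthetic population; the sort key is a stand-in for the “suitable tuple of properties”):

```
import random

def equally_spaced_sample(population, n, key):
    """Deterministic sampling: sort by a key believed to be uncorrelated with the
    property of interest, then take n items at equal intervals in that order."""
    ordered = sorted(population, key=key)
    step = len(ordered) / n
    return [ordered[int(i * step)] for i in range(n)]

# Synthetic stand-in population: (id, birth timestamp, is_red_headed).
pop_rng = random.Random(3)
records = [("person%d" % i, pop_rng.randrange(10**9), pop_rng.random() < 0.02)
           for i in range(500_000)]

sample = equally_spaced_sample(records, n=1060, key=lambda r: (r[1], r[0]))
print(sum(r[2] for r in sample) / len(sample))  # close to 0.02, unless the key correlates with hair colour
```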
@ John, @ Scott: You’re still doing something odd here. As has been mentioned earlier in the comments, you’ve imagined a mind-reading superintelligence … except that it doesn’t get to see the internal random string.
Look, this should be pretty simple. The phrase “worst case” has a pretty clear layman’s meaning, and there’s no reason we need to depart from it.
You’re going to get your string of N bits. You need to write an algorithm to find the 1s. If your algorithm ever gives the wrong answer, we’re going to shoot you in the head with a gun and you die. I can write a deterministic algorithm that will do this in at most n/4+1 steps. So we’ll run it on a computer that will execute at most n/4+1 queries of the input string, and otherwise just halt (with some fixed answer). We can run this trillions of times, and I’m never getting shot in the head.
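For concreteness, here is one deterministic algorithm with exactly that guarantee (my sketch, not necessarily the one Don has in mind): query the first n/4+1 positions. If the ones were in the left half, they cannot all fit in the n/4-1 left-half positions you skipped, so seeing only zeros proves they are in the right half.

```
def deterministic_find_half(bits):
    """Promise: exactly n/4 ones, all in one half. Returns 'left' or 'right'
    after at most n/4 + 1 queries, on every input satisfying the promise."""
    n = len(bits)
    for i in range(n // 4 + 1):      # fixed, predictable query positions
        if bits[i] == 1:
            return "left"
    return "right"                   # n/4 + 1 zeros seen: the ones must be on the right

assert deterministic_find_half([0, 0, 1, 1, 0, 0, 0, 0]) == "left"
assert deterministic_find_half([0, 0, 0, 0, 1, 1, 0, 0]) == "right"
```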
Now, you have a proposal. You need one additional thing: a source of random bits, as an additional input to your new algorithm. Fine, granted. Now we’re going to point the gun at your head, and run your algorithm trillions of times (against random inputs). I was only able to write a deterministic algorithm; you have the ability to write a randomized algorithm. Apparently you think this gives you more power.
Now then, the important question: are you willing to run your new algorithm on a special computer that halts after fewer than n/4+1 queries of the input string? Do you have confidence that, in the worst case, your algorithm will never need more than, say, n/4 queries?
No? Then stop making false comparisons between the deterministic and the randomized versions.
To look at it one more time … Scott originally said: “Suppose you’re given an n-bit string, and you’re promised that exactly n/4 of the bits are 1, and they’re either all in the left half of the string or all in the right half.”
So we have a whole set of deterministic algorithms for solving the problem over here, and a whole set of randomized algorithms for solving the same problem. Take the best deterministic algorithm, and the best randomized algorithm.
Some people want to claim that the best randomized algorithm is “provably better”. Really? Better in what way?
Is it better in the worst case? No, in the very worst case, any algorithm (randomized or not) is going to need to look at n/4+1 bits to get the correct answer. Even worse! The randomized algorithms people were suggesting, with low average complexity, will, in the very worst case, need to look at more than n/4+1 bits, just because 3n/4 of the bits are 0 and the algorithm might get very, very unlucky.
OK, so randomized algorithms are clearly not better in the worst case. What about the average case? To begin with, nobody here has done any average-case analysis. But I challenge any of you to prove that every deterministic algorithm on this problem is necessarily worse, on average, than some (one?) randomized algorithm. I don’t believe that is the case.
So what do we have left? You had to invent a bizarre scenario, supposing that the environment is an adversarial superintelligence who can perfectly read all of your mind except bits designated ‘random’, as Eliezer says, in order to find a situation where the randomized algorithm is provably better. OK, that proof works, but why is that scenario at all interesting to any real-world application? The real world is never actually in that situation, so it’s highly misleading to use it as a basis for concluding that randomized algorithms are “provably better”.
No, what you need to do is argue that the pattern of queries used by a (or any?) deterministic algorithm, is more likely to be anti-correlated with where the 1 bits are in the environment’s inputs, than the pattern used by the randomized algorithm. In other words, it seems you have some priors on the environment, that the inputs are not uniformly distributed, nor chosen with any reasonable distribution, but are in fact negatively correlated with the deterministic algorithm’s choices. And the conclusion, then, would be that the average case performance of the deterministic algorithm is actually worse than the average computed assuming a uniform distribution of inputs.
Now, this does happen sometimes. If you implement a bubble sort, it’s not unreasonable to guess that the algorithm might be given a list sorted in reverse order, much more often than picking from a random distribution would suggest.
And similarly, if the n-bits algorithm starts looking at bit #1, then #2, etc. … well, it isn’t at all unreasonable to suppose that around half the inputs will have all the 1 bits in the right half of the string, so the naive algorithm will be forced to exhibit worst-case performance (n/4+1 bits examined) far more often than perhaps necessary.
But this is an argument about average case performance of a particular deterministic algorithm. Especially given some insight into what inputs the environment is likely to provide.
This has not been an argument about worst case performance of all deterministic algorithms vs. (all? any? one?) randomized algorithm.
Which is what other commenters have been incorrectly asserting from the beginning, that you can “add power” to an algorithm by “adding randomness”.
Maybe you can, maybe you can’t. (I happen to think it highly unlikely.) But they sure haven’t shown it with these examples.
Don, you missed my comment saying you could bound the randomized algorithm in the same way, to n/4+1, by keeping track of where you pick: if you find n/4+1 0s in one half, you conclude the 1s are in the other half.
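Here is one way to get that cap, as a sketch (my variant of Will’s idea, restricted to the left half so the bookkeeping stays simple): query the left half’s positions in a random order, stopping at the first 1 or after n/4+1 zeros.

```
import random

def randomized_find_half(bits, rng=random):
    """Query left-half positions in random order. Stops at the first 1 ('left')
    or after n/4 + 1 zeros (only n/4 zeros fit in the left half when the ones
    are there, so the answer must be 'right'). Worst case n/4 + 1 queries, like
    the deterministic scan, but the query positions are unpredictable to
    anything that cannot see the rng."""
    n = len(bits)
    positions = list(range(n // 2))
    rng.shuffle(positions)
    zeros_seen = 0
    for p in positions:
        if bits[p] == 1:
            return "left"
        zeros_seen += 1
        if zeros_seen == n // 4 + 1:
            return "right"
    return "right"

assert randomized_find_half([0, 0, 1, 1, 0, 0, 0, 0], random.Random(0)) == "left"
assert randomized_find_half([0, 0, 0, 0, 1, 1, 0, 0], random.Random(0)) == "right"
```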
I wouldn’t say that a randomized algorithm is better per se, just more generally useful than a particular deterministic one. You don’t have to worry about the types of inputs given to it.
This case matters because I am a software creator and I want to reuse my software components. In most cases I don’t care about performance of every little sub component too much. Sure it is not “optimal”, but me spending time on optimizing a process that only happens once is not worth it in the real world!
Obsessing about optimization is not always the way to win.
Don, if n is big enough then sure, I’m more than willing to play your game (assuming I get something out of it if I win).
My randomised algorithm will get the right answer within 4 tries in expectation. If I’m allowed to sample a thousand bits, then the probability that it won’t have found the answer by then is .75^1000, which is something like 10^-125. And this applies even if you’re allowed to choose the string after looking at my algorithm.
If you play the game this way with your deterministic algorithm, you will always need to sample n/4 + 1 bits; I can put the ones where you’ll never find them. If you don’t like this game, fine, don’t do computational complexity, but there’s nothing wrong with Scott’s argument.
Yes, we assume that you can’t predict what random bits I will generate before I generate them, but that’s pretty much what random means, so I don’t see why it’s such a big deal.
@ Will: Yes, you’re right. You can make a randomized algorithm that has the same worst-case performance as the deterministic one. (It may have slightly impaired average case performance compared to the algorithm you folks had been discussing previously, but that’s a tradeoff one can make.) My only point is that concluding that the randomized one is necessarily better, is far too strong a conclusion (given the evidence that has been presented in this thread so far).
But sure, you are correct that adding a random search is a cheap way to have good confidence that your algorithm isn’t accidentally negatively correlated with the inputs. So if you’re going to reuse the algorithm in a lot of contexts, with lots of different input distributions, then randomization can help you achieve average performance more often than (some kinds of) determinism, which might occasionally have the bad luck to settle into worst-case performance (instead of average) for some of those distributions.
But that’s not the same as saying that it has better worst-case complexity. (It’s actually saying that the randomized one has better average case complexity, for the distributions you’re concerned about.)
@ John: can you really not see the difference between “this is guaranteed to succeed”, vs. “this has only a tiny likelihood of failure”? Those aren’t the same statements.
“If you play the game this way”—but why would anyone want to play a game the way you’re describing? Why is that an interesting game to play, an interesting way to compare algorithms? It’s not about worst case in the real world, it’s not about average case in the real world. It’s about performance on a scenario that never occurs. Why judge algorithms on that basis?
As for predicting the random bits … Look, you can do whatever you want inside your algorithm. Your queries on the input bits are like sensors into an environment. Why can’t I place the bits after you ask for them? And then just move the 1 bits away from wherever you happened to decide to ask?
The point is, that you decided on a scenario that has zero relevance to the real world, and then did some math about that scenario, and thought that you learned something about the algorithms which is useful when applying them in the real world.
But you didn’t. Your math is irrelevant to how these things will perform in the real world. Because your scenario has nothing to do with any actual scenario that we see in deployment.
(Just as an example: you still haven’t acknowledged the difference between real random sources—like a quantum counter—vs. PRNGs—which are actually deterministic! Yet if I presented you with a “randomized algorithm” for the n-bit problem, which actually used a PRNG, I suspect you’d say “great! good job! good complexity”. Even though the actual real algorithm is deterministic, and goes against everything you’ve been ostensibly arguing during this whole thread. You need to understand that the real key is: expected (anti-)correlations between the deterministic choices of the algorithm, and the input data. PRNGs are sufficient to drop the expected (anti-)correlations low enough for us to be happy.)
@Scott: would it be fair to say that the P=BPP question boils down to whether one can make a “secure” random source such that the adversarial environment can’t predict it? Is that why the problem is equivalent to having good PRNG? (I’m a physicist, but who thinks he knows some of the terminology of complexity theory—apologies if I’ve misunderstood the terms.)
I can’t help but think that the reason why randomization appears to add information in the context of machine-learning classifiers for the dichotomous case is that the actual evaluation of the algorithms is empirical and, as such, uses finite data. Two classes of non-demonic error always exist in the finite case: measurement error and sampling error.
Multiattribute classifiers use individual measurements from many attributes. No empirical measurement is made without error; the more attributes that are measured, the more measurement error intrudes.
No finite sample is an exact sample of the entire population. To the measurement error for each attribute, we must add sampling error. Random selection of individuals for training-set samples, and for test-set samples, leads to data sets that are approximations of the sample. The sample is an approximation of the population.
Indeed, the entire finite population is but a sample of what the population could be; given time, populations change.
So the generalizable performance (say, accuracy) of AI algorithms is hard to nail down precisely, unless the algorithm is a perfect classifier (fusion = 0 → planets vs. fusion = 1 → suns, for example, at a specific time).
I contend, therefore, that some of the ‘improvement’ observed due to randomization is actually overfit of the prediction model to the finite sample.
I’m in the field of bioinformatics, and in biology, genetics, genomics, proteomics, medicine, we always admit to measurement error.
“No deterministic algorithm can do this, not even if you try to simulate samples by using good PRNGs, because there are ‘orderings’ of the world’s population that would fool your algorithm.”
So don’t use those orderings. That randomness isn’t buying you power. It’s acting as though you know nothing about the distribution of red-headed people. If you do, in fact, know something about the distribution of red-headed people (i.e. there are more red-headed people than average in Ireland and fewer in Africa, therefore make sure each sample draws proportionally from Ireland, Africa, and non-Ireland-non-Africa), then you can exploit that to make your prediction more reliable.
It can be handy even if it’s not adversarial, and even if it is equally as smart as you are but not smarter. Suppose you’re playing a game with payoffs (A,A) = (0,0), (A,B) = (1,1), (B,A) = (1,1), (B,B) = (0,0) against a clone of yourself (and you can’t talk to each other); the mixed strategy where you pick A half of the time and B half of the time wins half of the time, but either pure strategy always loses. This even though the other player isn’t adversarial here—in fact, his utility function is exactly the same as yours.
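A quick numerical check of that example (my sketch): any pure strategy, mirrored by the clone, always matches and scores 0, while independent 50/50 mixing mismatches, and therefore wins, half the time.

```
from itertools import product

def payoff(a, b):
    # Payoff to each player: 1 if the two choices differ, 0 if they match.
    return 1 if a != b else 0

def expected_payoff(strategy):
    """strategy maps each action to the probability that the (cloned) player picks it;
    both players draw independently from the same strategy."""
    return sum(strategy[a] * strategy[b] * payoff(a, b) for a, b in product("AB", repeat=2))

print(expected_payoff({"A": 1.0, "B": 0.0}))  # pure A vs. its clone: 0.0
print(expected_payoff({"A": 0.5, "B": 0.5}))  # 50/50 mix vs. its clone: 0.5
```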
Sometimes the environment really is adversarial, though.
Regular expressions as implemented in many programming languages work fine for almost all inputs, but with terrible upper bounds. This is why libraries like re2 are needed if you want to let anyone on the Internet use regular expressions in your search engine.
Another example that comes to mind is that quicksort runs in n·log n expected time if the input is randomly shuffled beforehand.
The Internet can be pretty scary, but it is still very much short of an adversarial superintelligence. Sure, sometimes people will want to bring your website down, and sometimes it could be an organized effort, but even that is nothing like a genuine worst case.
OK, granted, in the case of general regex we’re talking about upper bounds that are exponential in the worst case, and it’s not very difficult to come up with really bad regex cases, so in that case a single query could cause you major issues.
That doesn’t necessarily mean you should switch to a library that avoids those worst cases, though. You could simply reject the kinds of regexes that could potentially cause those issues, or run your regex matcher with a timeout and have the search fail whenever the timer expires.
I wouldn’t mind knowing the reason I was downvoted here.
Seems like noise to me. Only one person cared enough about what you wrote to vote, and they disliked your post. Upvoted back to zero, since the comments seem pretty reasonable to me.
Thanks; it didn’t occur to me to think of it statistically, although it seems obvious now.
That said, individual downvotes / upvotes still have reasons behind them; it’s just that you can’t generally expect to find out what they are except when they come in significant clusters.
If you have a search engine, you don’t want to allow people to search your whole corpus anyway. You usually want to search an index.
So let me see if I’ve got this straight.
Computer Scientists: For some problems, there are random algorithms with the property that they succeed with high probability for any possible input. No deterministic algorithm for these problems has this property. Therefore, random algorithms are superior.
Eliezer: But if we knew the probability distribution over possible inputs, we could create a deterministic algorithm with the property.
Computer Scientists: But we do not know the probability distribution over possible inputs.
Eliezer: Never say “I don’t know”! If you are in a state of ignorance, use an ignorance prior.
Now of course the key question is what sort of ignorance prior we should use. In Jaynes’s book, usually ignorance priors with some sort of nice invariance properties are used, which makes calculations simpler. For example if the input is a bit stream then we could assume that the bits are independent coinflips. However, in real life this does not correspond to a state of ignorance, but rather to a state of knowledge where we know that the bit stream does not contain predictable correlations. For example, the probability of 1000 zeros in a row according to this ignorance prior is 10^{-300}, which is not even remotely close to the intuitive probability.
The next step is to try to create an ignorance prior which somehow formalizes Occam’s razor. As anyone familiar with MIRI’s work on the problem of logical priors should know, this is more difficult than it sounds. Essentially, the best solution so far (see “Logical Induction”, Garrabrant et al. 2016) is to create a list of all of the ways the environment could contain predictable correlations (or, in the language of this post, resemble an “adversarial telepath”), and then trade them off against each other to get a final probability. One of the main downsides of the algorithm is that it is not possible to list all of the possible ways the environment could be correlated (since there are infinitely many), so you have to limit yourself to taking a feasible sample.
Now, it is worth noting that the above paper is concerned with an “environment” which is really just the truth-values of mathematical statements! It is hard to see how this environment resembles any sort of “adversarial telepath”. But if we want to maintain the ethos of this post, it seems that we are forced to this conclusion. Otherwise, an environment with logical uncertainty could constitute a counterexample to the claim that randomness never helps.
To be precise, let f be a mathematical formula with one free variable representing an integer, and suppose we are given access to an oracle which can tell us the truth-values of the statements f(1),...,f(N). The problem is to compute (up to a fixed accuracy) the proportion of statements which are true, with the restriction that we can only make n queries, where n << N. Monte Carlo methods succeed with failure probability exponential in n, regardless of what f is.
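(For reference, the bound behind that claim is the standard Hoeffding inequality: if p is the true fraction of true statements among f(1),…,f(N) and p̂ is the fraction among n statements sampled uniformly at random with replacement, then Pr[|p̂ - p| ≥ ε] ≤ 2·exp(-2nε^2), so at any fixed accuracy ε the failure probability decays exponentially in n, regardless of f.)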
Now suppose that f is determined by choosing m bits randomly, where m << n, and interpreting them as a mathematical formula (throwing out the results and trying again if it is impossible to do so). Then if the minimum failure probability is nonzero, it is exponential in m, not n, and therefore bigger than the failure probability for Monte Carlo methods. However, any algorithm which can be encoded in less than ~m bits fails with nonzero probability for diagonalization reasons.
In fact, the diagonalization failure is one of the least important ones, the main point is that you just don’t know enough about the environment to justify writing any particular algorithm. Any deterministically chosen sequence has a good chance of being correlated with the environment, just because the environment is math and math things are often correlated. Now, we can call this an “adversarial telepath” if we want, but it seems to occur often enough in real life that this designation hardly seems to “carve reality at its joints”.
TL;DR: If even math can be called an “adversarial telepath”, the term seems to have lost all its meaning.