Can noise have power?
One of the most interesting debates on Less Wrong, and one that seems like it should be definitively resolvable, is the exchange between Eliezer Yudkowsky, Scott Aaronson, and others on The Weighted Majority Algorithm. I’ll reprint the debate here in case anyone wants to comment further on it.
In that post, Eliezer argues that “noise hath no power” (read the post for details). Scott disagreed. He replied:
...Randomness provably never helps in average-case complexity (i.e., where you fix the probability distribution over inputs) -- since given any ensemble of strategies, by convexity there must be at least one deterministic strategy in the ensemble that does at least as well as the average.
On the other hand, if you care about the worst-case running time, then there are settings (such as query complexity) where randomness provably does help. For example, suppose you’re given n bits, you’re promised that either n/3 or 2n/3 of the bits are 1s, and your task is to decide which. Any deterministic strategy to solve this problem clearly requires looking at 2n/3 + 1 of the bits. On the other hand, a randomized sampling strategy only has to look at O(1) bits to succeed with high probability.
Whether randomness ever helps in worst-case polynomial-time computation is the P versus BPP question, which is in the same league as P versus NP. It’s conjectured that P=BPP (i.e., randomness never saves more than a polynomial). This is known to be true if really good pseudorandom generators exist, and such PRG’s can be constructed if certain problems that seem to require exponentially large circuits, really do require them (see this paper by Impagliazzo and Wigderson). But we don’t seem close to proving P=BPP unconditionally.
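(To make Scott’s sampling example concrete, here is a minimal Python sketch of the randomized strategy; the function name, the sample count of 200, and the 0.5 threshold are my own illustrative choices, not anything from the thread. By a Chernoff bound, the number of samples needed for any fixed error probability is a constant independent of n, whereas a deterministic procedure can be forced to read 2n/3 + 1 bits.)

```python
import random

def classify_by_sampling(bits, samples=200):
    """Guess whether 1/3 or 2/3 of `bits` are ones by sampling with replacement.

    Illustrative sketch: the sample count is a constant independent of len(bits),
    so the query cost is O(1), at the price of a small error probability.
    """
    ones = sum(bits[random.randrange(len(bits))] for _ in range(samples))
    return "2n/3 ones" if ones / samples > 0.5 else "n/3 ones"

if __name__ == "__main__":
    n = 300_000
    low = [1] * (n // 3) + [0] * (2 * n // 3)    # promise case: n/3 ones
    high = [1] * (2 * n // 3) + [0] * (n // 3)   # promise case: 2n/3 ones
    print(classify_by_sampling(low), "|", classify_by_sampling(high))
```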
Eliezer replied:
Scott, I don’t dispute what you say. I just suggest that the confusing term “in the worst case” be replaced by the more accurate phrase “supposing that the environment is an adversarial superintelligence who can perfectly read all of your mind except bits designated ‘random’”.
Scott replied:
I often tell people that theoretical computer science is basically mathematicized paranoia, and that this is the reason why Israelis so dominate the field. You’re absolutely right: we do typically assume the environment is an adversarial superintelligence. But that’s not because we literally think it is one, it’s because we don’t presume to know which distribution over inputs the environment is going to throw at us. (That is, we lack the self-confidence to impose any particular prior on the inputs.) We do often assume that, if we generate random bits ourselves, then the environment isn’t going to magically take those bits into account when deciding which input to throw at us. (Indeed, if we like, we can easily generate the random bits after seeing the input—not that it should make a difference.)
Average-case analysis is also well-established and used a great deal. But in those cases where you can solve a problem without having to assume a particular distribution over inputs, why complicate things unnecessarily by making such an assumption? Who needs the risk?
And later added:
...Note that I also enthusiastically belong to a “derandomize things” crowd! The difference is, I think derandomizing is hard work (sometimes possible and sometimes not), since I’m unwilling to treat the randomness of the problems the world throws at me on the same footing as randomness I generate myself in the course of solving those problems. (For those watching at home tonight, I hope the differences are now reasonably clear...)
Eliezer replied:
I certainly don’t say “it’s not hard work”, and the environmental probability distribution should not look like the probability distribution you have over your random numbers—it should contain correlations and structure. But once you know what your probability distribution is, then you should do your work relative to that, rather than assuming “worst case”. Optimizing for the worst case in environments that aren’t actually adversarial, makes even less sense than assuming the environment is as random and unstructured as thermal noise.
I would defend the following sort of statement: While often it’s not worth the computing power to take advantage of all the believed-in regularity of your probability distribution over the environment, any environment that you can’t get away with treating as effectively random, probably has enough structure to be worth exploiting instead of randomizing.
(This isn’t based on career experience, it’s how I would state my expectation given my prior theory.)
Scott replied:
> “once you know what your probability distribution is...”
I’d merely stress that that’s an enormous “once.” When you’re writing a program (which, yes, I used to do), normally you have only the foggiest idea of what a typical input is going to be, yet you want the program to work anyway. This is not just a hypothetical worry, or something limited to cryptography: people have actually run into strange problems using pseudorandom generators for Monte Carlo simulations and hashing (see here for example, or Knuth vol 2).
Even so, intuition suggests it should be possible to design PRG’s that defeat anything the world is likely to throw at them. I share that intuition; it’s the basis for the (yet-unproved) P=BPP conjecture.
“Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin.”—von Neumann
And that’s where the debate drops off, at least between Eliezer and Scott, at least on that thread.
I’m an expert in a neighboring field: numerical optimization. I’ve seen lots of evidence for Jaynes’s impression that for any algorithm that uses randomness, one can find a deterministic technique (which takes thought) that accomplishes the same goal better. (For example, comparing genetic algorithms, simulated annealing, and tabu search, the last has a deterministic memory mechanism that gives it the ability to do both diversification and intensification, with much more control than either of the other two.) Random methods are employed frequently because to do otherwise would require thought.
As for the debate, to me it looks like it was over terminology. To illustrate, let me label three different cases: the ‘benign’ case, where the environment is assumed to be dumb (i.e. maxent priors are reasonable), the ‘adversary’ case, where the environment is assumed to be an agent that knows your algorithm but not your private source of randomness, and the ‘omega’ case, where the environment is assumed to be an agent that knows your algorithm and your private source of randomness.*
Eliezer thinks the phrase ‘worst case analysis’ should refer to the ‘omega’ case. Scott thinks that the ‘adversary’ case is worth doing theoretical analysis on. The first is a preference that I agree with,** and the second seems reasonable. I think Silas did a good job of summarizing the power that randomness does have:
*There’s also a subtlety about solving the problem ‘with high probability.’ For a deterministic algorithm to have a chance at doing that, it needs to be the benign case; otherwise, the adversary decides that you fail if you left them any opening. For a randomized algorithm, the benign and adversary cases are equivalent.
**One of the things that Scott linked—when Monte Carlo goes wrong—is something that shows up a lot, and there’s a whole industry built around generating random numbers. For any randomized algorithm, the real worst case is that you’ve got a PRNG that hates you, unless you’ve paid to get your bits from a source that’s genuinely random, and if omega was the case that came to mind when people said ‘worst case,’ rather than adversary, this would have been more obvious. (vN got it, unsurprisingly, but it’s not clear that the median CS researcher did until they noticed the explosions.)
I suppose that there are really 6 different complexities. Let x be the random numbers you pick and let y be the input you get. Let f(x,y) be the time your algorithm takes on that instance. We can maximise or take the expected value for each of x and y, in either order.
1. E_x E_y f(x,y) (Average case complexity; Eliezer’s favourite.)
2. max_y E_x f(x,y) (Worst case complexity; Scott’s favourite.)
3. E_x max_y f(x,y) (Facing an adversary who knows what random numbers you are going to get.)
4. max_x E_y f(x,y) (You get an input at random; your random number generator tries to give you bad numbers but doesn’t know what case you’ll get.)
5. E_y max_x f(x,y) (You get an input at random; your random number generator gives you the worst numbers for that input.)
6. max_x max_y f(x,y) (‘Omega’ case complexity; your environment and your random number generator conspire to kill you.)
There’s not one best way to put a distribution on the possible inputs, so computer scientists who don’t want to do this can only look at 2, 3, and 6. Then problems that can be solved in poly time according to 2 are the ones in BPP, and those that can be solved in poly time according to 6 are the ones in P (you don’t use your RNG if it’s trying to kill you). I don’t know if the corresponding class for 3 has a name, but it doesn’t seem very interesting.
EDIT: You also don’t want to use your RNG if you’re being judged by 4 or 5, so these both correspond to the complexity class where you judge by average time and you don’t have an RNG. Then 1 corresponds to the complexity class where you’re judged by average time and you do have an RNG. These complexity classes are known to be the same since, as Scott says, given any ensemble of strategies, by convexity there must be at least one deterministic strategy in the ensemble that does at least as well as the average.
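(For concreteness, here is a small Python sketch, my own rather than Oscar’s, that evaluates all six quantities for a made-up cost function f on tiny made-up sets X and Y, taking the expectations under uniform distributions. It is only meant to make the six orderings of E and max tangible.)

```python
from statistics import mean

# Made-up cost function f(x, y): x is "our" random string, y is the input.
# The sets and values are arbitrary toy choices for illustration only.
X = [0, 1, 2]   # possible random strings
Y = [0, 1, 2]   # possible inputs

def f(x, y):
    return (x * y + x + 2 * y) % 5 + 1

print("1. E_x E_y f   :", mean(mean(f(x, y) for y in Y) for x in X))
print("2. max_y E_x f :", max(mean(f(x, y) for x in X) for y in Y))
print("3. E_x max_y f :", mean(max(f(x, y) for y in Y) for x in X))
print("4. max_x E_y f :", max(mean(f(x, y) for y in Y) for x in X))
print("5. E_y max_x f :", mean(max(f(x, y) for x in X) for y in Y))
print("6. max_x max_y :", max(max(f(x, y) for y in Y) for x in X))
```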
That means that randomness has power: it spares you the cost of thinking. Depending on the amount of thinking needed, this can be quite a substantial value.
I’d agree with that part.
It also offers a safeguard against errors in your thinking. If your thinking is wrong, you might choose an algorithm that is bad given the environment. Compare that to the probability of the environment just happening to match the randomized algorithm that badly.
Indeed. The grandparent is not entirely a condemnation.
“Worst case analysis” is a standard term of art in computer science, that shows up as early as second-semester programming, and Eliezer will be better understood if he uses the standard term in the standard way.
A computer scientist would not describe the “omega” case as random: if the inputs are correlated with the random number source in a way that is detectable by the algorithm, then by definition they’re not random.
Actually, in the context of randomized algorithms, I’ve always seen the term “worst case running time” refer to Oscar’s case 6, and “worst-case expected running time”—often somewhat misleadingly simplified to “expected running time”—refer to Oscar’s case 2.
A system that reliably behaves like the omega case is clearly not random. However, a random system such as case 2 may still occasionally behave like omega, with probability epsilon, and it is not at all unreasonable or uncommon to require your algorithm to work efficiently even in those rare cases. Thus, one might optimize a random system by modelling it as an omega system, and demanding that it works well enough even in that context.
I did not propose that worst case be interpreted as Omega or that it be given any nonstandard referent. I did suggest that “worst case” to describe the Adversary scenario is deceptive to readers, and we should ReplaceTheSymbolWithTheSubstance via a more descriptive phrase like “adversarial superintelligence that knows everything except the bits designated random”. This is what the phrase standardly means in computer science, but calling this “worst case analysis” seems to me deceptive, especially if we’re trying to conduct a meta-ish debate about the benefits of randomization, rather than talking about some particular algorithm.
Agreed.
Right. But I want to repeat the objection here that we often use pseudorandomness instead of actual randomness, and then the real worst case is that we’ve gotten a cursed seed. Somewhat less practically, in situations where a real adversary may have access to our hardware, we may have to assume that they can read (or write to!) our RNG.
I think this is a practical scenario in cryptography where your threat model is state-level actors.
If your adversary can read or write bits in your hardware, then what is the purpose of using cryptography?
They may only be able to access your hardware in limited ways. For example, if a hardware “RNG” actually outputs 1,2,3,… encrypted with some key known only to the NSA, that’s essentially totally undetectable. But if instead they have the hardware send out extra information over the internet, sooner or later someone will notice and the game will be up.
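(A minimal sketch of the kind of construction being described, with a keyed hash standing in for whatever a real backdoored generator would use; the key and block size are invented for illustration. To anyone without the key the stream looks random; to the keyholder every block is predictable.)

```python
import hashlib

SECRET_KEY = b"known-only-to-the-adversary"   # hypothetical key baked into the hardware

def backdoored_rng_block(i: int) -> bytes:
    """'Random' block number i is just a keyed hash of the counter i.
    Without the key the stream passes ordinary statistical tests;
    with the key, every block is exactly predictable."""
    return hashlib.sha256(SECRET_KEY + i.to_bytes(8, "big")).digest()

stream = b"".join(backdoored_rng_block(i) for i in range(4))   # what the device emits
```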
How does the NSA sync with your “RNG” if no information is exchanged?
But anyway, if you reasonably believe that your RNG may have been compromised, then you just don’t use it.
They don’t need to sync for it to be a serious weakness in a cryptosystem. If a system using Khoth’s PRNG sends out a billion encrypted messages in its lifetime, then an attacker with the PRNG key needs less than 2^30 tries to decrypt a message sent at an unknown point in that sequence—a large number, but more than manageable when you consider that a PRNG with a period of 2^80 would be considered weak in the crypto world.
Agreed.
Side channel attacks on hardware are not rare. For example, an adversary might have a way of measuring the power consumption of your CPU as it does RNG calculations. This is not quite the ability to “read or write bits in… hardware”, but it is a viable attack to gain information about your, ahem, random numbers.
Sure, but at this point they can also gain information on your keys or the data you wish to encrypt.
Not necessarily. Think wider, not only PCs use encrypted communications. Consider a router, for example, or a remote sensor.
Still, if they can compromise the RNG state in the router/sensor/whatever, they could probably compromise its CPU and/or RAM.
That’s not self-evident to me. Passively observing power consumption is much easier than, say, getting inside an SoC in tamper-resistant packaging.
Sometimes you think you’re playing Tetris, when you’re actually playing Bastet.
Sometimes. And sometimes you do not (or cannot) understand the environment well enough to commit to a deterministic optimizer, because of a lack of data rather than a lack of thought.
There is another, more charitable justification: sometimes you’re using such an approach to solve a problem you genuinely don’t know how to solve yourself, and the clever tricks you use may unwittingly shut off valuable areas of the search space.
When algorithms have bad worst cases, these cases are often the inputs with a particular structure. For example, if you use quicksort without randomising your pivots, then you have an average-case complexity of n log n but a worst-case complexity of n^2. The worst inputs are the ones where the list is already partially sorted; that “partially sorted” is the kind of structure I’m talking about.
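(A quick illustration of that quicksort claim, my sketch rather than part of the comment: with a fixed first-element pivot, an already-sorted list forces quadratic work, while a randomised pivot keeps the cost near n log n.)

```python
import random

def quicksort(xs, pick_pivot):
    """Sort a copy of xs, counting elements examined during partitioning
    as a rough proxy for the number of comparisons."""
    work = 0

    def qs(lst):
        nonlocal work
        if len(lst) <= 1:
            return lst
        pivot = pick_pivot(lst)
        work += len(lst)
        return (qs([x for x in lst if x < pivot])
                + [x for x in lst if x == pivot]
                + qs([x for x in lst if x > pivot]))

    return qs(list(xs)), work

data = list(range(800))   # already sorted: the bad case for a fixed first-element pivot
_, fixed_cost = quicksort(data, lambda lst: lst[0])
_, random_cost = quicksort(data, lambda lst: random.choice(lst))
print("fixed first-element pivot:", fixed_cost)   # ~ n^2 / 2
print("randomised pivot:        ", random_cost)   # ~ n log n
```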
If we expected to see inputs with some kind of structure (but we didn’t know which kind of structure) then we might well be worried that we would get inputs with precisely the structure that would hit our worst case.
So suppose we do indeed have a prior over our inputs, but that this is a Solomonoff prior. Among all the inputs of length n, we expect to see each one with probability proportional to 2^(-K), where K is its Kolmogorov complexity. Most strings of length n have complexity about n and so probability ~ 2^(-n). However, our worst-case string has complexity at most m + log(n), where m is the length of our algorithm’s code, by virtue of its description as ‘the worst case for this algorithm among the strings of length n’. So we will anticipate getting the worst-case input with probability ~ 2^(-m)/n, which means that if our worst-case complexity is g(n), then our average-case complexity is at least on the order of g(n)/n.
Under a Solomonoff prior the worst cases and average cases are the same! (up to a polynomial)
EDIT: So if there are cases where randomisation gives an exponentially better worst case complexity, then we won’t be able to derandomise them under the Solomonoff prior.
reference
Wow, they say exactly the same things I do, right down to using quicksort as an example.
I assumed you were aware of that paper. If you came up with that result independently, then kudos to you ;)
It’s a pretty canonical example actually. I immediately thought of quicksort’s worst case vs average case when I got to the end of the debate. It’s taught in many intro to algorithms courses. (But I sure didn’t know about the Kolmogorov complexity part!)
This is quite possibly the best LW comment I’ve ever read. An excellent point with a really concise explanation. In fact it is one of the most interesting points I’ve seen within Kolmogorov complexity too. Well done on independently deriving the result!
Thanks!
Using a randomized algorithm doesn’t let you get away with making no assumption whatsoever about the distribution of inputs. You still have to make the assumption that your inputs don’t look like the output of your random number generator. This is not always going to be true.
In the first Computer Rock-Paper-Scissors programming competition (in which, in order to win, programs had to both maximally exploit a number of “weak”, predictable opponents and not be exploited themselves by the other entries), one of the “cheater” entries worked by figuring out the state of the random number generator and then making the choice that would beat the random number generator’s next “throw”. So you can’t always assume that the environment can’t predict your random number generator.
(Another cheater entry used anthropic computing; it used the standard library routine “fork()” to turn the one currently running copy of the tournament program into three, made a different choice in each copy of the program, then killed the copies of the tournament program in which the cheater didn’t win. Video game players use a similar technique all the time: reload an old save whenever you make a bad decision.)
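(A toy reconstruction of that first cheat, entirely my own, with a deliberately tiny 16-bit LCG standing in for the tournament’s actual generator: after watching a dozen throws, the exploiter brute-forces the generator state and then beats every subsequent “random” throw.)

```python
# Toy rock-paper-scissors bot driven by a tiny 16-bit LCG, and an exploiter that
# recovers the LCG state from observed throws by brute force. Illustrative only.
BEATS = {0: 1, 1: 2, 2: 0}   # the throw that beats each throw (0=rock, 1=paper, 2=scissors)

def lcg_next(state):
    return (25173 * state + 13849) % 65536

class RandomBot:
    def __init__(self, seed):
        self.state = seed

    def throw(self):
        self.state = lcg_next(self.state)
        return self.state % 3

def recover_state(observed):
    """Find a 16-bit LCG state consistent with the observed throws.
    With a dozen observations the consistent state is almost certainly unique."""
    for s in range(65536):
        t, ok = s, True
        for obs in observed:
            t = lcg_next(t)
            if t % 3 != obs:
                ok = False
                break
        if ok:
            return t          # the bot's state after the observed throws
    return None

bot = RandomBot(seed=4242)
observed = [bot.throw() for _ in range(12)]   # watch a few "random" throws
state = recover_state(observed)

wins = 0
for _ in range(20):
    state = lcg_next(state)                   # predict the bot's next throw...
    my_move = BEATS[state % 3]                # ...and play the move that beats it
    wins += my_move == BEATS[bot.throw()]
print(f"exploiter wins {wins}/20")            # typically 20/20 once the state is recovered
```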
It’s the frequentist vs. Bayesian debate again. Frequentist methods, like worst-case analyses, give hard guarantees no matter what the probability distribution over the environment is. Bayesian methods, like average-case analyses, require a probability distribution over the environment and may mess up horribly if that distribution is wrong.
I’m not a mathematician, but something about this tripped me up. It doesn’t seem fair to compare the cost of succeeding with probability 1 (the 2n/3+1 deterministic strategy) to the cost of succeeding with probability (1 - epsilon). It shouldn’t be surprising that perfect accuracy is more expensive than less-than-perfect accuracy.
Beyond that I’m having trouble following what they’re actually disagreeing about, but it doesn’t sound like they disagree about the mathematical properties of anything. I’m smelling a dissolvable question here.
“Randomization cannot improve matters when dealing with average (random?) inputs.”
“Randomization can sometimes efficiently improve matters when dealing with a hostile, but not omniscient, input-producer.” (colloquial worst case, e.g. encryption)
“Randomization can improve matters when dealing with an omniscient hostile input-producer.” (true worst case, like Omega putting a million dollars in whichever box it predicts you will not choose)
Assuming I’m following the argument correctly, I don’t think Eliezer or Scott would disagree with any of the above three.
From the linked post. “Doing better” is ambiguous and could be interpreted as any of the above three. Are they really disagreeing about mathematical properties, or about which environments are worth worrying about?
If you have an intelligent adversary, and you leave a single opening which maxent would choose epsilon% of the time, the adversary will choose it 100% of the time. Randomization allows you to leave lots of openings, each with small probability, so they can only choose the right one epsilon% of the time.
The comparison is not quite fair—sometimes it makes sense to think of the adversary having access to your RNG, in which case you need to leave no openings—but there are cases where this is useful. (This shows up as mixed strategies in game theory, for example.)
[Edit] As an illustration, let’s work through the case with n=3. To simplify, suppose that you identify all the bits you want to probe, and then you get the results.
If you probe all 3 bits, then you get the right answer with certainty always, and the adversary can do nothing.
If you probe 2 bits, then you see two ones, a one and a zero, or two zeros. In the first case, you can be certain there are two ones; in the last case, you can be certain there is only one one. In the middle case, you don’t know which is true. If your algorithm is deterministic, you preselect which bits you’ll probe (say, the first and second bit), and so the worst case is that you get a one and a zero; if you’re playing against an opponent, they can always choose that move. If your algorithm is random, then the opponent isn’t sure which bits you’ll probe, and so a third of the time you’ll get two ones or two zeros. The adversary can reduce the deterministic algorithm’s effectiveness from 2⁄3 to 1⁄2 (a blind guess on the forced ambiguous case), while the randomized algorithm still succeeds 2⁄3 of the time.
If you probe 1 bit, then you either see 1 or 0. In the first case, under maxent, there’s a 2/3rds chance there are two ones; in the second case, there’s a 2/3rds chance there is one one. If you employ that deterministic algorithm and always pick the first bit against an adversary, you will be wrong every time! The adversary can reduce the effectiveness from 2⁄3 to 0.
If you probe 0 bits, then you see nothing, and are right 1⁄2 the time.
What the randomness is doing at each step is counteracting the adversary. You can also see that the deterministic algorithm separates into two cases: ‘no better than doing nothing’ and ’100% correct’, which is typical.
[Edit 2] What about the case where you sample a bit, then decide whether or not to keep sampling? The work is already done; just read the above in reverse. If 2/3rds probability is good enough, then the randomized algorithm can stop after one bit, but the deterministic algorithm needs all three. If you want certainty, the randomized algorithm uses, on average, only 2.67 sampled bits, because a third of the time it can stop after two.
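(A small Monte Carlo check of those numbers, my sketch rather than part of the comment: it pits a deterministic probe of the first two bits and a randomised two-bit probe against an adversary who knows the strategy, with a blind guess on the ambiguous outcome in both cases.)

```python
import itertools
import random

# The six strings allowed by the promise for n = 3: exactly one 1 or exactly two 1s.
PROMISE = [s for s in itertools.product((0, 1), repeat=3) if sum(s) in (1, 2)]

def answer(seen):
    """Decision rule after seeing two probed bits: conclusive on 11 or 00,
    otherwise a blind guess."""
    if sum(seen) == 2:
        return 2                       # must be the two-ones case
    if sum(seen) == 0:
        return 1                       # must be the one-one case
    return random.choice((1, 2))       # ambiguous: coin flip

def success_rate(pick_positions, s, trials=50_000):
    wins = 0
    for _ in range(trials):
        i, j = pick_positions()
        wins += answer((s[i], s[j])) == sum(s)
    return wins / trials

deterministic = lambda: (0, 1)                      # always probe the first two bits
randomised = lambda: random.sample(range(3), 2)     # probe two uniformly random bits

# The adversary knows the strategy (but not the coin flips) and picks the worst string.
for name, strategy in (("deterministic", deterministic), ("randomised", randomised)):
    worst = min(success_rate(strategy, s) for s in PROMISE)
    print(f"{name:13s} worst-case success ~ {worst:.2f}")   # ~0.50 vs ~0.67
```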
One of my old CS teachers defended treating the environment as adversarial and as knowing your source code, because of hackers. See median-of-3 killers. (I’d link something, but besides a paper, I can’t find a nice link explaining what they are in a small amount of googling.)
I don’t see why Yudkowsky makes superintelligence a requirement for this.
Also, it doesn’t even have to be source code they have access to (which they could if it was open-source software anyway). There are such things as disassemblers and decompilers.
[Edit: removed implication that Yudkowsky thought source code was necessary]
Because often when we talk about ‘worst-case’ inputs, it would require something of this order to deliberately give you the worst-case, in theoretical CS, at least. I don’t think Eliezer would object at all to this kind of reasoning where there actually was a plausible possibility of an adversary involved. In fact, one focus of things like cryptography (or systems security?) (where this is assumed) is to structure things so the adversary has to solve as hard a problem as you can make it. Assuming worst-case input is like assuming that the hacker has to do no work to solve any of these problems, and automatically knows the inputs that will screw with your solution most.
Yep! The original article said that this was a perfectly good assumption and a perfectly good reason for randomization in cryptography, paper-scissors-rock, or any other scenario where there is an actual adversary, because it is perfectly reasonable to use randomness to prevent an opponent from being intelligent.
And what do you suggest assuming instead?
Anyway, most proofs of asymptotic security in cryptography are conditional on conjectures such as “P != NP” or “f is a one-way function”.
While this is an accurate characterization of what the term technically means, it does not really capture all the reasons why worst-case complexity is studied. I would say worst-case complexity is studied because
- It allows more general statements, since the statement does not have to be conditioned on an input distribution.
- For many problems, both efficiently solvable and “hard” ones, the average-case complexity is not really different from the worst case.
Note that complexity theory studies the complexity of problems, not really the complexity of specific algorithms. So while it is true that the naive implementation of quicksort has n log n complexity on average but n^2 complexity in the worst case, what complexity theory studies is the hardness of sorting, and that has complexity n log n in both the worst case and the average case (assuming the uniform distribution over inputs).
Many problems believed to be hard are provably equally hard in both the worst case and the average case. For example, a randomized algorithm that computes discrete logarithms efficiently on average (over random instances) yields one that computes them efficiently on all instances.
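(A sketch of the standard random self-reduction behind that claim, my illustration in a toy group small enough to brute-force; the modulus, generator, and the “answers only ~10% of the time” oracle are invented for demonstration. Re-randomising the target turns a solver that works on a constant fraction of random instances into one that works on every instance with high probability.)

```python
import random

# Toy group: the order-q subgroup of (Z/pZ)*, with p = 2q + 1 a safe prime.
p, q, g = 1019, 509, 4          # g = 4 generates the subgroup of order 509

def brute_force_dlog(h):
    """Stand-in 'oracle': exhaustive search, feasible only because the group is tiny."""
    x, y = 0, 1
    while y != h:
        x, y = x + 1, (y * g) % p
    return x

def avg_case_dlog(h):
    """A solver that only answers on an arbitrary ~10% of group elements,
    modelling 'efficient on a constant fraction of random instances'."""
    return brute_force_dlog(h) if h % 10 == 0 else None

def worst_case_dlog(h, tries=200):
    """Worst-case-to-average-case reduction: re-randomise the instance until the
    average-case solver answers, then shift the answer back."""
    for _ in range(tries):
        r = random.randrange(q)
        answer = avg_case_dlog((h * pow(g, r, p)) % p)   # uniformly random instance
        if answer is not None:
            return (answer - r) % q
    raise RuntimeError("unlucky: the average-case solver never answered")

h = pow(g, 321, p)               # an instance the average-case solver may refuse directly
assert worst_case_dlog(h) == 321
```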
This actually does not seem obvious, and I’m not sure it can be proved. The problem is that while it is true that there must exist some particular random string (indeed, we can assume most of them) that works for a given problem size, in the sense that the randomized algorithm when run with that one fixed string will have expected running time equal (to within a constant factor, say) to the original average, it is not easy to see how one can efficiently find such a string using a deterministic algorithm.
This is the same sort of problem one encounters when derandomizing BPP. It can be proved that if you have a BPP algorithm, then for every input size n, there exists one random string (in fact, all but an exponentially small fraction will work) that gives a deterministic algorithm that runs in time polynomial in n. However, only existence is known, and there is no known efficient way to find this string given n. So the proof only shows that BPP is contained in P/poly, not BPP = P.
relevant: https://news.ycombinator.com/item?id=8050405
Random Solutions are Often Good Enough
I read it as a trade-off between an average (expected) cost and a guaranteed bound on that cost in the worst possible case. Such a trade-off sounds “normal” to me and often occurs in practice.
Intuitively, if there is some uncertain structure in your data, you can try to exploit it, which leads to better average (expected) results but, if Mr. Murphy is in a particularly bad mood today, can also set you up for a major failure. Sometimes you care about the average cost, and you accept the risk of being “unlucky” in your data. But sometimes you care less about the average cost and more about the worst-case scenario. And then you will be interested in reducing the upper bound on your cost, e.g. through randomizing.