Suppose I have a binary function f, with a million input bits and one output bit. The function is uniformly randomly chosen from all such functions—i.e. for each of the $2^{1000000}$ possible inputs x, we flipped a coin to determine the output f(x) for that particular input.
Now, suppose I know f, and I know all but 50 of the input bits—i.e. I know 999950 of the input bits. How much information do I have about the output?
Answer: almost none. For almost all such functions, knowing 999950 input bits gives us $\sim \frac{1}{2^{50}}$ bits of information about the output. More generally, if the function has n input bits and we know all but k, then we have $o\!\left(\frac{1}{2^k}\right)$ bits of information about the output. (That’s “little o” notation; it’s like big O notation, but for things which are small rather than things which are large.) Our information drops off exponentially with the number of unknown bits.
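For a sense of scale in the k = 50 case (this is just arithmetic on the claim above, not an additional result):
$$\frac{1}{2^{50}} \approx 8.9 \times 10^{-16} \text{ bits},$$
so knowing 999950 of the million input bits still leaves us, for almost all such functions, essentially at the 50/50 prior on the output.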
Proof Sketch
With k input bits unknown, there are $2^k$ possible inputs. The output corresponding to each of those inputs is an independent coin flip, so we have $2^k$ independent coin flips. If m of those flips are 1, then we assign a probability of $\frac{m}{2^k}$ that the output will be 1.
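Here is a minimal sketch of this setup (illustrative only; n is kept small so the full $2^n$-entry lookup table fits in memory, and the helper names are mine):

```python
import itertools
import random

def random_function(n, rng):
    """A uniformly random boolean function on n input bits,
    stored as a lookup table with one coin flip per input."""
    return [rng.randint(0, 1) for _ in range(2 ** n)]

def posterior_p1(f, n, known):
    """P(output = 1) given the function f and a dict {bit_index: value}
    of known input bits; the unknown bits are uniform. Returns m / 2^k."""
    unknown = [i for i in range(n) if i not in known]
    total = ones = 0
    for combo in itertools.product([0, 1], repeat=len(unknown)):
        bits = dict(known)
        bits.update(zip(unknown, combo))
        # pack the n bits into an integer index into the lookup table
        x = sum(bits[i] << i for i in range(n))
        ones += f[x]
        total += 1
    return ones / total  # this is m / 2^k

rng = random.Random(0)
n, k = 16, 6
f = random_function(n, rng)
known = {i: rng.randint(0, 1) for i in range(k, n)}  # bits k..n-1 known, 0..k-1 unknown
print(posterior_p1(f, n, known))  # typically within ~0.1 of 0.5
```

With k = 6 unknown bits the printed probability typically lands within about 0.1 of 1/2; increasing k pulls it rapidly closer to 1/2, which is what the next steps quantify.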
As long as $2^k$ is large, the Law of Large Numbers will kick in, and very close to half of those flips will be 1 almost surely—i.e. $m \approx \frac{2^k}{2}$. The error in this approximation will (very quickly) converge to a normal distribution, and our probability that the output will be 1 converges to a normal distribution with mean $\frac{1}{2}$ and standard deviation $\frac{1}{2^{k/2}}$. So, the probability that the output will be 1 is roughly $\frac{1}{2} \pm \frac{1}{2^{k/2}}$.
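A quick numerical check of that spread (a minimal sketch; it samples the $2^k$ coin flips directly rather than building a full function):

```python
import math
import random

rng = random.Random(0)
trials = 2000
for k in range(4, 18, 2):
    N = 2 ** k  # number of possible completions of the k unknown bits
    # Each trial flips 2^k fair coins (one output per completion) and records
    # the fraction of 1s, i.e. m / 2^k.
    fractions = [bin(rng.getrandbits(N)).count("1") / N for _ in range(trials)]
    mean = sum(fractions) / trials
    sd = math.sqrt(sum((x - mean) ** 2 for x in fractions) / trials)
    # The spread shrinks like 1/2^(k/2): empirically the sd comes out to about
    # 0.5 / 2^(k/2), i.e. the same scaling up to a constant factor.
    print(f"k={k:2d}  sd={sd:.5f}  0.5/2^(k/2)={0.5 / 2 ** (k / 2):.5f}")
```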
We can then plug that into Shannon’s entropy formula. Our prior probability that the output bit is 1 is $\frac{1}{2}$, so we’re just interested in how much that $\pm\frac{1}{2^{k/2}}$ adjustment reduces the entropy. This works out to $o\!\left(\frac{1}{2^k}\right)$ bits.
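To spell out that last step (the expansion is standard; the explicit constants are a gloss on the argument above): writing our posterior probability as $\frac{1}{2} + \epsilon$, the binary entropy is
$$H\!\left(\tfrac{1}{2} + \epsilon\right) = 1 - \frac{2}{\ln 2}\,\epsilon^2 + O(\epsilon^4),$$
and since $\epsilon$ has standard deviation on the order of $\frac{1}{2^{k/2}}$, the typical size of $\epsilon^2$ is on the order of $\frac{1}{2^k}$, so the reduction in entropy, i.e. the information we have about the output, scales like $\frac{1}{2^k}$ bits.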
Why Is This Interesting?
One core idea of my work on abstraction is that noise very quickly wipes out almost all information; only some very-low-dimensional summary is relevant “far away”. This example shows that this sort of thing is not unusual, but rather “the default”: for almost all random functions, information drops off exponentially with the number of unknown bits. In a large system (i.e. a function with many inputs), ignorance of even just a few bits is enough to wipe out essentially-all information. That’s true even if we know the vast majority of the bits.
A good intuitive example of this is the “butterfly effect”: the flap of a butterfly’s wings could change the course of a future hurricane, because chaos. But there are an awful lot of butterflies in the world, and the hurricane’s path is some complicated function of all of their wing-flaps (and many other variables too). If we’re ignorant of even just a handful of these flaps, then almost all of our information about the hurricane’s path is probably wiped out. And in practice, we’re ignorant of almost all the flaps. This actually makes it much easier to perform Bayesian reasoning about the path of the hurricane: the vast majority of the information we have is basically irrelevant; we wouldn’t actually gain anything from accounting for the butterfly-wing-flaps which we do know.
o(1/2^k) doesn’t vary with n—are you saying that it doesn’t matter how big the input array is, the only determinant is the number of unknown bits, and the number of known bits is irrelevant? That would be quite interesting if so (though I have some question about how likely the function is to be truly random from an even distribution of such functions).
One can enumerate all such 3-bit functions (8 different inputs, each of which the function can map to 0 or 1, so 256 functions, one per output-bit-pattern over the 8 possible inputs). But this doesn’t seem to follow your formula—if you have 3 unknown bits, that should give 1/8 of a bit about the output, 2 unknown bits 1/4, and 1 unknown bit 1/2 a bit about the output. But in fact, the distribution of functions includes both 0 and 1 output for every input pattern, so you actually have no predictive power for the output if you have ANY unknown bits.
o(1/2^k) doesn’t vary with n—are you saying that it doesn’t matter how big the input array is, the only determinant is the number of unknown bits, and the number of known bits is irrelevant?
Yes, that’s correct.
But in fact, the distribution of functions includes both 0 and 1 output for every input pattern, so you actually have no predictive power for the output if you have ANY unknown bits.
The claim is for almost all functions when the number of inputs is large. (Actually what we need is for 2^(# of unknown bits) to be large in order for the law of large numbers to kick in.) Even in the case of 3 unknown bits, we have 256 possible functions, and only 18 of those have less than 1/4 1's or more than 3/4 1's among their output bits.
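A quick brute-force check of those counts (a minimal sketch; the average-information figure at the end is an extra computation, not a number claimed above):

```python
from itertools import product
from math import log2

def entropy(p):
    """Binary entropy in bits (0 at p = 0 or p = 1)."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# All 256 functions from 3 input bits to 1 output bit, as 8-bit output patterns.
functions = list(product([0, 1], repeat=8))

# The "lopsided" functions: less than 1/4 or more than 3/4 of outputs are 1.
lopsided = sum(1 for f in functions if sum(f) < 2 or sum(f) > 6)
print(lopsided)  # 18

# Average information about the output when all 3 input bits are unknown,
# averaged uniformly over functions: 1 - E[H(m/8)], where m = number of 1s.
avg_info = sum(1 - entropy(sum(f) / 8) for f in functions) / len(functions)
print(round(avg_info, 3))  # 0.098 bits
```

The 18 lopsided functions match the count above, and the roughly 0.098-bit average is a bit under the parent comment's 1/8-of-a-bit figure because of the constant factor in the entropy expansion, while showing the same 1/2^k scaling.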
Little o is just a tighter bound. I don’t know what you are referring to by your statement:
I’m not sure what context that link is assuming, but in an analysis context I typically see little o used in ways like e.g. “$f(x) = f(x_0) + \frac{df}{dx}\big|_{x_0}\,dx + o(dx^2)$”. The interpretation is that, as $dx$ goes to 0, the $o(dx^2)$ terms all fall to zero at least quadratically (i.e. there is some $C$ such that $C\,dx^2$ upper bounds the $o(dx^2)$ term once $dx$ is sufficiently small). Usually I see engineers and physicists using this sort of notation when taking linear or quadratic approximations, e.g. for designing numerical algorithms.