Here’s an argument that I think works in the limit where m → ∞. For very large m, each subset S ⊆ [n] of size s will occur many times as an S_i. Moreover, for each set of coordinate values x_S := (x_j)_{j∈S}, there will be many x_i such that S_i = S and (x_i)_S = x_S. Therefore, by the law of large numbers, for any subset S ⊆ [n] of size s and any set of coordinate values x_S, Bob can infer N(x_S, f) := #{x′ ∈ {0,1}^n : x′_S = x_S and f(x′) = 1}, i.e. the number of inputs x′ that agree with x_S at the coordinates indexed by S and satisfy f(x′) = 1. Additionally, the random variable Y contains no information about f beyond this set of statistics.
Given a function f : {0,1}^n → {0,1}, let F(f) denote the set of all functions f′ such that N(x_S, f) = N(x_S, f′) for every subset S ⊆ [n] of size s and every set of coordinate values x_S. The argument of the previous paragraph then implies that, upon observing Y, Bob can deduce the equivalence class F(f), but no other information about f.
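To make the statistics concrete, here is a small brute-force sketch in Python (the helper names stat_signature and same_class are mine, not part of the problem statement); two functions lie in the same class F(f) exactly when their signatures agree:

```python
from itertools import combinations, product

def stat_signature(f, n, s):
    """Compute N(x_S, f) for every size-s subset S of [n] and every assignment
    x_S to those coordinates, by brute force over all 2^n inputs.
    f is a callable taking an n-tuple of bits and returning 0 or 1."""
    sig = {}
    for S in combinations(range(n), s):            # all subsets S of size s
        for x_S in product((0, 1), repeat=s):      # all assignments to the coordinates in S
            count = 0
            for x in product((0, 1), repeat=n):    # all inputs x' in {0,1}^n
                if tuple(x[j] for j in S) == x_S and f(x) == 1:
                    count += 1
            sig[(S, x_S)] = count
    return sig

def same_class(f, g, n, s):
    """f and g are in the same equivalence class F(f) iff their signatures agree."""
    return stat_signature(f, n, s) == stat_signature(g, n, s)
```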
If we have a random variable X and we define another random variable Y as a deterministic function Y(X), then it is easy to show that the distributions of X which maximize the mutual information between Y and X are precisely those for which Y has a uniform distribution. Applying this to the above setup, we see that the distributions over f which achieve the channel capacity are those for which F(f) is uniformly distributed over all the values that it can take.
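To spell out the step: since Y is a deterministic function of X, we have H(Y ∣ X) = 0, so

I(X; Y) = H(Y) − H(Y ∣ X) = H(Y) ≤ log #{values Y can take},

with equality exactly when Y is uniform over its range, and any distribution of X that puts total mass 1/#{values} on each preimage of Y achieves this.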
If we suppose in addition that f has the maximum entropy distribution subject to this constraint, we find that:
P[f] ∝ 1 / #{f′ : N(x_S, f) = N(x_S, f′) for all S ⊆ [n] of size s and all x_S ∈ {0,1}^S} = 1/#F(f).
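To spell out the maximum-entropy step: the capacity-achieving condition fixes the marginal over classes to be uniform, say 1/K where K is the number of distinct classes, and maximising the entropy of f subject to that constraint makes f uniform within its class, so

P[f] = P[F(f)] · P[f ∣ F(f)] = (1/K) · (1/#F(f)) ∝ 1/#F(f),

which is the expression above.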
Intuitively, this expression says that the probability of a given function f is inversely proportional to the number of other functions f′ that have the same number of ones on every hyperplane defined by fixing some set of s coordinates x_S. This seems to correspond roughly to sensitivity: we intuitively expect there to be many such functions f′ when the number of ones f takes on most hyperplanes is approximately half of the total number of points on that hyperplane, and a function whose output is roughly half ones and half zeros on a given hyperplane is, roughly speaking, sensitive to the remaining unspecified coordinates.
It’s not obvious to me that the above expression for P[f] is feasible to compute in practice, but I think it is fairly naturally interpretable.
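For what it’s worth, the expression is at least directly computable by brute force for very small n. Here is a sketch (it reuses the hypothetical stat_signature helper above and enumerates all 2^(2^n) functions, so it is only practical for something like n ≤ 4):

```python
from collections import Counter
from itertools import product

def all_truth_tables(n):
    """Yield every Boolean function on {0,1}^n as a dict mapping input tuple -> output bit."""
    inputs = list(product((0, 1), repeat=n))
    for outputs in product((0, 1), repeat=len(inputs)):
        yield dict(zip(inputs, outputs))

def unnormalised_weights(n, s):
    """For every function f, a weight proportional to P[f] = const / #F(f),
    keyed by the (sorted) truth table of f."""
    sig_of = {}
    for f in all_truth_tables(n):
        # The class F(f) is determined by the signature of statistics N(x_S, f).
        sig = tuple(sorted(stat_signature(f.__getitem__, n, s).items()))
        sig_of[tuple(sorted(f.items()))] = sig
    class_size = Counter(sig_of.values())          # #F(f) for each class
    return {table: 1.0 / class_size[sig] for table, sig in sig_of.items()}

# Example: weights = unnormalised_weights(3, 1); the normalising constant is
# sum(weights.values()), which equals the number of distinct classes.
```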
The key reason the problem is tractable in the case m → ∞ is that the law of large numbers means the probabilities wash out and we get an essentially deterministic problem. In the case where m is finite, this won’t happen, and in particular I expect you’ll run into complications that arise from interaction terms, where the method for combining the information from two observations Y_i and Y_j with S_i ≠ S_j is not clean.
I expect you’re more likely to end up with a tractable solution if you rephrase the problem statement slightly so that the subsets of inputs over which outputs are aggregated in each observation (in the case above, these subsets are the hyperplanes defined by the coordinates x_S) are either disjoint or identical. (For example, taking all the S_i to be random but equal to each other would ensure this.) It strikes me that there might be a formulation like this that still captures the key features of the problem you’re interested in, but I’ve not thought about this in detail.
This answer clears the bar for at least some prize money to be paid out, though the amount will depend on how far other answers go by the deadline.
One thing which would make it stronger would be to provide a human-interpretable function for each equivalence class (so Alice can achieve the channel capacity by choosing among those functions).
The suggested variants of the problem are good ones, and good solutions to those variants would probably also qualify for prize money.