Optimal predictor schemes
I introduce the concept of an “optimal predictor scheme”, which differs from (quasi-)optimal predictors in that it depends on an additional parameter representing the amount of computing resources the predictor is allowed to use. For a certain flavor of optimal predictor scheme, I prove existence for arbitrary distributional decision problems.
Results
It is convenient to think of the concepts of “optimal predictor” and “quasi-optimal predictor” as special cases of the more general -optimal predictors corresponding to different choices of the “error space” .
Definition 1
Let r be a positive integer. An error space of rank r is a set of bounded functions from to s.t.
(i) If then .
(ii) If and then .
(iii) Given , if then .
(iv) There is a polynomial s.t. .
Example 1.1
is the set of functions s.t. .
Example 1.2
is the set of negligible functions .
Example 1.3
is the set of bounded functions s.t. for any , if
then
It is slightly non-obvious that condition (iii) holds in this example. To see this, note that the expression inside the limit can be regarded as where is a certain probability measure over . implies that for any , since . For any , we have , therefore .
Definition 2
Consider an error space of rank 1 and a distributional decision problem. A -optimal predictor for is a family of polynomial size Boolean circuits s.t. for any family of polynomial size Boolean circuits we have
where .
In particular, optimal predictors are -optimal predictors and quasi-optimal predictors are -optimal predictors.
Definition 3
Consider an error space of rank 2 and a distributional decision problem. A -optimal predictor scheme for is a family of Boolean circuits s.t.
(i) for some polynomial .
(ii) for any family of Boolean circuits that satisfies (i) we have
where .
It is straightforward to generalize Theorems 1-8 about quasi-optimal predictors to -optimal predictors and -optimal predictor schemes. The versions for optimal predictor schemes are stated in Appendix A without proof. The only theorem whose proof requires a slightly non-obvious tweak is Theorem 1. The amended proof is given in Appendix B.
The main technical novelty of this post is the following existence theorem.
Theorem 1
Consider a distributional decision problem. Define the circuit family by
Then, is a -optimal predictor scheme for .
The proof is given in Appendix C.
The definition of is rather trivial, but the fact that Theorems A.1–A.7 apply to with regard to the error space is non-trivial. For example, this can be applied to the “classical” setting of logical uncertainty by taking to be the set of true sentences in some formal logic (e.g. , ). For a suitable choice of , will satisfy an approximate version of the coherence conditions because of the considerations here.
Appendix A
Fix an error space of rank 2, the polynomial from condition (iv).
Theorem A.1
Consider a distributional decision problem and a -optimal predictor scheme for . Suppose , are s.t.
Define
Then, .
Theorem A.2
Consider a word ensemble and , disjoint languages. Suppose is a -optimal predictor scheme for and is a -optimal predictor scheme for . Then, is a -optimal predictor scheme for .
Theorem A.3
Consider a word ensemble and , disjoint languages. Suppose is a -optimal predictor scheme for and is a -optimal predictor scheme for . Then, is a -optimal predictor scheme for .
Theorem A.4
Consider , distributional decision problems with respective -optimal predictor schemes and . Define as the family of circuits computing . Then, is a -optimal predictor scheme for .
Theorem A.5
Consider and a word ensemble. Assume is a -optimal predictor scheme for and is a -optimal predictor scheme for . Then is a -optimal predictor scheme for
Theorem A.6
Consider and a word ensemble. Assume . Assume is a -optimal predictor scheme for and is a -optimal predictor scheme for . Define as the circuit family computing
Then, is a -optimal predictor scheme for .
Definition A.1
Consider a word ensemble and two circuit families. We say is -similar to relative to (denoted ) when .
Theorem A.7
Consider a distributional decision problem, a -optimal predictor scheme for and a polynomial size family. Then, is a -optimal predictor scheme for if and only if .
Appendix B
Lemma B.1
Consider a distributional decision problem and a corresponding -optimal predictor scheme. Then, there is a function s.t.
(i) is non-decreasing in the third and fourth arguments.
(ii) For all polynomials , we have .
(iii) for all , and we have
The proof of Lemma B.1 is completely analogous to the proof of Lemma 2 for quasi-optimal predictors, so I omit it.
Proof of Theorem A.1
Define as the circuits computing
is bounded by a polynomial: since produces binary fractions of polynomial size, it is possible to compare them to the fixed numbers using a polynomial-size circuit, even if the latter have infinite binary expansions.
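As a concrete illustration of why this comparison is cheap: a dyadic rational with m bits after the binary point can be compared to a fixed constant using only the first m + 1 bits of the constant's binary expansion. A minimal Python sketch (the function and setup are mine, not from the post, and assume the constant is irrational so equality never occurs):

```python
from fractions import Fraction

def compare_dyadic_to_irrational(x: Fraction, c_bits: list[int]) -> int:
    """Compare a dyadic rational x = a / 2^m in [0, 1) to a fixed
    irrational constant c in (0, 1), given the first m + 1 bits of
    c's binary expansion. Returns -1 if x < c and +1 if x > c.
    """
    m = x.denominator.bit_length() - 1  # x has m bits after the point
    # Truncation of c to m + 1 bits: c_trunc <= c < c_trunc + 2^-(m+1).
    c_trunc = sum(Fraction(b, 2 ** (i + 1)) for i, b in enumerate(c_bits[: m + 1]))
    if x > c_trunc:
        # x is a multiple of 2^-m, so x >= c_trunc + 2^-(m+1) > c.
        return 1
    # Otherwise x <= c_trunc <= c, and x = c is impossible for irrational c.
    return -1

# First bits of the irrational constant sqrt(2) - 1 = 0.0110101000...
SQRT2_MINUS_1_BITS = [0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1]
```

The point mirrored by the circuit argument: only polynomially many bits of the hardcoded constant ever need to be inspected.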
We have
Define to be truncated to the first significant binary digit. Define as the circuits computing
Denote the set of for which . For , has a binary notation of polynomially bounded size, so is bounded by a polynomial for such .
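The truncation used here can be sketched in a couple of lines of Python (a toy model, keeping only the leading binary digit of a number in (0, 1]):

```python
import math

def trunc_first_bit(x: float) -> float:
    """Truncate x in (0, 1] to its first significant binary digit,
    i.e. the largest power of two that is <= x. The result satisfies
    x / 2 < trunc_first_bit(x) <= x, and its binary notation has
    size O(log(1/x)).
    """
    return 2.0 ** math.floor(math.log2(x))
```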
Applying Lemma B.1 we get
for .
Obviously , therefore
The expression on the left-hand side is a quadratic polynomial in which attains its maximum at and has roots at and . is between and , but not closer to than . Therefore, the inequality is preserved if we replace by .
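The quadratic fact being used here is elementary; in generic notation (the symbols α, β, t are mine, not the post's):

```latex
\[
q(t) = \alpha t - \beta t^2, \qquad \alpha, \beta > 0,
\]
has roots $t = 0$ and $t = \alpha/\beta$, attains its maximum at the
midpoint $t^\ast = \alpha/(2\beta)$, and is symmetric about $t^\ast$:
\[
q(t) = \beta\left( (t^\ast)^2 - (t - t^\ast)^2 \right).
\]
Hence $|t - t^\ast| \le |s - t^\ast|$ implies $q(t) \ge q(s)$, so
replacing a point by one no farther from the vertex preserves any
lower bound on $q$.
```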
Substituting the equation for we get
Thus for all we have
In particular, .
Appendix C
The following is a proof of Theorem 1.
Define by
To prove is -optimal, it is enough to consider families of the form for polynomial , since
That is, we need to prove that for any polynomial , .
Without loss of generality, assume for . We are interested in the function
This can be rewritten as
The numerator can be recast as an integral
Raising to a power is equivalent to adding a constant to , therefore
Since , we get
Using the assumption on , we conclude that , as needed.
If δ(k,j) ∈ Δ^2_avg, for what functions g does δ(k,g(k)) → 0?
Does this require j to grow faster than any quasipolynomial function of k? Because that’s pretty fast. I guess it’s clear that j is going to have to increase faster than any polynomial in k?
There are no functions with this property. You have to take the log-log uniform average over j (up to a superquasipolynomial function) in order to guarantee convergence to 0 (however, if you have a perfect predictor then you can amplify so that the error behaves like you described). I think it's possible to change the formalism in a way that replaces quasipolynomials by polynomials and superquasipolynomial functions by superpolynomial functions, but this requires introducing assumptions about the computational model, so I avoided it for now.
Note though that superquasipolynomial is still "far from exponential" in some sense, because there is a natural infinite tower of complexities between polynomial and exponential: the 0th level is the polynomial functions, and the (n+1)-st level consists of functions of the form 2^(f(log k)) where f is a function of the n-th level (so quasipolynomials are level 1).
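A toy numeric sketch of the first levels of this tower (the representative functions and names are mine), comparing growth via base-2 logarithms to keep the numbers manageable:

```python
import math

def log2_level0(n: int) -> float:
    """log2 of a level-0 (polynomial) representative, n^2."""
    return 2 * math.log2(n)

def log2_level1(n: int) -> float:
    """log2 of a level-1 (quasipolynomial) representative, 2^((log n)^2)."""
    return math.log2(n) ** 2

def log2_level2(n: int) -> float:
    """log2 of a level-2 representative, 2^(f(log n)) with f at level 1."""
    return 2.0 ** (math.log2(math.log2(n)) ** 2)

n = 2 ** 20
# Each level dominates the previous one, yet even level 2 is still
# far below the exponential 2^n (whose log2 is n itself).
assert log2_level0(n) < log2_level1(n) < log2_level2(n) < n
```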
Btw, if you’re trying to catch up on optimal predictors then I have a nearly finished draft of a paper with an orderly presentation and consistent notation that I can send you.