Predictor schemes with logarithmic advice
We introduce a variant of optimal predictor schemes where optimality holds within the space of random algorithms with logarithmic advice. These objects are also guaranteed to exist for the error space Δ2avg. We introduce the class of generatable problems and construct a uniform universal predictor scheme for this class which is optimal in the new sense with respect to the Δ2avg error space. This is achieved by a construction similar to Levin’s universal search.
Results
New notation
Given n∈N, evn:N×{0,1}∗n+1alg−→{0,1}∗ is the following algorithm. When evkn(Q,x1…xn) is computed, Q is interpreted as a program and Q(x1…xn) is executed for time k. The resulting output is produced.
The notation evk(Q,x1…xn) means evkn(Q,x1…xn).
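As a toy illustration, time-bounded evaluation can be modeled in Python with "programs" as generators that yield once per simulated step. The generator model and the empty-string output for non-halting runs are assumptions of this sketch; the definition above does not fix those details.

```python
def ev(k, program, *args):
    # Toy model of ev^k(Q, x1...xn): run `program` (a generator function
    # standing in for a program under a step-counting interpreter) for at
    # most k steps; return its output if it halts, else "" (a convention
    # assumed for this sketch).
    gen = program(*args)
    try:
        for _ in range(k):
            next(gen)
    except StopIteration as stop:
        return stop.value if stop.value is not None else ""
    return ""  # did not halt within k steps

def double_word(x):
    # Example "program": takes |x| steps, then outputs x concatenated
    # with itself.
    for _ in x:
        yield
    return x + x
```

For example, `ev(10, double_word, "01")` halts within the budget, while `ev(1, double_word, "01")` runs out of time and returns the empty string.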
β:{0,1}∗→[0,1] is the mapping from a binary expansion to the corresponding real number.
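For concreteness, a minimal sketch of β, assuming the standard convention that the word s1 s2 … sn is read as the binary expansion 0.s1 s2 … sn:

```python
def beta(s: str) -> float:
    # β: read s = s1 s2 ... sn as the binary expansion 0.s1 s2 ... sn
    # and return the corresponding real number in [0,1].
    return sum(int(b) / 2 ** (i + 1) for i, b in enumerate(s))
```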
Given μ a word ensemble, X a set, Q:{0,1}∗2alg−→X, TμQ(k,s) stands for the maximal runtime of Q(x,y) for x∈suppμk, y∈{0,1}s.
Previous posts focused on prediction of distributional decision problems, which is the “computational uncertainty” analogue of probability. Here, we use the broader concept of predicting distributional estimation problems (functions), which is analogous to expectation value.
Definition 1
A distributional estimation problem is a pair (f,μ) where f:{0,1}∗→[0,1] is an arbitrary function (even irrational values are allowed) and μ is a word ensemble.
Definition 2
Given an appropriate set X, consider P:N2×{0,1}∗3alg−→X, r:N2→N polynomial and a:N2→{0,1}∗. The triple (P,r,a) is called an X-valued (poly,log)-bischeme when
(i) The runtime of P(k,j,x,y,z) is bounded by p(k,j) with p polynomial.
(ii) |a(k,j)|≤c1+c2log(k+1)+c3log(j+1) for some c1,c2,c3∈N.
A [0,1]-valued (poly,log)-bischeme will also be called a (poly,log)-predictor scheme.
We think of P as a random algorithm where the second word parameter represents its internal coin tosses. The third word parameter represents the advice and we usually substitute a there.
We will use the notations Pkj(x,y,z):=P(k,j,x,y,z), akj:=a(k,j).
Definition 3
Fix Δ an error space of rank 2 and (f,μ) a distributional estimation problem. Consider (P,r,a) a (poly,log)-predictor scheme. (P,r,a) is called a Δ(poly,log)-optimal predictor scheme for (f,μ) when for any (poly,log)-predictor scheme (Q,s,b), there is δ∈Δ s.t.
Eμk×Ur(k,j)[(Pkj(x,y,akj)−f(x))2]≤Eμk×Us(k,j)[(Qkj(x,y,bkj)−f(x))2]+δ(k,j)
Note 1
The notation (poly,log) is meant to remind us that we allow a polynomial quantity of random bits r(k,j) and a logarithmic quantity of advice |akj|. In fact, the definitions and some of the theorems can be generalized to other quantities of randomness and advice (see also Note B.1). Thus, predictor schemes from previous posts are (poly,poly)-predictor schemes, (poly,O(1))-predictor schemes are limited to O(1) advice, (log,0)-predictor schemes use a logarithmic number of random bits and no advice, and so on. As usual in complexity theory, it is redundant to consider more advice than random bits, since advice is strictly more powerful.
Δ(poly,log)-optimal predictor schemes satisfy properties analogous to those of Δ-optimal predictor schemes. These properties are listed in Appendix A. The proofs of Theorems A.1 and A.4 are given in Appendix B. The other proofs are straightforward adaptations of the corresponding proofs with polynomial advice.
We also have the following existence result:
Theorem 1
Consider (f,μ) a distributional estimation problem. Define Υ:N2×{0,1}∗3alg−→[0,1] by
Υkj(x,y,Q):=β(evj(Q,x,y))
Define υf,μ:N2→{0,1}∗ by
υkjf,μ:=argmin|Q|≤logjEμk×Uj[(Υkj(x,y,Q)−f(x))2]
Then, (Υ,j,υf,μ) is a Δ2avg(poly,log)-optimal predictor scheme for (f,μ).
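The advice υf,μ is not computable in general, but its role can be illustrated by brute force: enumerate every candidate program of length at most log j and keep the one with the smallest expected squared error. In this sketch `loss(Q)` is a stand-in for the expectation Eμk×Uj[(Υkj(x,y,Q)−f(x))2] (how it is evaluated is left abstract), and base-2 logarithm is assumed.

```python
import itertools
import math

def upsilon(j, loss):
    # Brute-force the argmin over all binary strings Q with |Q| <= log j
    # (base-2 log assumed); `loss(Q)` stands in for the expected squared
    # error of running Q for time j on inputs drawn from μ^k.
    max_len = int(math.log2(j)) if j > 1 else 0
    candidates = (
        "".join(bits)
        for n in range(max_len + 1)
        for bits in itertools.product("01", repeat=n)
    )
    return min(candidates, key=loss)
```

The enumeration is exponential in the advice length, which is exactly why υf,μ is provided as advice rather than computed by the scheme itself.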
Note 2
Consider a distributional decision problem (D,μ). Assume (D,μ) admits n∈N, A:N×{0,1}∗3alg−→{0,1}, a:N→{0,1}∗ and a function r:N→N s.t.
(i) A(k,x,y,z) runs in quasi-polynomial time (O(2lognk)).
(ii) |a(k)|=O(lognk)
(iii) limk→∞Prμk×Ur(k)[A(k,x,y,a(k))≠χD(x)]=0
Then it is easy to see we can construct a (poly,log)-predictor scheme PA taking values in {0,1} s.t. E[(PA−χD)2]∈Δ2avg. The implication fails for larger time or advice bounds. Therefore, the uncertainty represented by Δ2avg(poly,log)-optimal predictor schemes is associated with the resource gap between quasi-polynomial time plus advice O(lognk) and the resources needed to (heuristically) solve the problem in question.
The proof of Theorem 1 is given in Appendix C: it is a straightforward adaptation of the corresponding proof for polynomial advice. Evidently, the above scheme is non-uniform. We will now describe a class of problems which admits uniform Δ2avg(poly,log)-optimal predictor schemes.
Definition 4
Consider Δ1 an error space of rank 1. A word ensemble μ is called Δ1(log)-sampleable when there is S:N×{0,1}∗2alg−→{0,1}∗ that runs in polynomial time in the 1st argument, aS:N→{0,1}∗ of logarithmic size and rS:N→N a polynomial such that
∑x∈{0,1}∗|μk(x)−PrUrS(k)[Sk(y,aS(k))=x]|∈Δ1
(S,rS,aS) is called a Δ1(log)-sampler for μ.
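The defining sum is a total variation distance, and for small parameters it can be checked exactly by exhausting the seed space. In this finite sketch `mu` is a dict giving μk as an explicit distribution and the advice string is omitted; both are simplifications for illustration.

```python
from itertools import product

def sampler_error(mu, S, r_k):
    # Σ_x |μ^k(x) − Pr_{y ~ U^{rS(k)}}[S^k(y) = x]|, computed exactly by
    # enumerating all seeds y ∈ {0,1}^{r_k}. Advice omitted for brevity.
    seeds = ["".join(bits) for bits in product("01", repeat=r_k)]
    counts = {}
    for y in seeds:
        x = S(y)
        counts[x] = counts.get(x, 0) + 1
    support = set(mu) | set(counts)
    return sum(abs(mu.get(x, 0.0) - counts.get(x, 0) / len(seeds))
               for x in support)
```

A sampler is Δ1(log)-good when this quantity, as a function of k, lies in Δ1.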
Definition 5
Consider Δ1 an error space of rank 1. A distributional estimation problem (f,μ) is called Δ1(log)-generatable when there are S:N×{0,1}∗2alg−→{0,1}∗ and F:N×{0,1}∗2alg−→[0,1] that run in polynomial time in the 1st argument, aS:N→{0,1}∗ of logarithmic size and rS:N→N a polynomial such that
(i) (S,rS,aS) is a Δ1(log)-sampler for μ.
(ii) EUrS(k)[(Fk(y,aS(k))−f(Sk(y,aS(k))))2]∈Δ1
(S,F,rS,aS) is called a Δ1(log)-generator for (f,μ).
When aS is the empty string, (S,F,rS) is called a Δ1(0)-generator for (f,μ). Such (f,μ) is called Δ1(0)-generatable.
Note 3
The class of Δ1(0)-generatable problems can be regarded as an average-case analogue of NP∩coNP. If f is a decision problem (i.e. its range is {0,1}), words y∈{0,1}rS(k) s.t. Sk(y)=x, Fk(y)=1 can be regarded as “proofs” of f(x)=1 and words y∈{0,1}rS(k) s.t. Sk(y)=x, Fk(y)=0 can be regarded as “proofs” of f(x)=0.
Theorem 2
There is an oracle machine Λ that accepts an oracle of signature SF:N×{0,1}∗→{0,1}∗×[0,1] and a polynomial r:N→N where the allowed oracle calls are SFk(x) for |x|=r(k) and computes a function of signature N2×{0,1}∗2→[0,1] s.t. for any (f,μ) a distributional estimation problem and G:=(S,F,rS,aS) a corresponding Δ10(log)-generator, Λ[G] is a Δ2avg(poly,log)-optimal predictor scheme for (f,μ).
In particular if (f,μ) is Δ10(0)-generatable, we get a uniform Δ2avg(poly,log)-optimal predictor scheme.
The following is the description of Λ. Consider SF:N×{0,1}∗→{0,1}∗×[0,1] and a polynomial r:N→N. We describe the computation of Λ[SF,r]kj(x) where the extra argument of Λ is regarded as internal coin tosses.
We loop over the first j words in lexicographic order. Each word is interpreted as a program Q:{0,1}∗2alg−→[0,1]. We loop over jk “test runs”. At test run i, we generate (xi∈{0,1}∗,ti∈[0,1]) by evaluating SFk(yi) for yi sampled from Ur(k). We then sample zi from Uj and compute si:=evj(Q,xi,zi). At the end of the test runs, we compute the average error ϵ(Q):=1jk∑i(si−ti)2. At the end of the loop over programs, the program Q∗ with the lowest error is selected and the output evj(Q∗,x) is produced.
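The selection loop above can be sketched as follows. This is a simplification: candidate programs are modeled as Python callables returning values in [0,1], the time-bounded evaluation evj is replaced by a direct call, and the oracle SF is an ordinary function; all of these are assumptions of the sketch, not part of Λ's actual definition.

```python
import random

def Lam(k, j, x, SF, r, programs):
    # Sketch of Λ[SF,r]^{kj}(x): score the first j candidate programs on
    # j*k generated test instances, then run the best one on x.
    best_Q, best_err = None, float("inf")
    for Q in programs[:j]:
        total = 0.0
        for _ in range(j * k):  # jk "test runs"
            y = "".join(random.choice("01") for _ in range(r(k)))
            xi, ti = SF(k, y)          # generated instance and its label
            zi = "".join(random.choice("01") for _ in range(j))
            si = Q(xi, zi)             # candidate's prediction (ev^j in Λ)
            total += (si - ti) ** 2
        err = total / (j * k)          # average squared error ε(Q)
        if err < best_err:
            best_Q, best_err = Q, err
    z = "".join(random.choice("01") for _ in range(j))
    return best_Q(x, z)                # run the winner on the actual input
```

As in Levin's universal search, the budget j controls both how many programs are considered and how thoroughly each is tested.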
The proof that this construction is Δ2avg(poly,log)-optimal is given in Appendix C.
Appendix A
Fix Δ an error space of rank 2.
Theorem A.1
Suppose there is a polynomial h:N2→N s.t. h−1∈Δ. Consider (f,μ) a distributional estimation problem and (P,r,a) a Δ(poly,log)-optimal predictor scheme for (f,μ). Suppose {pkj∈[0,1]}k,j∈N, {qkj∈[0,1]}k,j∈N are s.t.
∃ϵ>0∀k,j:(μk×Ur(k,j)){(x,y)∈{0,1}∗2∣pkj≤Pkj(x,y,akj)≤qkj}≥ϵ
Define
ϕkj:=Eμk×Ur(k,j)[f(x)−Pkj(x,y,akj)∣pkj≤Pkj(x,y,akj)≤qkj]
Assume that either pkj,qkj have a number of digits logarithmically bounded in k,j or Pkj produces outputs with a number of digits logarithmically bounded in k,j (by Theorem A.7 if any Δ(poly,log)-optimal predictor scheme exists for (f,μ) then a Δ(poly,log)-optimal predictor scheme with this property exists as well). Then, |ϕ|∈Δ.
Theorem A.2
Consider μ a word ensemble and f1,f2:{0,1}∗→[0,1] s.t. f1+f2≤1. Suppose (P1,r1,a1) is a Δ(poly,log)-optimal predictor scheme for (f1,μ) and (P2,r2,a2) is a Δ(poly,log)-optimal predictor scheme for (f2,μ). Define P:N2×{0,1}∗3alg−→[0,1] by Pkj(x,y1y2,(z1,z2)):=η(Pkj1(x,y1,z1)+Pkj2(x,y2,z2)) for |yi|=ri(k,j). Then, (P,r1+r2,a1a2) is a Δ(poly,log)-optimal predictor scheme for (f1+f2,μ).
Theorem A.3
Consider μ a word ensemble and f1,f2:{0,1}∗→[0,1] s.t. f1+f2≤1. Suppose (P1,r1,a1) is a Δ(poly,log)-optimal predictor scheme for (f1,μ) and (P,r2,a2) is a Δ(poly,log)-optimal predictor scheme for (f1+f2,μ). Define P2:N2×{0,1}∗3alg−→[0,1] by Pkj2(x,y1y2,(z1,z2)):=η(Pkj(x,y1,z1)−Pkj1(x,y2,z2)) for |yi|=ri(k,j). Then, (P2,r1+r2,a1a2) is a Δ(poly,log)-optimal predictor scheme for (f2,μ).
Theorem A.4
Fix Δ1 an error space of rank 1 s.t. given δ1∈Δ1, the function δ(k,j):=δ1(k) lies in Δ. Consider (f1,μ1), (f2,μ2) distributional estimation problems with respective Δ(poly,log)-optimal predictor schemes (P1,r1,a1) and (P2,r2,a2). Assume μ1 is Δ1(log)-sampleable and (f2,μ2) is Δ1(log)-generatable. Define f1×f2:{0,1}∗→[0,1] by (f1×f2)(x1,x2)=f1(x1)f2(x2) and (f1×f2)(y)=0 for y not of this form. Define P:N2×{0,1}∗3alg−→[0,1] by Pkj((x1,x2),y1y2,(z1,z2)):=Pkj1(x1,y1,z1)Pkj2(x2,y2,z2) for |yi|=ri(k,j). Then, (P,r1+r2,(a1,a2)) is a Δ(poly,log)-optimal predictor scheme for (f1×f2,μ1×μ2).
Theorem A.5
Consider f:{0,1}∗→[0,1], D⊆{0,1}∗ and μ a word ensemble. Assume (PD,rD,aD) is a Δ(poly,log)-optimal predictor scheme for (D,μ) and (Pf∣D,rf∣D,af∣D) is a Δ(poly,log)-optimal predictor scheme for (f,μ∣D). Define P:N2×{0,1}∗3alg−→[0,1] by Pkj(x,y1y2,(z1,z2)):=PkjD(x,y1,z1)Pkjf∣D(x,y2,z2) for |yi|=ri(k,j). Then (P,rD+rf∣D,(aD,af∣D)) is a Δ(poly,log)-optimal predictor scheme for (χDf,μ).
Theorem A.6
Fix h a polynomial s.t. 2−h∈Δ. Consider f:{0,1}∗→[0,1], D⊆{0,1}∗ and μ a word ensemble. Assume ∃ϵ>0∀k:μk(D)≥ϵ. Assume (PD,rD,aD) is a Δ(poly,log)-optimal predictor scheme for (D,μ) and (PχDf,rχDf,aχDf) is a Δ(poly,log)-optimal predictor scheme for (χDf,μ). Define Pf∣D:N2×{0,1}∗3alg−→[0,1] by
Pkjf∣D(x,y1y2,(z1,z2)) is defined by cases:
(i) 1, if PkjD(x,y2,z2)=0
(ii) η(PkjχDf(x,y1,z1)/PkjD(x,y2,z2)) rounded to h(k,j) binary places, if PkjD(x,y2,z2)>0
Then, (Pf∣D,rD+rχDf,(aχDf,aD)) is a Δ(poly,log)-optimal predictor scheme for (f,μ∣D).
Definition A.1
Consider μ a word ensemble and ^Q1:=(Q1,s1,b1), ^Q2:=(Q2,s2,b2) (poly,log)-predictor schemes. We say ^Q1 is Δ-similar to ^Q2 relative to μ (denoted ^Q1μ≃Δ^Q2) when Eμk×Us1(k,j)×Us2(k,j)[(Qkj1(x,y1,bkj1)−Qkj2(x,y2,bkj2))2]∈Δ.
Theorem A.7
Consider (f,μ) a distributional estimation problem, ^P a Δ(poly,log)-optimal predictor scheme for (f,μ) and ^Q a (poly,log)-predictor scheme. Then, ^Q is a Δ(poly,log)-optimal predictor scheme for (f,μ) if and only if ^Pμ≃Δ^Q.
Note A.1
Δ-similarity is not an equivalence relation on the set of arbitrary (poly,log)-predictor schemes. However, it is an equivalence relation on the set of (poly,log)-predictor schemes ^Q satisfying ^Qμ≃Δ^Q (i.e. the μ-expectation value of the intrinsic variance of ^Q is in Δ). In particular, for any f:{0,1}∗→[0,1] any Δ(poly,log)-optimal predictor scheme for (f,μ) has this property.
Appendix B
Definition B.1
Given n∈N, a function δ:N2+n→R≥0 is called Δ-moderate when
(i) δ is non-decreasing in arguments 3 to 2+n.
(ii) For any collection of polynomials {pi:N2→N}i<n, δ(k,j,p0(k,j)…pn−1(k,j))∈Δ
Lemma B.1
Fix (f,μ) a distributional estimation problem and ^P:=(P,r,a) a (poly,log)-predictor scheme. Then, ^P is Δ(poly,log)-optimal iff there is a Δ-moderate function δ:N4→[0,1] s.t. for any k,j,s∈N, Q:{0,1}∗2alg−→[0,1]
Eμk×Ur(k,j)[(Pkj(x,y,akj)−f(x))2]≤Eμk×Us[(Q(x,y)−f(x))2]+δ(k,j,TμQ(k,s),2|Q|)
Proof of Lemma B.1
Define
δ(k,j,t,u):=maxTμQ(k,s)≤t|Q|≤logu{Eμk×Ur(k,j)[(Pkj(x,y,akj)−f(x))2]−Eμk×Us[(Q(x,y)−f(x))2]}
Note B.1
Lemma B.1 shows that the error bound for Δ(poly,log)-optimal predictor schemes is in some sense uniform with respect to Q. This doesn't generalize to e.g. Δ(poly,O(1))-optimal predictor schemes. The latter still admit a weaker version of Theorem A.1 and direct analogues of Theorems A.2, A.3, A.5, A.6 and A.7. Theorem A.4 doesn't seem to generalize.
Lemma B.2
Suppose there is a polynomial h:N2→N s.t. h−1∈Δ. Fix (f,μ) a distributional estimation problem and (P,r,a) a corresponding Δ(poly,log)-optimal predictor scheme. Consider (Q,s,b) a (poly,log)-predictor scheme, M>0, w:N2×{0,1}∗3alg−→Q∩[0,M] with runtime bounded by a polynomial in the first two arguments, and u:N2→{0,1}∗ of logarithmic size. Then there is δ∈Δ s.t.
Eμk×Umax(r(k,j),s(k,j))[wkj(x,y,ukj)(Pkj(x,y≤r(k,j),akj)−f(x))2]≤Eμk×Umax(r(k,j),s(k,j))[wkj(x,y,ukj)(Qkj(x,y≤s(k,j),bkj)−f(x))2]+δ(k,j)
Proof of Lemma B.2
Given t∈[0,M], define αkj(t) to be t rounded within error h(k,j)−1. Thus, the number of digits in αkj(t) is logarithmic in k and j. Denote q(k,j):=max(r(k,j),s(k,j)). Consider ^Qt:=(Qt,r+s,bt) the (poly,log)-predictor scheme defined by
Qkjt(x,y,bkjt) is defined by cases:
(i) Qkj(x,y≤s(k,j),bkj), if wkj(x,y≤q(k,j),ukj)≥αkj(t)
(ii) Pkj(x,y≤r(k,j),akj), if wkj(x,y≤q(k,j),ukj)<αkj(t)
^Qt satisfies bounds on runtime and advice size uniform in t. Therefore, Lemma B.1 implies that there is δ∈Δ s.t.
Eμk×Ur(k,j)[(Pkj(x,y,akj)−f(x))2]≤Eμk×Ur(k,j)+s(k,j)[(Qkjt(x,y,bkj)−f(x))2]+δ(k,j)
Eμk×Ur(k,j)+s(k,j)[(Pkj(x,y≤r(k,j),akj)−f(x))2−(Qkjt(x,y,bkj)−f(x))2]≤δ(k,j)
Eμk×Uq(k,j)[θ(wkj(x,y,ukj)−αkj(t))((Pkj(x,y≤r(k,j),akj)−f(x))2−(Qkj(x,y≤s(k,j),bkj)−f(x))2)]≤δ(k,j)
Eμk×Uq(k,j)[∫M0θ(wkj(x,y,z,ukj)−αkj(t))dt((Pkj(x,y≤r(k,j),akj)−f(x))2−(Qkj(x,y≤s(k,j),bkj)−f(x))2)]≤Mδ(k,j)
Eμk×Uq(k,j)[wkj(x,y,z,ukj)((Pkj(x,y≤r(k,j),akj)−f(x))2−(Qkj(x,y≤s(k,j),bkj)−f(x))2)]≤Mδ(k,j)+h(k,j)−1
In the following proofs we will use shorthand notation that omits most of the symbols that are clear from the context. That is, we will use P to mean Pkj(x,y,akj), f to mean f(x), E[…] to mean Eμk×Ur(k,j)[…] etc.
Proof of Theorem A.1
Define w:N2×{0,1}∗3alg−→{0,1} and u:N2→{0,1}∗ by
w:=θ(P−p)θ(q−P)
We have
ϕ=E[w(f−P)]E[w]
Define ψ to be ϕ truncated to the first significant binary digit. Denote I⊆N2 the set of (k,j) for which |ϕkj|>h(k,j)−1. Consider (Q,s,b) a (poly,log)-predictor scheme satisfying
∀(k,j)∈I:Qkj=η(Pkj+ψkj)
Such Q exists since for (k,j)∈I, ψkj has binary notation of logarithmically bounded size.
Applying Lemma B.2 we get
∀(k,j)∈I:E[wkj(Pkj−f)2]≤E[wkj(Qkj−f)2]+δ(k,j)
for δ∈Δ.
∀(k,j)∈I:E[wkj((Pkj−f)2−(Qkj−f)2)]≤δ(k,j)
∀(k,j)∈I:E[wkj((Pkj−f)2−(η(Pkj+ψkj)−f)2)]≤δ(k,j)
Obviously (η(Pkj+ψkj)−f)2≤(Pkj+ψkj−f)2, therefore
∀(k,j)∈I:E[wkj((Pkj−f)2−(Pkj+ψkj−f)2)]≤δ(k,j)
∀(k,j)∈I:ψkjE[wkj(2(f−Pkj)−ψkj)]≤δ(k,j)
The expression on the left hand side is a quadratic polynomial in ψkj which attains its maximum at ϕkj and has roots at 0 and 2ϕkj. ψkj lies between 0 and ϕkj, but no closer to 0 than ϕkj/2. Therefore, the inequality is preserved if we replace ψkj by ϕkj/2.
∀(k,j)∈I:ϕkj2E[wkj(2(f−Pkj)−ϕkj2)]≤δ(k,j)
Substituting the equation for ϕkj we get
∀(k,j)∈I:12E[wkj(f−Pkj)]E[wkj]E[wkj(2(f−Pkj)−12E[wkj(f−Pkj)]E[wkj])]≤δ(k,j)
∀(k,j)∈I:34E[wkj(f−Pkj)]2E[wkj]≤δ(k,j)
∀(k,j)∈I:34E[wkj]ϕ2kj≤δ(k,j)
∀(k,j)∈I:ϕ2kj≤43E[wkj]−1δ(k,j)
∀(k,j)∈I:ϕ2kj≤43(μk×Ur(k,j)){pkj≤Pkj≤qkj}−1δ(k,j)
Thus for all k,j∈N we have
|ϕkj|≤h(k,j)−1+√43(μk×Ur(k,j)){pkj≤Pkj≤qkj}−1δ(k,j)
In particular, |ϕ|∈Δ.
Lemma B.3
Consider (f,μ) a distributional estimation problem and (P,r,a) a Δ(poly,log)-optimal predictor scheme for (f,μ). Then there are c1,c2∈R and a Δ-moderate function δ:N4→[0,1] s.t. for any k,j,s∈N, Q:{0,1}∗2alg−→Q
|Eμk×Us×Ur(k,j)[Q(Pkj−f)]|≤(c1+c2Eμk×Us[Q2])δ(k,j,TμQ(k,s),2|Q|)
Conversely, consider M∈Q and (P,r,a) a Q∩[−M,+M]-valued (poly,log)-bischeme. Suppose that for any Q∩[−M−1,+M]-valued (poly,log)-bischeme (Q,s,b) we have |E[Q(P−f)]|∈Δ.
Define ~P to be s.t. computing ~Pkj is equivalent to computing η(Pkj) rounded to h(k,j) digits after the binary point, where 2−h∈Δ. Then, ~P is a Δ(poly,log)-optimal predictor scheme for (f,μ).
Proof of Lemma B.3
Assume P is a Δ(poly,log)-optimal predictor scheme. Consider k,j,s∈N, Q:{0,1}∗2alg−→Q. Define t:=σ2−a where σ∈{±1} and a∈N. Define R:{0,1}∗2alg−→[0,1] to compute η(P+tQ) rounded within error 2−h. By Lemma B.1
Eμk×Ur(k,j)[(Pkj−f)2]≤Eμk×Ur(k,j)×Us[(R−f)2]+~δ(k,j,TμR(k,r(k,j)+s),2|R|)
where ~δ is Δ-moderate. It follows that
Eμk×Ur(k,j)[(Pkj−f)2]≤Eμk×Ur(k,j)×Us[(η(P+tQ)−f)2]+δ(k,j,TμQ(k,s),2|Q|)
where δ is Δ-moderate (a doesn’t enter the error bound because of the 2−h rounding). As in the proof of Theorem A.1, η can be dropped.
Eμk×Ur(k,j)×Us[(Pkj−f)2−(Pkj+tQ−f)2]≤δ(k,j,TμQ(k,s),2|Q|)
The expression on the left hand side is a quadratic polynomial in t. Explicitly:
−Eμk×Us[Q2]t2−2Eμk×Ur(k,j)×Us[Q(Pkj−f)]t≤δ(k,j,TμQ(k,s),2|Q|)
Moving Eμk×Us[Q2]t2 to the right hand side and dividing both sides by 2|t|=21−a we get
−Eμk×Ur(k,j)×Us[Q(Pkj−f)]σ≤2a−1δ(k,j,TμQ(k,s),2|Q|)+Eμk×Us[Q2]2−a−1
|Eμk×Ur(k,j)×Us[Q(Pkj−f)]|≤2a−1δ(k,j,TμQ(k,s),2|Q|)+Eμk×Us[Q2]2−a−1
Take a:=−12logδ(k,j,TμQ(k,s),2|Q|)+ϕ(k,j) where ϕ(k,j)∈[−12,+12] is the rounding error. We get
|Eμk×Ur(k,j)×Us[Q(Pkj−f)]|≤2ϕ(k,j)−1δ(k,j,TμQ(k,s),2|Q|)12+Eμk×Us[Q2]2−ϕ(k,j)−1δ(k,j,TμQ(k,s),2|Q|)12
Conversely, assume that for any $\mathbb{Q} \cap [-M-1,+M]$-valued $(\text{poly},\log)$-bischeme $(R,t,c)$
$$|\operatorname{E}[R\,(P-f)]| \leq \delta$$
Consider $(Q,s,b)$ a $(\text{poly},\log)$-predictor scheme. We have
$$\operatorname{E}[(Q-f)^2] = \operatorname{E}[(Q-P+P-f)^2]$$
$$\operatorname{E}[(Q-f)^2] = \operatorname{E}[(Q-P)^2] + \operatorname{E}[(P-f)^2] + 2\operatorname{E}[(Q-P)(P-f)]$$
$$2\operatorname{E}[(P-Q)(P-f)] = \operatorname{E}[(P-f)^2] - \operatorname{E}[(Q-f)^2] + \operatorname{E}[(Q-P)^2]$$
Taking $R$ to be $P-Q$ we get
$$\operatorname{E}[(P-f)^2] - \operatorname{E}[(Q-f)^2] + \operatorname{E}[(Q-P)^2] \leq 2\delta$$
where $\delta \in \Delta$. Noting that $\operatorname{E}[(Q-P)^2] \geq 0$ and $(\eta(P)-f)^2 \leq (P-f)^2$ (since $f$ takes values in $[0,1]$), we get
$$\operatorname{E}[(\eta(P)-f)^2] - \operatorname{E}[(Q-f)^2] \leq 2\delta$$
Observing that $\tilde{P} - \eta(P)$ is bounded by a function in $\Delta$, we get the desired result.
Theorems A.2 and A.3 follow immediately from Lemma B.3, so we omit their proofs.
Proof of Theorem A.4
We have
$$P(x_1,x_2) - (f_1 \times f_2)(x_1,x_2) = (P_1(x_1)-f_1(x_1))\,f_2(x_2) + P_1(x_1)\,(P_2(x_2)-f_2(x_2))$$
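This is the usual add-and-subtract identity: assuming, as in the setting of Theorem A.4, that $P(x_1,x_2) = P_1(x_1)P_2(x_2)$, adding and subtracting $P_1 f_2$ gives

```latex
\begin{align*}
P_1 P_2 - f_1 f_2
  &= P_1 f_2 - f_1 f_2 + P_1 P_2 - P_1 f_2 \\
  &= (P_1 - f_1)\,f_2 + P_1\,(P_2 - f_2)
\end{align*}
```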
Therefore, for any $\mathbb{Q} \cap [-1,+1]$-valued $(\text{poly},\log)$-bischeme $(Q,s,b)$
$$|\operatorname{E}[Q\,(P - f_1 \times f_2)]| \leq |\operatorname{E}[Q(x_1,x_2)\,(P_1(x_1)-f_1(x_1))\,f_2(x_2)]| + |\operatorname{E}[Q(x_1,x_2)\,P_1(x_1)\,(P_2(x_2)-f_2(x_2))]|$$
By Lemma B.3, it is sufficient to show an appropriate bound for each of the terms on the right hand side. Suppose $(S_2,F_2,r_{S_2},a_{S_2})$ is a $\Delta_1(\log)$-generator for $(f_2,\mu_2)$. For the first term, we have
$$|\operatorname{E}_{\mu_1^k \times \mu_2^k \times U^{s(k,j)+r_1(k,j)}}[Q^{kj}(x_1,x_2)\,(P_1^{kj}(x_1)-f_1(x_1))\,f_2(x_2)]| \leq |\operatorname{E}_{\mu_1^k \times U^{r_{S_2}(k)} \times U^{s(k,j)+r_1(k,j)}}[Q^{kj}(x_1,S_2^k)\,(P_1^{kj}(x_1)-f_1(x_1))\,F_2^k]| + \delta_{12}(k)$$
where $\delta_{12} \in \Delta_1$. Applying Lemma B.3 for $P_1$, we get
$$|\operatorname{E}_{\mu_1^k \times \mu_2^k \times U^{s(k,j)+r_1(k,j)}}[Q^{kj}(x_1,x_2)\,(P_1^{kj}(x_1)-f_1(x_1))\,f_2(x_2)]| \leq \delta_1(k,j) + \delta_{12}(k)$$
where $\delta_1 \in \Delta$.
Suppose $(S_1,r_{S_1},a_{S_1})$ is a $\Delta_1(\log)$-sampler for $\mu_1$. For the second term, we have
$$|\operatorname{E}_{\mu_1^k \times \mu_2^k \times U^{s(k,j)+r_1(k,j)}}[Q^{kj}(x_1,x_2)\,P_1^{kj}(x_1)\,(P_2^{kj}(x_2)-f_2(x_2))]| \leq |\operatorname{E}_{U^{r_{S_1}(k)} \times \mu_2^k \times U^{s(k,j)+r_1(k,j)}}[Q^{kj}(S_1^k,x_2)\,P_1^{kj}(S_1^k)\,(P_2^{kj}(x_2)-f_2(x_2))]| + \delta_{11}(k)$$
where $\delta_{11} \in \Delta_1$. Applying Lemma B.3 for $P_2$, we get
$$|\operatorname{E}_{\mu_1^k \times \mu_2^k \times U^{s(k,j)+r_1(k,j)}}[Q^{kj}(x_1,x_2)\,P_1^{kj}(x_1)\,(P_2^{kj}(x_2)-f_2(x_2))]| \leq \delta_2(k,j) + \delta_{11}(k)$$
where $\delta_2 \in \Delta$. Again, we obtain the required bound.
Appendix C
Proposition C.1
Consider a polynomial $q: \mathbb{N}^2 \to \mathbb{N}$. There is a function $\lambda_q: \mathbb{N}^3 \to [0,1]$ s.t.
(i) $\forall k,j \in \mathbb{N}: \sum_{i \in \mathbb{N}} \lambda_q(k,j,i) = 1$
(ii) For any function $\epsilon: \mathbb{N}^2 \to [0,1]$ we have
$$\epsilon(k,j) - \sum_{i \in \mathbb{N}} \lambda_q(k,j,i)\,\epsilon(k,q(k,j)+i) \in \Delta^{avg}_2$$
Proof of Proposition C.1
Given functions $q_1,q_2: \mathbb{N}^2 \to \mathbb{N}$ s.t. $q_1(k,j) \geq q_2(k,j)$ for $k,j \gg 0$, the proposition for $q_1$ implies the proposition for $q_2$ by setting
$$\lambda_{q_2}(k,j,i) := \begin{cases} \lambda_{q_1}(k,j,i-q_1(k,j)+q_2(k,j)) & \text{if } i-q_1(k,j)+q_2(k,j) \geq 0 \\ 0 & \text{if } i-q_1(k,j)+q_2(k,j) < 0 \end{cases}$$
Therefore, it is enough to prove the proposition for functions of the form $q(k,j) = j^{m+n\frac{\log k}{\log 3}}$ for $m > 0$.
Consider $F: \mathbb{N} \to \mathbb{N}$ s.t.
$$\lim_{k \to \infty} \frac{\log\log k}{\log\log F(k)} = 0$$
Observe that
$$\lim_{k \to \infty} \frac{\log(m+n\frac{\log k}{\log 3})}{\log\log F(k) - \log\log 3} = 0$$
$$\lim_{k \to \infty} \frac{\int_{x=3}^{3^{m+n\frac{\log k}{\log 3}}} d(\log\log x)}{\log\log F(k) - \log\log 3} = 0$$
Since $\epsilon$ takes values in $[0,1]$
$$\lim_{k \to \infty} \frac{\int_{x=3}^{3^{m+n\frac{\log k}{\log 3}}} \epsilon(k,\lfloor x \rfloor)\,d(\log\log x)}{\log\log F(k) - \log\log 3} = 0$$
Similarly
$$\lim_{k \to \infty} \frac{\int_{x=F(k)}^{F(k)^{m+n\frac{\log k}{\log 3}}} \epsilon(k,\lfloor x \rfloor)\,d(\log\log x)}{\log\log F(k) - \log\log 3} = 0$$
The last two equations imply that
$$\lim_{k \to \infty} \frac{\int_{x=3}^{F(k)} \epsilon(k,\lfloor x \rfloor)\,d(\log\log x) - \int_{x=3^{m+n\frac{\log k}{\log 3}}}^{F(k)^{m+n\frac{\log k}{\log 3}}} \epsilon(k,\lfloor x \rfloor)\,d(\log\log x)}{\log\log F(k) - \log\log 3} = 0$$
Raising $x$ to a power is equivalent to adding a constant to $\log\log x$, therefore
$$\lim_{k \to \infty} \frac{\int_{x=3}^{F(k)} \epsilon(k,\lfloor x \rfloor)\,d(\log\log x) - \int_{x=3}^{F(k)} \epsilon(k,\lfloor x^{m+n\frac{\log k}{\log 3}} \rfloor)\,d(\log\log x)}{\log\log F(k) - \log\log 3} = 0$$
$$\lim_{k \to \infty} \frac{\int_{x=3}^{F(k)} \big(\epsilon(k,\lfloor x \rfloor) - \epsilon(k,\lfloor x^{m+n\frac{\log k}{\log 3}} \rfloor)\big)\,d(\log\log x)}{\log\log F(k) - \log\log 3} = 0$$
Since $\lfloor x^{m+n\frac{\log k}{\log 3}} \rfloor \geq \lfloor x \rfloor^{m+n\frac{\log k}{\log 3}}$, we can choose $\lambda_q$ satisfying condition (i) so that
$$\int_{x=j}^{j+1} \epsilon(k,\lfloor x^{m+n\frac{\log k}{\log 3}} \rfloor)\,d(\log\log x) = (\log\log(j+1) - \log\log j) \sum_i \lambda_q(k,j,i)\,\epsilon(k,j^{m+n\frac{\log k}{\log 3}}+i)$$
It follows that
$$\int_{x=j}^{j+1} \epsilon(k,\lfloor x^{m+n\frac{\log k}{\log 3}} \rfloor)\,d(\log\log x) = \int_{x=j}^{j+1} \sum_i \lambda_q(k,\lfloor x \rfloor,i)\,\epsilon(k,\lfloor x \rfloor^{m+n\frac{\log k}{\log 3}}+i)\,d(\log\log x)$$
$$\lim_{k \to \infty} \frac{\int_{x=3}^{F(k)} \big(\epsilon(k,\lfloor x \rfloor) - \sum_i \lambda_q(k,\lfloor x \rfloor,i)\,\epsilon(k,\lfloor x \rfloor^{m+n\frac{\log k}{\log 3}}+i)\big)\,d(\log\log x)}{\log\log F(k) - \log\log 3} = 0$$
$$\lim_{k \to \infty} \frac{\sum_{j=3}^{F(k)-1} (\log\log(j+1) - \log\log j)\big(\epsilon(k,j) - \sum_i \lambda_q(k,j,i)\,\epsilon(k,j^{m+n\frac{\log k}{\log 3}}+i)\big)}{\log\log F(k) - \log\log 3} = 0$$
$$\epsilon(k,j) - \sum_{i \in \mathbb{N}} \lambda_q(k,j,i)\,\epsilon(k,q(k,j)+i) \in \Delta^{avg}_2$$
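To make the choice of $\lambda_q$ concrete, here is a Python sketch (the function name, and the simplification to a fixed integer exponent $e$ in place of $m+n\frac{\log k}{\log 3}$, are mine, purely for illustration). It computes the weight of each $i$ as the normalized $d(\log\log x)$-mass of the set $\{x \in [j,j+1] : \lfloor x^e \rfloor = j^e + i\}$:

```python
import math

def lambda_weights(j, e):
    """Weights lambda_q(k, j, i) for q(k, j) = j**e with a fixed integer
    exponent e (the dependence on k is suppressed): the weight of i is the
    d(log log x)-measure of {x in [j, j+1] : floor(x**e) == j**e + i},
    normalized by the total measure of [j, j+1]."""
    total = math.log(math.log(j + 1)) - math.log(math.log(j))
    weights = {}
    # floor(x**e) == t exactly when x lies in [t**(1/e), (t+1)**(1/e))
    for t in range(j ** e, (j + 1) ** e):
        lo = max(j, t ** (1 / e))
        hi = min(j + 1, (t + 1) ** (1 / e))
        if hi > lo:
            mass = math.log(math.log(hi)) - math.log(math.log(lo))
            weights[t - j ** e] = mass / total
    return weights

w = lambda_weights(j=3, e=2)   # support: i = 0..6, i.e. floor(x**2) in 9..15
print(w)
```

For $j = 3$, $e = 2$ the support is $i \in \{0,\dots,6\}$ and the weights sum to $1$, matching condition (i); condition (ii) then follows from the telescoping of the $d(\log\log x)$-masses exactly as in the proof.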
Lemma C.1
Consider $(f,\mu)$ a distributional estimation problem and $(P,r,a)$, $(Q,s,b)$ $(\text{poly},\log)$-predictor schemes. Suppose $p: \mathbb{N}^2 \to \mathbb{N}$ a polynomial and $\delta \in \Delta^{avg}_2$ are s.t.
$$\forall i,k,j \in \mathbb{N}: \operatorname{E}[(P^{k,p(k,j)+i}-f)^2] \leq \operatorname{E}[(Q^{kj}-f)^2] + \delta(k,j)$$
Then there is $\delta' \in \Delta^{avg}_2$ s.t.
$$\operatorname{E}[(P^{kj}-f)^2] \leq \operatorname{E}[(Q^{kj}-f)^2] + \delta'(k,j)$$
Proof of Lemma C.1
By Proposition C.1 we have
$$\tilde{\delta}(k,j) := \operatorname{E}[(P^{kj}-f)^2] - \sum_i \lambda_p(k,j,i)\,\operatorname{E}[(P^{k,p(k,j)+i}-f)^2] \in \Delta^{avg}_2$$
$$\operatorname{E}[(P^{kj}-f)^2] = \sum_i \lambda_p(k,j,i)\,\operatorname{E}[(P^{k,p(k,j)+i}-f)^2] + \tilde{\delta}(k,j)$$
$$\operatorname{E}[(P^{kj}-f)^2] \leq \sum_i \lambda_p(k,j,i)\,\big(\operatorname{E}[(Q^{kj}-f)^2] + \delta(k,j)\big) + \tilde{\delta}(k,j)$$
$$\operatorname{E}[(P^{kj}-f)^2] \leq \operatorname{E}[(Q^{kj}-f)^2] + \delta(k,j) + \tilde{\delta}(k,j)$$
so $\delta' := \delta + \tilde{\delta} \in \Delta^{avg}_2$ works.
Proof of Theorem 1
Define $\epsilon(k,j)$ by
$$\epsilon(k,j) := \operatorname{E}_{\mu^k \times U^j}[(\Upsilon^{kj}(x,y,\upsilon^{kj}_{f,\mu}) - f(x))^2]$$
It is easily seen that
$$\epsilon(k,j) \leq \min_{\substack{|Q| \leq \log j \\ T^\mu_Q(k,j) \leq j}} \operatorname{E}_{\mu^k \times U^j}[(Q(x,y) - f(x))^2]$$
Therefore, there is a polynomial $p: \mathbb{N}^3 \to \mathbb{N}$ s.t. for any $(\text{poly},\log)$-predictor scheme $(Q,s,b)$
$$\forall i,j,k \in \mathbb{N}: \epsilon(k, p(s(k,j), T^\mu_{Q^{kj}}(k,s(k,j)), 2^{|Q|+|b^{kj}|}) + i) \leq \operatorname{E}_{\mu^k \times U^{s(k,j)}}[(Q^{kj}-f)^2]$$
Applying Lemma C.1, we get the desired result.
Proof of Theorem 2
Consider $(P,r,a)$ a $(\text{poly},\log)$-predictor scheme. Choose $p: \mathbb{N}^2 \to \mathbb{N}$ a polynomial s.t. evaluating $\Lambda[G]^{k,p(k,j)}$ involves running $P^{kj}$ until it halts "naturally" (such $p$ exists because $P$ runs in at most polynomial time and has at most logarithmic advice). Given $i,j,k \in \mathbb{N}$, consider the execution of $\Lambda[G]^{k,p(k,j)+i}$. The standard deviation of $\epsilon(P^{kj})$ with respect to the internal coin tosses of $\Lambda$ is at most $((p(k,j)+i)k)^{-\frac{1}{2}}$. The expectation value is $\operatorname{E}[(P^{kj}-f)^2] + \gamma_P$ where $|\gamma_P| \leq \delta(k)$ for $\delta \in \Delta_1^0$ which doesn't depend on $i,k,j,P$. By Chebyshev's inequality,
$$\Pr[\epsilon(P^{kj}) \geq \operatorname{E}[(P^{kj}-f)^2] + \delta(k) + ((p(k,j)+i)k)^{-\frac{1}{4}}] \leq ((p(k,j)+i)k)^{-\frac{1}{2}}$$
Hence
$$\Pr[\epsilon(Q^*) \geq \operatorname{E}[(P^{kj}-f)^2] + \delta(k) + ((p(k,j)+i)k)^{-\frac{1}{4}}] \leq ((p(k,j)+i)k)^{-\frac{1}{2}}$$
The standard deviation of $\epsilon(Q)$ for any $Q$ is also at most $((p(k,j)+i)k)^{-\frac{1}{2}}$. The expectation value is $\operatorname{E}[(\operatorname{ev}^{p(k,j)+i}(Q)-f)^2] + \gamma_Q$ where $|\gamma_Q| \leq \delta(k)$. Therefore
$$\Pr[\exists Q < p(k,j)+i: \epsilon(Q) \leq \operatorname{E}[(\operatorname{ev}^{p(k,j)+i}(Q)-f)^2] - \delta(k) - k^{-\frac{1}{4}}] \leq (p(k,j)+i)\,(p(k,j)+i)^{-1} k^{-\frac{1}{2}} = k^{-\frac{1}{2}}$$
The extra $p(k,j)+i$ factor comes from summing probabilities over $p(k,j)+i$ programs. Combining, we get
$$\Pr[\operatorname{E}[(\operatorname{ev}^{p(k,j)+i}(Q^*)-f)^2] \geq \operatorname{E}[(P^{kj}-f)^2] + 2\delta(k) + ((p(k,j)+i)^{-\frac{1}{4}}+1)k^{-\frac{1}{4}}] \leq ((p(k,j)+i)^{-\frac{1}{2}}+1)k^{-\frac{1}{2}}$$
$$\operatorname{E}[(\Lambda[G]^{k,p(k,j)+i}-f)^2] \leq \operatorname{E}[(P^{kj}-f)^2] + 2\delta(k) + ((p(k,j)+i)^{-\frac{1}{4}}+1)k^{-\frac{1}{4}} + ((p(k,j)+i)^{-\frac{1}{2}}+1)k^{-\frac{1}{2}}$$
$$\operatorname{E}[(\Lambda[G]^{k,p(k,j)+i}-f)^2] \leq \operatorname{E}[(P^{kj}-f)^2] + 2\delta(k) + (p(k,j)^{-\frac{1}{4}}+1)k^{-\frac{1}{4}} + (p(k,j)^{-\frac{1}{2}}+1)k^{-\frac{1}{2}}$$
Applying Lemma C.1 we get the desired result.
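The concentration step above is plain Chebyshev: averaging $N = (p(k,j)+i)k$ bounded samples gives an estimate with variance at most $N^{-1}$, so a deviation of $N^{-\frac{1}{4}}$ has probability at most $N^{-\frac{1}{2}}$. A small Monte Carlo sketch of this step (the sample count, trial count, and the value of the true error are hypothetical stand-ins):

```python
import random

random.seed(0)

N = 2500          # stands in for the (p(k,j)+i)k coin tosses of Lambda
true_err = 0.3    # stands in for E[(P^{kj}-f)^2]
trials = 400

# Each trial estimates the error from N samples of a [0,1]-bounded loss,
# then checks whether the estimate overshoots the mean by N**-0.25.
exceed = 0
for _ in range(trials):
    est = sum(random.random() < true_err for _ in range(N)) / N
    if est >= true_err + N ** -0.25:
        exceed += 1

# Chebyshev's inequality guarantees an exceedance frequency of at most
# N**-0.5 (here 0.02), up to Monte Carlo noise; in practice the observed
# frequency is far smaller, since the actual deviations are Gaussian-scale.
print(exceed / trials, N ** -0.5)
```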