I introduce the concept of “optimal predictor scheme” which differs from (quasi)optimal predictors in depending on an additional parameter representing the amount of computing resources the predictor is allowed to use. For a certain flavor of optimal predictor scheme, I prove existence for arbitrary distributional decision problems.
Results
It is convenient to think of the concepts of “optimal predictor” and “quasi-optimal predictor” as special cases of the more general Δ-optimal predictors corresponding to different choices of the “error space” Δ.
Definition 1
Let r be a positive integer. An error space of rank rΔ is a set of bounded functions from Nr to R≥0 s.t.
(i) If δ1,δ2∈Δ then δ1+δ2∈Δ.
(ii) If δ1∈Δ and δ2≤δ1 then δ2∈Δ.
(iii) Given α∈(0,1], if δ∈Δ then δα∈Δ.
(iv) There is a polynomial h:Nr→R s.t. 2−h∈Δ.
Example 1.1
Δ10 is the set of functions δ:N→R≥0 s.t. limk→∞δ(k)=0.
Example 1.2
Δ1neg is the set of negligible functions δ:N→R≥0.
Example 1.3
Δ2avg is the set of bounded functions δ:N2→R≥0 s.t. for any f:N→N, if
It is slightly non-obvious that condition (iii) holds in this example. To see this, note that the expression inside the limit can be regarded as Eλkf[δ(k,j)] where λkf is a certain probability measure over j. limk→∞Eλkf[δ(k,j)]=0 implies that for any ϵ>0, limk→∞Prλkf[δ(k,j)>ϵ]=0 since Eλkf[δ(k,j)]≥ϵPrλkf[δ(k,j)>ϵ]. For any ϵ, we have Eλkf[δ(k,j)α]≤ϵ+Prλkf[δ(k,j)>ϵα−1]supδ, therefore limk→∞Eλkf[δ(k,j)α]=0.
Definition 2
Consider Δ an error space of rank 1 and (D,μ) a distributional decision problem. A Δ-optimal predictor for (D,μ) is a family of polynomial size Boolean circuits {Pk:suppμkcirc−−→[0,1]}k∈N s.t. for any family of polynomial size Boolean circuits {Qk:suppμkcirc−−→[0,1]}k∈N we have
Eμk[(Pk(x)−χD(x))2]≤Eμk[(Qk(x)−χD(x))2]+δ(k)
where δ∈Δ.
In particular, optimal predictors are Δ1neg-optimal predictors and quasi-optimal predictors are Δ10-optimal predictors.
Definition 3
Consider Δ an error space of rank 2 and (D,μ) a distributional decision problem. A Δ-optimal predictor scheme for (D,μ) is a family of Boolean circuits {Pkj:suppμkcirc−−→[0,1]}k,j∈N s.t.
(i) |Pkj|≤p(k,j) for some polynomial p.
(ii) for any family of Boolean circuits {Qkj:suppμkcirc−−→[0,1]}k,j∈N that satisfies (i) we have
Eμk[(Pkj(x)−χD(x))2]≤Eμk[(Qkj(x)−χD(x))2]+δ(k,j)
where δ∈Δ.
It is straightforward to generalize Theorems 1-8 about quasi-optimal predictors to Δ-optimal predictors and Δ-optimal predictor schemes. The versions for optimal predictor schemes are stated in Appendix A without proof. The only theorem whose proof requires a slightly non-obvious tweak is Theorem 1. The amended proof is given in Appendix B.
The main technical novelty of this post is the following existence theorem.
Theorem 1
Consider (D,μ) a distributional decision problem. Define the circuit family P∗D,μ by
P∗kjD,μ:=argmin|Q|≤jEμk[(Q(x)−χD(x))2]
Then, P∗D,μ is a Δ2avg-optimal predictor scheme for (D,μ).
The proof is given in Appendix C.
The definition of P∗ is rather trivial, but the fact Theorems A.1-A.7 apply to P∗ with regard to the error space Δ2avg is non-trivial. For example, this can be applied to the “classical” setting of logical uncertainty by taking D to be the set of true sentences in some formal logic F (e.g. PA, ZFC). For a suitable choice of μ, P∗D,μ will satisfy an approximate version of the coherence conditions because of the considerations here.
Appendix A
Fix Δ an error space of rank 2, h the polynomial from condition (iv).
Theorem A.1
Consider (D,μ) a distributional decision problem and P a Δ-optimal predictor scheme for (D,μ). Suppose {pkj∈[0,1]}k,j∈N, {qkj∈[0,1]}k,j∈N are s.t.
∃ϵ>0∀k,j:μk{x∈{0,1}∗∣pkj≤Pkj(x)≤qkj}≥ϵ
Define
ϕkj:=Eμk[χD(x)−Pkj(x)∣pkj≤Pkj(x)≤qkj]
Then, |ϕ|∈Δ.
Theorem A.2
Consider μ a word ensemble and D1, D2 disjoint languages. Suppose P1 is a Δ-optimal predictor scheme for (D1,μ) and P2 is a Δ-optimal predictor scheme for (D2,μ). Then, P:=η(P1+P2) is a Δ-optimal predictor scheme for (D1∪D2,μ).
Theorem A.3
Consider μ a word ensemble and D1, D2 disjoint languages. Suppose P1 is a Δ-optimal predictor scheme for (D1,μ) and P is a Δ-optimal predictor scheme for (D1∪D2,μ). Then, P2:=η(P−P1) is a Δ-optimal predictor scheme for (D2,μ).
Theorem A.4
Consider (D1,μ1), (D2,μ2) distributional decision problems with respective Δ-optimal predictor schemes P1 and P2. Define {Pkj:suppμkcirc−−→[0,1]}k∈N as the family of circuits computing Pkj((x1,x2)):=Pkj1(x1)Pkj2(x2). Then, P is a Δ-optimal predictor scheme for (D1×D2,μ1×μ2).
Theorem A.5
Consider C,D⊆{0,1}∗ and μ a word ensemble. Assume PD is a Δ-optimal predictor scheme for (D,μ) and PC∣D is a Δ-optimal predictor scheme for (C,μ∣D). Then PDPC∣D is a Δ-optimal predictor scheme for (C∩D,μ).
Theorem A.6
Consider C,D⊆{0,1}∗ and μ a word ensemble. Assume ∃ϵ>0∀k:μk(D)≥ϵ. Assume PD is a Δ-optimal predictor scheme for (D,μ) and PC∩D is a Δ-optimal predictor scheme for (C∩D,μ). Define PC∣D as the circuit family computing
PkjC∣D(x):=⎧⎪
⎪⎨⎪
⎪⎩1if PkjD(x)=0η(PkjC∩D(x)PkjD(x))rounded to h(k,j) binary places if PkjD(x)>0
Then, PC∣D is a Δ-optimal predictor scheme for (C,μ∣D).
Definition A.1
Consider μ a word ensemble and {Qkj1,2:suppμkcirc−−→[0,1]}k,j∈N two circuit families. We say Q1 is Δ-similar to Q2 relative to μ (denoted Q1μ≃ΔQ2) when Eμk[(Qk1(x)−Qk2(x))2]∈Δ.
Theorem A.7
Consider (D,μ) a distributional decision problem, P a Δ-optimal predictor scheme for (D,μ) and {Qkj:suppμkcirc−−→[0,1]}k∈N a polynomial size family. Then, Q is a Δ-optimal predictor scheme for (D,μ) if and only if Pμ≃ΔQ.
Appendix B
Lemma B.1
Consider (D,μ) a distributional decision problem and P a corresponding Δ-optimal predictor scheme. Then, there is a function δ:N4→[0,1] s.t.
(i) δ is non-decreasing in the third and fourth arguments.
(ii) For all polynomials p,q:N2→N, we have δ(k,j,p(k,j),q(k,j))∈Δ.
(iii) for all k∈N, Q:suppμkcirc−−→[0,1] and w:suppμkcirc−−→Q≥0 we have
Define {wkj:suppμkcirc−−→{0,1}}k,j∈N as the circuits computing
wkj(x):=θ(Pkj(x)−pkj)θ(qkj−Pkj(x))
|wkj| is bounded by a polynomial since Pkj produces binary fractions of polynomial size therefore it is possible to compare them to the fixed numbers pkj,qkj using a polynomial size circuit even if the latter have infinite binary expansions.
We have
ϕkj=Eμk[wkj(x)(χD(x)−Pkj(x))]Eμk[wkj(x)]
Define ψkj to be ϕkj truncated to the first significant binary digit. Define {Qkj:suppμkcirc−−→[0,1]}k,j∈N as the circuits computing
Qkj(x):=η(Pkj(x)+ψkj)
Denote I⊆N2 the set of (k,j) for which ϕkj>2−h(k,j). For (k,j)∈I, ψkj has binary notation of polynomially bounded size, therefore |Qkj| is bounded by a polynomial for such (k,j).
The expression on the left hand side is a quadratic polynomial in ψkj which attains its maximum at ϕkj and has roots at 0 and 2ϕkj. ψkj is between 0 and ϕkj, but not closer to 0 than ϕkj2. Therefore, the inequality is preserved if we replace ψkj by ϕkj2.
Optimal predictor schemes
I introduce the concept of “optimal predictor scheme” which differs from (quasi)optimal predictors in depending on an additional parameter representing the amount of computing resources the predictor is allowed to use. For a certain flavor of optimal predictor scheme, I prove existence for arbitrary distributional decision problems.
Results
It is convenient to think of the concepts of “optimal predictor” and “quasi-optimal predictor” as special cases of the more general Δ-optimal predictors corresponding to different choices of the “error space” Δ.
Definition 1
Let r be a positive integer. An error space of rank r Δ is a set of bounded functions from Nr to R≥0 s.t.
(i) If δ1,δ2∈Δ then δ1+δ2∈Δ.
(ii) If δ1∈Δ and δ2≤δ1 then δ2∈Δ.
(iii) Given α∈(0,1], if δ∈Δ then δα∈Δ.
(iv) There is a polynomial h:Nr→R s.t. 2−h∈Δ.
Example 1.1
Δ10 is the set of functions δ:N→R≥0 s.t. limk→∞δ(k)=0.
Example 1.2
Δ1neg is the set of negligible functions δ:N→R≥0.
Example 1.3
Δ2avg is the set of bounded functions δ:N2→R≥0 s.t. for any f:N→N, if
limk→∞loglogkloglogf(k)=0
then
limk→∞f(k)−1∑j=3(loglog(j+1)−loglogj)δ(k,j)loglogf(k)−loglog3=0
It is slightly non-obvious that condition (iii) holds in this example. To see this, note that the expression inside the limit can be regarded as Eλkf[δ(k,j)] where λkf is a certain probability measure over j. limk→∞Eλkf[δ(k,j)]=0 implies that for any ϵ>0, limk→∞Prλkf[δ(k,j)>ϵ]=0 since Eλkf[δ(k,j)]≥ϵPrλkf[δ(k,j)>ϵ]. For any ϵ, we have Eλkf[δ(k,j)α]≤ϵ+Prλkf[δ(k,j)>ϵα−1]supδ, therefore limk→∞Eλkf[δ(k,j)α]=0.
Definition 2
Consider Δ an error space of rank 1 and (D,μ) a distributional decision problem. A Δ-optimal predictor for (D,μ) is a family of polynomial size Boolean circuits {Pk:suppμkcirc−−→[0,1]}k∈N s.t. for any family of polynomial size Boolean circuits {Qk:suppμkcirc−−→[0,1]}k∈N we have
Eμk[(Pk(x)−χD(x))2]≤Eμk[(Qk(x)−χD(x))2]+δ(k)
where δ∈Δ.
In particular, optimal predictors are Δ1neg-optimal predictors and quasi-optimal predictors are Δ10-optimal predictors.
Definition 3
Consider Δ an error space of rank 2 and (D,μ) a distributional decision problem. A Δ-optimal predictor scheme for (D,μ) is a family of Boolean circuits {Pkj:suppμkcirc−−→[0,1]}k,j∈N s.t.
(i) |Pkj|≤p(k,j) for some polynomial p.
(ii) for any family of Boolean circuits {Qkj:suppμkcirc−−→[0,1]}k,j∈N that satisfies (i) we have
Eμk[(Pkj(x)−χD(x))2]≤Eμk[(Qkj(x)−χD(x))2]+δ(k,j)
where δ∈Δ.
It is straightforward to generalize Theorems 1-8 about quasi-optimal predictors to Δ-optimal predictors and Δ-optimal predictor schemes. The versions for optimal predictor schemes are stated in Appendix A without proof. The only theorem whose proof requires a slightly non-obvious tweak is Theorem 1. The amended proof is given in Appendix B.
The main technical novelty of this post is the following existence theorem.
Theorem 1
Consider (D,μ) a distributional decision problem. Define the circuit family P∗D,μ by
P∗kjD,μ:=argmin|Q|≤jEμk[(Q(x)−χD(x))2]
Then, P∗D,μ is a Δ2avg-optimal predictor scheme for (D,μ).
The proof is given in Appendix C.
The definition of P∗ is rather trivial, but the fact Theorems A.1-A.7 apply to P∗ with regard to the error space Δ2avg is non-trivial. For example, this can be applied to the “classical” setting of logical uncertainty by taking D to be the set of true sentences in some formal logic F (e.g. PA, ZFC). For a suitable choice of μ, P∗D,μ will satisfy an approximate version of the coherence conditions because of the considerations here.
Appendix A
Fix Δ an error space of rank 2, h the polynomial from condition (iv).
Theorem A.1
Consider (D,μ) a distributional decision problem and P a Δ-optimal predictor scheme for (D,μ). Suppose {pkj∈[0,1]}k,j∈N, {qkj∈[0,1]}k,j∈N are s.t.
∃ϵ>0∀k,j:μk{x∈{0,1}∗∣pkj≤Pkj(x)≤qkj}≥ϵ
Define
ϕkj:=Eμk[χD(x)−Pkj(x)∣pkj≤Pkj(x)≤qkj]
Then, |ϕ|∈Δ.
Theorem A.2
Consider μ a word ensemble and D1, D2 disjoint languages. Suppose P1 is a Δ-optimal predictor scheme for (D1,μ) and P2 is a Δ-optimal predictor scheme for (D2,μ). Then, P:=η(P1+P2) is a Δ-optimal predictor scheme for (D1∪D2,μ).
Theorem A.3
Consider μ a word ensemble and D1, D2 disjoint languages. Suppose P1 is a Δ-optimal predictor scheme for (D1,μ) and P is a Δ-optimal predictor scheme for (D1∪D2,μ). Then, P2:=η(P−P1) is a Δ-optimal predictor scheme for (D2,μ).
Theorem A.4
Consider (D1,μ1), (D2,μ2) distributional decision problems with respective Δ-optimal predictor schemes P1 and P2. Define {Pkj:suppμkcirc−−→[0,1]}k∈N as the family of circuits computing Pkj((x1,x2)):=Pkj1(x1)Pkj2(x2). Then, P is a Δ-optimal predictor scheme for (D1×D2,μ1×μ2).
Theorem A.5
Consider C,D⊆{0,1}∗ and μ a word ensemble. Assume PD is a Δ-optimal predictor scheme for (D,μ) and PC∣D is a Δ-optimal predictor scheme for (C,μ∣D). Then PDPC∣D is a Δ-optimal predictor scheme for (C∩D,μ).
Theorem A.6
Consider C,D⊆{0,1}∗ and μ a word ensemble. Assume ∃ϵ>0∀k:μk(D)≥ϵ. Assume PD is a Δ-optimal predictor scheme for (D,μ) and PC∩D is a Δ-optimal predictor scheme for (C∩D,μ). Define PC∣D as the circuit family computing
PkjC∣D(x):=⎧⎪ ⎪⎨⎪ ⎪⎩1if PkjD(x)=0η(PkjC∩D(x)PkjD(x))rounded to h(k,j) binary places if PkjD(x)>0
Then, PC∣D is a Δ-optimal predictor scheme for (C,μ∣D).
Definition A.1
Consider μ a word ensemble and {Qkj1,2:suppμkcirc−−→[0,1]}k,j∈N two circuit families. We say Q1 is Δ-similar to Q2 relative to μ (denoted Q1μ≃ΔQ2) when Eμk[(Qk1(x)−Qk2(x))2]∈Δ.
Theorem A.7
Consider (D,μ) a distributional decision problem, P a Δ-optimal predictor scheme for (D,μ) and {Qkj:suppμkcirc−−→[0,1]}k∈N a polynomial size family. Then, Q is a Δ-optimal predictor scheme for (D,μ) if and only if Pμ≃ΔQ.
Appendix B
Lemma B.1
Consider (D,μ) a distributional decision problem and P a corresponding Δ-optimal predictor scheme. Then, there is a function δ:N4→[0,1] s.t.
(i) δ is non-decreasing in the third and fourth arguments.
(ii) For all polynomials p,q:N2→N, we have δ(k,j,p(k,j),q(k,j))∈Δ.
(iii) for all k∈N, Q:suppμkcirc−−→[0,1] and w:suppμkcirc−−→Q≥0 we have
Eμk[w(x)(Pkj(x)−χD(x))2]≤Eμk[w(x)(Q(x)−χD(x))2]+(maxw)δ(k,j,|Q|,|w|)
The proof of Lemma B.1 is completely analogous to the proof of Lemma 2 for quasi-optimal predictors and I omit it.
Proof of Theorem A.1
Define {wkj:suppμkcirc−−→{0,1}}k,j∈N as the circuits computing
wkj(x):=θ(Pkj(x)−pkj)θ(qkj−Pkj(x))
|wkj| is bounded by a polynomial since Pkj produces binary fractions of polynomial size therefore it is possible to compare them to the fixed numbers pkj,qkj using a polynomial size circuit even if the latter have infinite binary expansions.
We have
ϕkj=Eμk[wkj(x)(χD(x)−Pkj(x))]Eμk[wkj(x)]
Define ψkj to be ϕkj truncated to the first significant binary digit. Define {Qkj:suppμkcirc−−→[0,1]}k,j∈N as the circuits computing
Qkj(x):=η(Pkj(x)+ψkj)
Denote I⊆N2 the set of (k,j) for which ϕkj>2−h(k,j). For (k,j)∈I, ψkj has binary notation of polynomially bounded size, therefore |Qkj| is bounded by a polynomial for such (k,j).
Applying Lemma B.1 we get
∀(k,j)∈I:Eμk[wkj(x)(Pkj(x)−χD(x))2]≤Eμk[wkj(x)(Qkj(x)−χD(x))2]+δ(k,j)
for δ∈Δ.
∀(k,j)∈I:Eμk[wkj(x)((Pkj(x)−χD(x))2−(Qkj(x)−χD(x))2)]≤δ(k,j)
∀(k,j)∈I:Eμk[wkj(x)((Pkj(x)−χD(x))2−(η(Pkj(x)+ψkj)−χD(x))2)]≤δ(k,j)
Obviously (η(Pkj(x)+ψkj)−χD(x))2≤(Pkj(x)+ψkj−χD(x))2, therefore
∀(k,j)∈I:Eμk[wkj(x)((Pkj(x)−χD(x))2−(Pkj(x)+ψkj−χD(x))2)]≤δ(k,j)
∀(k,j)∈I:ψkjEμk[wkj(x)(2(χD(x)−Pkj(x))−ψkj)]≤δ(k,j)
The expression on the left hand side is a quadratic polynomial in ψkj which attains its maximum at ϕkj and has roots at 0 and 2ϕkj. ψkj is between 0 and ϕkj, but not closer to 0 than ϕkj2. Therefore, the inequality is preserved if we replace ψkj by ϕkj2.
∀(k,j)∈I:ϕkj2Eμk[wkj(x)(2(χD(x)−Pkj(x))−ϕkj2)]≤δ(k,j)
Substituting the equation for ϕkj we get
∀(k,j)∈I:12Eμk[wkj(x)(χD(x)−Pkj(x))]Eμk[wkj(x)]Eμk[wkj(x)(2(χD(x)−Pkj(x))−12Eμk[wkj(x)(χD(x)−Pkj(x))]Eμk[wkj(x)])]≤δ(k,j)
∀(k,j)∈I:34Eμk[wkj(x)(χD(x)−Pkj(x))]2Eμk[wkj(x)]≤δ(k,j)
∀(k,j)∈I:34Eμk[wkj(x)]ϕ2kj≤δ(k,j)
∀(k,j)∈I:ϕ2kj≤43Eμk[wkj(x)]−1δ(k,j)
∀(k,j)∈I:ϕ2kj≤43μk{x∈{0,1}∗∣pkj≤Pkj(x)≤qkj}−1δ(k,j)
Thus for all k,j∈N we have
|ϕkj|≤2−h(k,j)+√43μk{x∈{0,1}∗∣pkj≤Pkj(x)≤qkj}−1δ(k,j)
In particular, |ϕ|∈Δ.
Appendix C
The following is a proof of Theorem 1.
Define ϵ(k,j) by
ϵ(k,j):=Eμk[(P∗kjD,μ(x)−χD(x))2]=min|Q|≤jEμk[(Q(x)−χD(x))2]
To prove P∗D,μ is Δ2avg-optimal, it enough to consider families Q of the form Qkj=P∗k,q(k,j)D,μ for polynomial q, since
ϵ(k,|Qkj|)≤Eμk[(Qkj(x)−χD(x))2]
That is, we need to prove that for any polynomial q, ϵ(k,j)−ϵ(k,max(q(k,j),j))∈Δ2avg.
Without loss of generality, assume q(k,j)=knjm for m>0. We are interested in the function
δ(k):=f(k)−1∑j=3(loglog(j+1)−loglogj)(ϵ(k,j)−ϵ(k,knjm))loglogf(k)−loglog3
This can be rewritten as
δ(k)=f(k)−1∑j=3(loglog(j+1)−loglogj)(ϵ(k,j)−ϵ(k,jm+nlogklogj))loglogf(k)−loglog3
δ(k)≤f(k)−1∑j=3(loglog(j+1)−loglogj)(ϵ(k,j)−ϵ(k,jm+nlogklog3))loglogf(k)−loglog3
The numerator can be recast as an integral
δ(k)≤f(k)∫x=3(ϵ(k,⌊x⌋)−ϵ(k,⌊x⌋m+nlogklog3))d(loglogx)loglogf(k)−loglog3
δ(k)≤f(k)∫x=3(ϵ(k,⌊x⌋)−ϵ(k,⌊xm+nlogklog3⌋))d(loglogx)loglogf(k)−loglog3
δ(k)≤f(k)∫x=3ϵ(k,⌊x⌋)d(loglogx)−f(k)∫x=3ϵ(k,⌊xm+nlogklog3⌋)d(loglogx)loglogf(k)−loglog3
Raising x to a power is equivalent to adding a constant to loglogx, therefore
δ(k)≤f(k)∫x=3ϵ(k,⌊x⌋)d(loglogx)−f(k)m+nlogklog3∫x=3m+nlogklog3ϵ(k,⌊x⌋)d(loglogx)loglogf(k)−loglog3
δ(k)≤3m+nlogklog3∫x=3ϵ(k,⌊x⌋)d(loglogx)loglogf(k)−loglog3
Since ϵ≤1, we get
δ(k)≤log(m+nlogklog3)loglogf(k)−loglog3
Using the assumption on f, we conclude that limk→∞δ(k)=0, as needed.