Optimal predictor schemes pass a Benford test
We formulate a concept analogous to Garrabrant’s irreducible pattern in the complexity theoretic language underlying optimal predictor schemes and prove a formal relation to Garrabrant’s original definition. We prove that optimal predictor schemes pass the corresponding version of the Benford test (namely, on irreducible patterns they are Δ-similar to a constant).
Results
All the proofs are given in the appendix.
Definition 1
Fix Δ an error space of rank 2. Consider (f,μ) a distributional estimation problem, α∈[0,1]. (f,μ) is called Δ(poly,log)-irreducible with expectation α when for any {0,1}-valued (poly,log)-bischeme ^A=(A,rA,aA) we have
$$\left|\mathrm{E}_{\mu^k \times U^{r_A(k,j)}}\left[(f-\alpha)\hat{A}^{kj}\right]\right| \in \Delta$$
Theorem 1
Fix Δ an error space of rank 2. Assume h:N2→N is a polynomial s.t. h−1∈Δ. Given t∈[0,1], define πkjh(t) to be t rounded within error h(k,j)−1. Consider (f,μ) a distributional estimation problem, α∈[0,1]. Define the (poly,log)-predictor scheme ^Pα,h by ^Pkjα,h(x):=πkjh(α). Then, (f,μ) is Δ(poly,log)-irreducible with expectation α if and only if ^Pα,h is a Δ(poly,log)-optimal predictor scheme for (f,μ).
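As a concrete illustration of the rounding and the constant scheme, here is a minimal Python sketch; the polynomial h below is a hypothetical choice, not one fixed by the theorem, and pi_h / P_alpha_h are illustrative names.

```python
from fractions import Fraction

def pi_h(t, k, j, h):
    # pi^{kj}_h(t): round t in [0,1] to the nearest multiple of 1/h(k,j),
    # so the result is within h(k,j)**-1 of t.
    denom = h(k, j)
    return Fraction(round(t * denom), denom)

# Hypothetical polynomial error bound h(k,j) = (k+2)(j+2), so h**-1 vanishes.
h = lambda k, j: (k + 2) * (j + 2)

def P_alpha_h(x, k, j, alpha):
    # The scheme ^P_{alpha,h} ignores its input x and outputs the rounded
    # constant pi_h(alpha) on every instance.
    return pi_h(alpha, k, j, h)
```

For example, with k = j = 0 the rounding grid has spacing 1/4, so 0.48 rounds to 1/2.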
Corollary 1
Consider Δ1, Δ2 error spaces of rank 2 s.t. Δ1⊆Δ2. Assume there is a polynomial h:N2→N s.t. h−1∈Δ2. Consider (f,μ) a distributional estimation problem, α∈[0,1] and ^P a Δ1(poly,log)-optimal predictor scheme for (f,μ). Suppose (f,μ) is Δ2(poly,log)-irreducible with expectation α. Then, ^Pμ≃Δ2α.
The following definition is adapted from Garrabrant with relatively minor modifications.
Definition 2
Denote U≤k the uniform probability measure on {0,1}≤k. Given r:N→N and A:{0,1}∗2alg−→{0,1}, denote mA,r(k):=∑x∈{0,1}≤kPrUr(|x|)[A(x,y)=1]. Fix t:N→N. f:{0,1}∗→[0,1] is called a t(poly)-irreducible pattern with expectation α∈[0,1] when for any polynomial p:N→N there is cp>0 s.t. for any W:{0,1}∗2alg−→{0,1}, if W runs within time p(t(|x|)) on any input (x,y) with |y|=p(t(|x|)), then
$$\forall k \in \mathbb{N}: m_{W,p(t)}(k) \geq 3 \implies \left|\mathrm{E}_{U^{\leq k} \times U^{p(t(k))}}\left[f(x) \mid W(x, y_{\leq p(t(|x|))}) = 1\right] - \alpha\right| \leq \frac{c_p |W| \sqrt{\log\log m_{W,p(t)}(k)}}{\sqrt{m_{W,p(t)}(k)}}$$
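For intuition, \(m_{A,r}(k)\) is the expected number of strings of length at most \(k\) that the randomized test accepts. A brute-force Python sketch, feasible only for tiny k and r (the acceptor A_even is a hypothetical stand-in, not Garrabrant's provability test):

```python
from itertools import product

def m(A, r, k):
    # m_{A,r}(k): the expected number of strings x of length <= k that the
    # randomized test A accepts, where the coin sequence y has length r(|x|).
    # Computed by exact enumeration over all x and y.
    total = 0.0
    for n in range(k + 1):
        for bits in product("01", repeat=n):
            x = "".join(bits)
            ys = ["".join(y) for y in product("01", repeat=r(n))]
            total += sum(A(x, y) for y in ys) / len(ys)
    return total

# Hypothetical deterministic acceptor: accept exactly the strings of even
# length (it ignores its random input y).
A_even = lambda x, y: 1 if len(x) % 2 == 0 else 0
```

For k = 2 this counts the empty string and the four strings of length 2, giving m = 5.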
Note 1
Definition 2 differs from Garrabrant’s original definition in several respects:
We consider an arbitrary function rather than the characteristic function of the set of provable sentences.
Instead of a single time bound, we consider a family of time bounds differing by a polynomial.
We allow W to be a random algorithm rather than requiring it to be deterministic.
Definition 3
Fix t:N→N. Δ2G,t is the set of bounded functions δ:N2→R≥0 for which there are c,d>0 s.t.
$$\forall k: \forall j \leq t(k): \delta(k,j)^d \leq c\,2^{-k}\log(k+2)(\log(j+2))^2$$
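Membership can be probed numerically on an initial segment of (k, j); a hedged Python sketch, where t and delta are illustrative choices and natural logarithms stand in for the unspecified log base:

```python
import math

def witnesses(delta, t, c, d, k_max):
    # Test whether (c, d) witness delta in Delta^2_{G,t} on the finite range
    # k <= k_max: delta(k,j)**d <= c * 2**-k * log(k+2) * log(j+2)**2 for all
    # j <= t(k). Evidence only, not a proof of membership.
    return all(
        delta(k, j)**d <= c * 2.0**-k * math.log(k + 2) * math.log(j + 2)**2
        for k in range(k_max + 1)
        for j in range(t(k) + 1)
    )

# Illustrative example: delta(k,j) = 2**-k with t(k) = k + 1; the witness
# d = 1 works once c is large enough.
t = lambda k: k + 1
delta = lambda k, j: 2.0**-k
```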
Proposition 1
Δ2G,t is an error space of rank 2.
Theorem 2
Assume t:N→N is s.t. for some polynomial p:N→N we have p(t(k))≥k. Consider f:{0,1}∗→[0,1], α∈[0,1]. Assume f is a t(poly)-irreducible pattern with expectation α. Then, (f,U) is Δ2G,t(poly,log)-irreducible with expectation α.
Definition 4
Denote Φ the set of functions ϕ:N→R≥0 s.t. limk→∞ϕ(k)=∞. Given ϕ∈Φ, define tϕ(k):=⌊2(logk)ϕ(k)⌋. Fix ϕ∈Φ. Δ2avg,ϕ is the set of bounded functions δ:N2→R≥0 s.t. for any ϕ′∈Φ, if ϕ′≤ϕ then
$$\lim_{k \to \infty} \frac{\sum_{j=2}^{t_{\phi'}(k)-1}\left(\log\log(j+1) - \log\log j\right)\delta(k,j)}{\log\log t_{\phi'}(k)} = 0$$
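The weights \(\log\log(j+1)-\log\log j\) telescope, so for a constant \(\delta \equiv c\) the expression equals \(c\,(1 - \log\log 2/\log\log t_{\phi'}(k))\), which tends to \(c\) rather than 0; nonzero constants are thus excluded from \(\Delta^2_{avg,\phi}\), as one would want. A small Python sketch of the inner expression (illustrative only, natural logarithms):

```python
import math

def loglog(x):
    return math.log(math.log(x))

def weighted_avg(delta, k, T):
    # The inner expression of Definition 4: sum over 2 <= j < T of
    # (loglog(j+1) - loglog(j)) * delta(k, j), normalized by loglog(T).
    num = sum((loglog(j + 1) - loglog(j)) * delta(k, j) for j in range(2, T))
    return num / loglog(T)
```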
Proposition 2
Δ2avg,ϕ is an error space of rank 2.
Note that Δ2avg=⋂ϕ∈ΦΔ2avg,ϕ.
Proposition 3
Consider ϕ∈Φ s.t. limk→∞2−k(logk)2ϕ(k)+1=0. Then Δ2G,tϕ⊆Δ2avg,ϕ.
Corollary 2
Fix ϕ∈Φ s.t. limk→∞2−k(logk)2ϕ(k)+1=0. Consider f:{0,1}∗→[0,1], α∈[0,1] and ^P a Δ2avg-optimal predictor scheme for (f,U). Suppose f is a tϕ(poly)-irreducible pattern with expectation α. Then ^PU≃Δ2avg,ϕα.
Note 2
It is possible to repeat this theory without randomness, thus relating the deterministic version of irreducible patterns to Δ(log)-optimal predictor schemes (which have logarithmic advice and no coin flips, or, equivalently, a logarithmic number of coin flips).
Appendix
We will refer to the previously established results about Δ(poly,log)-optimal predictor schemes by L.N where N is the number in the linked post. Thus Theorem 1 there becomes Theorem L.1 here and so on.
Proposition 4
Fix Δ an error space of rank 2. Consider (f,μ) a distributional estimation problem, α∈[0,1]. Suppose (f,μ) is Δ(poly,log)-irreducible with expectation α. Then, there is a Δ-moderate function δ:N4→[0,1] s.t. for any k,j,s∈N and A:{0,1}∗2alg−→{0,1}
$$\left|\mathrm{E}_{\mu^k \times U^s}\left[(f(x)-\alpha)A(x,y)\right]\right| \leq \delta\!\left(k, j, T^\mu_A(k,s), 2^{|A|}\right)$$
Proof of Proposition 4
Take δ to be
$$\delta(k,j,t,u) := \max_{\substack{A:\, T^\mu_A(k,s) \leq t \\ |A| \leq \log u}} \left|\mathrm{E}_{\mu^k \times U^s}\left[(f(x)-\alpha)A(x,y)\right]\right|$$
If δ is not Δ-moderate then there is a polynomial s:N2→N and a family of programs {Akj:{0,1}∗2alg−→{0,1}}k,j∈N s.t. TμAkj(k,s(k,j)) is bounded by a polynomial in k,j and |Akj| is logarithmically bounded in k,j but

$$\left|\mathrm{E}_{\mu^k \times U^{s(k,j)}}\left[(f(x)-\alpha)A^{kj}(x,y)\right]\right| \notin \Delta$$
We can unite the Akj into a single (poly,log)-predictor scheme ^A by providing them as advice for a universal program. This contradicts the assumption on (f,μ).
Proof of Theorem 1
If ^Pα,h is a Δ(poly,log)-optimal predictor scheme for (f,μ) then (f,μ) is Δ(poly,log)-irreducible with expectation α by Lemma L.B.3.
Assume (f,μ) is Δ(poly,log)-irreducible with expectation α. Consider ^Q=(Q,rQ,aQ) any Q∩[−1,+1]-valued (poly,log)-bischeme. By Lemma L.B.3, it is sufficient to prove that
$$\left|\mathrm{E}_{\mu^k \times U^{r_Q(k,j)}}\left[(\hat{P}_{\alpha,h}-f)\hat{Q}\right]\right| \in \Delta$$
We have
$$\left|\mathrm{E}_{\mu^k \times U^{r_Q(k,j)}}\left[(\hat{P}_{\alpha,h}-f)\hat{Q}\right]\right| = \left|\mathrm{E}_{\mu^k \times U^{r_Q(k,j)}}\left[(\pi^{kj}_h(\alpha)-f)\hat{Q}\right]\right|$$

$$\left|\mathrm{E}_{\mu^k \times U^{r_Q(k,j)}}\left[(\hat{P}_{\alpha,h}-f)\hat{Q}\right]\right| \leq \left|\mathrm{E}_{\mu^k \times U^{r_Q(k,j)}}\left[(\alpha-f)\hat{Q}\right]\right| + h(k,j)^{-1}$$

$$\left|\mathrm{E}_{\mu^k \times U^{r_Q(k,j)}}\left[(\hat{P}_{\alpha,h}-f)\hat{Q}\right]\right| \leq \left|\mathrm{E}_{\mu^k \times U^{r_Q(k,j)}}\left[(\alpha-f)\,\pi^{kj}_h(\hat{Q})\right]\right| + 2h(k,j)^{-1}$$

$$\left|\mathrm{E}_{\mu^k \times U^{r_Q(k,j)}}\left[(\hat{P}_{\alpha,h}-f)\hat{Q}\right]\right| \leq \left|\mathrm{E}_{\mu^k \times U^{r_Q(k,j)}}\left[(\alpha-f)\int_0^1 \theta\left(\hat{Q}-\pi^{kj}_h(t)\right)dt\right]\right| + 2h(k,j)^{-1}$$

$$\left|\mathrm{E}_{\mu^k \times U^{r_Q(k,j)}}\left[(\hat{P}_{\alpha,h}-f)\hat{Q}\right]\right| \leq \int_0^1 \left|\mathrm{E}_{\mu^k \times U^{r_Q(k,j)}}\left[(\alpha-f)\,\theta\left(\hat{Q}-\pi^{kj}_h(t)\right)\right]\right| dt + 2h(k,j)^{-1}$$
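The last two steps use a step-function representation of the rounding; its unrounded prototype is the identity \(q = \int_0^1 \theta(q-t)\,dt\) for \(q \in [0,1]\), with \(\theta\) the Heaviside step. A quick numeric sanity check in Python (a sketch, not part of the proof):

```python
def theta(x):
    # Heaviside step: 1 on nonnegative input, else 0.
    return 1.0 if x >= 0 else 0.0

def integral_theta(q, n=100_000):
    # Midpoint Riemann sum for the integral of theta(q - t) over t in [0,1];
    # for q in [0,1] the exact value of the integral is q.
    return sum(theta(q - (i + 0.5) / n) for i in range(n)) / n
```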
Applying Proposition 4 to θ(^Q−πkj(t)), we conclude that there is δ∈Δ s.t.
$$\forall t \in [0,1]: \left|\mathrm{E}_{\mu^k \times U^{r_Q(k,j)}}\left[(\alpha-f)\,\theta\left(\hat{Q}-\pi^{kj}_h(t)\right)\right]\right| \leq \delta(k,j)$$
Summing up
$$\left|\mathrm{E}_{\mu^k \times U^{r_Q(k,j)}}\left[(\hat{P}_{\alpha,h}-f)\hat{Q}\right]\right| \leq \delta(k,j) + 2h(k,j)^{-1}$$
Proof of Corollary 1
Trivially follows from Theorem 1 and Theorem L.A.7.
Proposition 5
Fix t:N→N. Assume δ:N2→R≥0 is bounded and c,d>0 are s.t.
$$\forall k: \forall j \leq t(k): \delta(k,j)^d \leq c\,2^{-k}\log(k+2)(\log(j+2))^2$$
Consider d′≥d. Then, there is c′>0 s.t.
$$\forall k: \forall j \leq t(k): \delta(k,j)^{d'} \leq c'\,2^{-k}\log(k+2)(\log(j+2))^2$$
Proof of Proposition 5
$$\delta^{d'} \leq (\sup\delta)^{d'-d}\,\delta^{d}$$
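The inequality holds pointwise because \(\delta^{d'-d} \leq (\sup\delta)^{d'-d}\) when \(d' \geq d\). A throwaway numeric spot-check in Python (the sample values are arbitrary):

```python
def prop5_holds(values, d, d_prime):
    # Pointwise check of v**d' <= (sup)**(d'-d) * v**d for a bounded delta
    # sampled at finitely many points, assuming d' >= d. The small additive
    # slack guards against floating-point rounding.
    sup = max(values)
    return all(v**d_prime <= sup**(d_prime - d) * v**d + 1e-12
               for v in values)
```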
Proof of Proposition 1
The only part that is not entirely obvious is additivity. Consider δ1,δ2∈Δ2G,t. Suppose d1,c1,d2,c2∈R>0 are s.t.
$$\forall k: \forall j \leq t(k): \delta_1(k,j)^{d_1} \leq c_1\,2^{-k}\log(k+2)(\log(j+2))^2$$

$$\forall k: \forall j \leq t(k): \delta_2(k,j)^{d_2} \leq c_2\,2^{-k}\log(k+2)(\log(j+2))^2$$
For sufficiently large \(d \in \mathbb{N}\), \((\delta_1+\delta_2)^d\) can be written as a sum of terms of the form \(\delta_1^{d_1'}\delta_2^{d_2'}\) with \(d_1' \geq d_1\), \(d_2' \geq d_2\). Applying Proposition 5, we conclude that there is \(c > 0\) s.t.

$$\forall k: \forall j \leq t(k): (\delta_1+\delta_2)^d \leq c\,4^{-k}(\log(k+2))^2(\log(j+2))^4$$

$$\forall k: \forall j \leq t(k): (\delta_1+\delta_2)^{d/2} \leq \sqrt{c}\,2^{-k}\log(k+2)(\log(j+2))^2$$
Proposition 6
For any t:N→N, n∈N, ϵ∈(0,2], M>0
$$\min\left(2^{-k}(\log(k+2))^n(\log(j+2))^{2-\epsilon},\, M\right) \in \Delta^2_{G,t}$$
Proof of Proposition 6
$$\left(2^{-k}(\log(k+2))^n(\log(j+2))^{2-\epsilon}\right)^{\frac{2}{2-\epsilon}} = \left(2^{-k}(\log(k+2))^n\right)^{\frac{2}{2-\epsilon}}(\log(j+2))^2$$

Since \(\frac{2}{2-\epsilon} > 1\), we can choose \(c > 0\) s.t. \(\left(2^{-k}(\log(k+2))^n\right)^{\frac{2}{2-\epsilon}} \leq c\,2^{-k}\). We get

$$\left(2^{-k}(\log(k+2))^n(\log(j+2))^{2-\epsilon}\right)^{\frac{2}{2-\epsilon}} \leq c\,2^{-k}(\log(j+2))^2$$
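The exponent bookkeeping here raises the expression to the power \(\frac{2}{2-\epsilon}\), which restores the exponent 2 on the \(\log(j+2)\) factor. A numeric spot-check in Python (values are arbitrary; natural logarithms):

```python
import math

def restores_square(a, j, eps):
    # (a * log(j+2)**(2-eps)) ** (2/(2-eps)) should coincide with
    # a ** (2/(2-eps)) * log(j+2)**2, as used in the proof of Proposition 6.
    L = math.log(j + 2)
    lhs = (a * L**(2 - eps))**(2 / (2 - eps))
    rhs = a**(2 / (2 - eps)) * L**2
    return abs(lhs - rhs) <= 1e-9 * max(1.0, rhs)
```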
Proof of Theorem 2
Consider ^A=(A,rA,aA) a {0,1}-valued (poly,log)-bischeme. Define {Wkj:{0,1}∗2alg−→{0,1}}k,j∈N by
$$W^{kj}(x,y) = \begin{cases}\hat{A}^{kj}(x, y_{\leq r_A(k,j)}) & \text{if } |x| = k\\ 0 & \text{if } |x| \neq k\end{cases}$$
Choose a polynomial p:N→N s.t. Wkj(x,y) runs within time p(t(|x|)) for j≤t(k), |x|≤k and |y|=p(t(|x|)), and s.t. rA(k,j)≤p(t(k)) for j≤t(k). We have
$$\forall k: \forall j \leq t(k): m_{W^{kj},p(t)}(k) \geq 3 \implies \left|\mathrm{E}_{U^{\leq k} \times U^{p(t(k))}}\left[f(x) \mid W^{kj}(x, y_{\leq p(t(|x|))}) = 1\right] - \alpha\right| \leq \frac{c_p |W^{kj}| \sqrt{\log\log m_{W^{kj},p(t)}(k)}}{\sqrt{m_{W^{kj},p(t)}(k)}}$$

Conditioned on \(W^{kj} = 1\), the string \(x\) has length exactly \(k\) and only the first \(r_A(k,j)\) bits of \(y\) are read; moreover \(|W^{kj}|\) is logarithmically bounded in \(k,j\). Hence there are \(c_1, c_2 > 0\) s.t.

$$\forall k: \forall j \leq t(k): m_{W^{kj},p(t)}(k) \geq 3 \implies \left|\mathrm{E}_{U^k \times U^{r_A(k,j)}}\left[f \mid \hat{A}^{kj} = 1\right] - \alpha\right| \leq \frac{(c_1\log(k+2)+c_2\log(j+2))\sqrt{\log\log m_{W^{kj},p(t)}(k)}}{\sqrt{m_{W^{kj},p(t)}(k)}}$$

Since \(W^{kj}\) vanishes off inputs of length \(k\), we have \(m_{W^{kj},p(t)}(k) = \Pr_{U^k \times U^{r_A(k,j)}}[\hat{A}^{kj}=1]\,2^k \leq 2^{k+1}\), and therefore \(\log\log m_{W^{kj},p(t)}(k) \leq \log(k+1)\). Substituting, and writing the conditional expectation as \(\mathrm{E}[(f-\alpha)\hat{A}^{kj}]/\Pr[\hat{A}^{kj}=1]\):

$$\forall k: \forall j \leq t(k): m_{W^{kj},p(t)}(k) \geq 3 \implies \frac{\left|\mathrm{E}_{U^k \times U^{r_A(k,j)}}\left[(f-\alpha)\hat{A}^{kj}\right]\right|}{\Pr_{U^k \times U^{r_A(k,j)}}[\hat{A}^{kj}=1]} \leq \frac{(c_1\log(k+2)+c_2\log(j+2))\sqrt{\log(k+1)}}{\sqrt{\Pr_{U^k \times U^{r_A(k,j)}}[\hat{A}^{kj}=1]\,2^k}}$$

Multiplying both sides by \(\Pr[\hat{A}^{kj}=1]\), using \(\Pr[\hat{A}^{kj}=1] \leq 1\), and squaring:

$$\forall k: \forall j \leq t(k): m_{W^{kj},p(t)}(k) \geq 3 \implies \mathrm{E}_{U^k \times U^{r_A(k,j)}}\left[(f-\alpha)\hat{A}^{kj}\right]^2 \leq 2^{-k}(c_1\log(k+2)+c_2\log(j+2))^2\log(k+1)$$
On the other hand
$$\forall k: \forall j \leq t(k): m_{W^{kj},p(t)}(k) < 3 \implies \Pr_{U^k \times U^{r_A(k,j)}}[\hat{A}^{kj}=1] < 3 \cdot 2^{-k}$$

Since \(\left|\mathrm{E}[(f-\alpha)\hat{A}^{kj}]\right| \leq \Pr[\hat{A}^{kj}=1] \leq 1\):

$$\forall k: \forall j \leq t(k): m_{W^{kj},p(t)}(k) < 3 \implies \mathrm{E}_{U^k \times U^{r_A(k,j)}}\left[(f-\alpha)\hat{A}^{kj}\right]^2 < 3 \cdot 2^{-k}$$
Combining the two cases
$$\forall k: \forall j \leq t(k): \mathrm{E}_{U^k \times U^{r_A(k,j)}}\left[(f-\alpha)\hat{A}^{kj}\right]^2 \leq 2^{-k}\max\left((c_1\log(k+2)+c_2\log(j+2))^2\log(k+1),\, 3\right)$$
Using Proposition 6, we conclude that
$$\left|\mathrm{E}_{U^k \times U^{r_A(k,j)}}\left[(f-\alpha)\hat{A}^{kj}\right]\right| \in \Delta^2_{G,t}$$
Proof of Proposition 2
Proven exactly the same way as for Δ2avg.
Proof of Proposition 3
Consider δ∈Δ2G,tϕ. Take c,d>0 s.t.
$$\forall k: \forall j \leq t_\phi(k): \delta(k,j)^d \leq c\,2^{-k}\log(k+2)(\log(j+2))^2$$
Given ϕ′∈Φ s.t. ϕ′≤ϕ, we have
$$\frac{\sum_{j=2}^{t_{\phi'}(k)-1}\left(\log\log(j+1)-\log\log j\right)\delta(k,j)^d}{\log\log t_{\phi'}(k)} \leq c\,2^{-k}\log(k+2)\,\frac{\sum_{j=2}^{t_{\phi'}(k)-1}\left(\log\log(j+1)-\log\log j\right)(\log(j+2))^2}{\log\log t_{\phi'}(k)}$$

Bounding each factor \((\log(j+2))^2\) by \((\log(t_{\phi'}(k)+1))^2\) and telescoping the remaining sum (it equals \(\log\log t_{\phi'}(k) - \log\log 2 \leq \log\log t_{\phi'}(k)\)):

$$\frac{\sum_{j=2}^{t_{\phi'}(k)-1}\left(\log\log(j+1)-\log\log j\right)\delta(k,j)^d}{\log\log t_{\phi'}(k)} \leq c\,2^{-k}\log(k+2)\left(\log(t_{\phi'}(k)+1)\right)^2$$

Since \(\log t_{\phi'}(k) \leq (\log k)^{\phi'(k)}\) and \(\phi' \leq \phi\), the right hand side is at most \(c\,2^{-k}(\log k)^{2\phi'(k)+1}\) up to an asymptotically negligible correction, and the assumption on \(\phi\) gives

$$\lim_{k\to\infty}\frac{\sum_{j=2}^{t_{\phi'}(k)-1}\left(\log\log(j+1)-\log\log j\right)\delta(k,j)^d}{\log\log t_{\phi'}(k)} = 0$$
We conclude that δd∈Δ2avg,ϕ and therefore δ∈Δ2avg,ϕ.
Proof of Corollary 2
Follows trivially from Theorem 2, Proposition 3 and Corollary 1.