Reflection with optimal predictors

A change in terminology: It is convenient when important concepts have short names. The concept of an “optimal predictor scheme” seems much more important than its historical predecessor, the “optimal predictor”. Therefore, “optimal predictor schemes” will henceforth be called simply “optimal predictors”, while the previous concept of “optimal predictor” might be called a “flat optimal predictor”.

We study systems of computations which have access to optimal predictors for each other. We expect such systems to play an important role in decision theory (where self-prediction is required to define logical counterfactuals and mutual prediction is required for a collection of agents in a game) and Vingean reflection (where the different computations correspond to different successor agents). The previously known existence theorems for optimal predictors are not directly applicable to this case. To overcome this we prove new, specifically tailored existence theorems.

The Results section states the main novelties, Appendix A contains adaptations of old theorems, Appendix B proves selected claims from Appendix A, and Appendix C proves the novel results.

Results

Notation

Given sets $X$ and $Y$, $Y^X$ will denote the set of mappings from $X$ to $Y$.


Before taking on reflection, we introduce a stronger concept of optimal predictor, to which the previous existence theorems still apply.

Definition 1

Let be a positive integer. A proto-error space of rank is a set of bounded functions from to s.t.

(i) If then .

(ii) If and then .

(iii) There is a polynomial s.t. .

Proposition 1

If is a proto-error space of rank and , then is also a proto-error space of rank .

Proposition 2

If is a proto-error space of rank , and then .

Definition 3

Fix a proto-error space of rank 2 and a distributional estimation problem. Consider a -predictor.

is called an -optimal predictor for when for any -predictor , there is s.t.

is called an -optimal predictor for when there is s.t. is an -optimal predictor for .


-optimal predictors have properties closely parallel to those of -optimal predictors. The corresponding theorems are listed in Appendix A. Most are given without proof, as the proofs are closely analogous to the originals; the exception is Theorem A.4, which is proved in Appendix B.

We now consider a generalization in which the advice is allowed to be random.

Definition 4

Given appropriate sets and , consider , a polynomial and a family of probability measures. The triple is called a -bischeme of signature when

(i) The runtime of is bounded by with polynomial.

(ii) for some .

We will use the notation to stand for the obvious -valued -random variable and the notation to stand for the obvious -valued -random variable.

A -bischeme of signature will also be called a -valued -bischeme.

A -valued -bischeme will also be called a -predictor.

Note 1

Conceptually, advice corresponds to precomputation, which might be expensive or even “hyperprecomputation”. Random advice corresponds to random precomputation. Random logarithmic advice can always be replaced by deterministic polynomial advice at the cost of a small rounding error.
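To make the rounding argument concrete, here is a minimal Python sketch (the function names and the toy distribution are our own illustration, not part of the formalism): a distribution over O(log n)-bit advice strings has only polynomially many entries, so a table of probabilities rounded to polynomially many binary digits serves as deterministic polynomial advice, which the predictor then samples from with its own coin flips.

```python
import random

def round_distribution(dist, precision_bits):
    """Round each probability in `dist` to `precision_bits` binary digits.

    If the advice strings are O(log n) bits long, there are only poly(n)
    of them, so the rounded table is deterministic advice of polynomial
    size; the total rounding error is at most len(dist) * 2**-precision_bits.
    """
    scale = 2 ** precision_bits
    table = {a: round(p * scale) / scale for a, p in dist.items()}
    total = sum(table.values())
    # Renormalize so the rounded table is again a probability distribution.
    return {a: p / total for a, p in table.items()}

def sample_advice(table, rng=random):
    """Sample an advice string from the table using the machine's own coins."""
    r, acc = rng.random(), 0.0
    for a, p in table.items():
        acc += p
        if r <= acc:
            return a
    return a  # guard against floating-point underflow

# Toy usage: three 2-bit advice strings whose probabilities are not
# exact binary fractions.
dist = {"00": 1/3, "01": 1/3, "10": 1/3}
table = round_distribution(dist, precision_bits=10)
print(sample_advice(table))
```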

Definition 5

Fix a proto-error space of rank 2 and a distributional estimation problem. Consider a -predictor. is called an -optimal predictor for when for any -predictor , there is s.t.


The concept of an -optimal predictor is essentially of the same strength as that of an -optimal predictor, as seen in the following two Propositions.

Proposition 3

Fix a proto-error space of rank 2 and a distributional estimation problem. Consider an -optimal predictor. Then it defines a -optimal predictor by setting .

Proposition 4

Fix a proto-error space of rank 2 and a distributional estimation problem. If admits a -optimal predictor then it admits a -optimal predictor.


We are now ready to introduce the key abstraction.

Definition 6

Given a set , denote equipped with the product topology.

A reflective system is a triple where is a set, is a collection of probability measures, where we regard each as a word ensemble, and is a collection of continuous functions (here has the discrete topology or, equivalently, we only require continuity in the second variable).


The motivation behind this definition is to regard as the space of possible predictor programs, where the first factor of is the program itself (including advice) and the second factor is the number of intrinsic coin flips. Thus the represent a system of computations in which each has access to the source code of predictors for the entire system.
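As a loose illustration of this picture (a toy of our own, not part of the formalism), consider a system in which every program receives the source code of all the predictors together with its own input, so that mutual prediction and self-prediction are both expressible:

```python
# Python strings stand in for predictor programs; every program is
# handed the dictionary of all sources, mirroring the way each
# computation in the system has access to the predictors' source code.

predictors = {
    "copy": "lambda sources, x: x",
    "flip": "lambda sources, x: 1.0 - eval(sources['copy'])(sources, x)",
}

def run(name, x):
    """Run predictor `name` on input x, giving it the whole system's code."""
    return eval(predictors[name])(predictors, x)

print(run("copy", 0.25))  # prints 0.25
print(run("flip", 0.25))  # prints 0.75: "flip" reconstructs "copy" from source
```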

Definition 7

Consider a reflective system and a collection of -predictors .

Denote equipped with the product -algebra. Denote , a probability measure on .

Given , and , define to be the program that computes given input . Given , define by .

Given , is defined as follows

The expectation value is well-defined thanks to the continuity of .

Definition 8

Fix a proto-error space of rank 2 and a reflective system. A collection of -predictors is called an -optimal predictor system for when for any , is a -optimal predictor for .

Construction 1

Given , denote the set of bounded functions s.t.

Given , denote the set of bounded functions s.t.

Denote

Proposition 5

If is s.t. , then is a proto-error space.

For any , is a proto-error space.

is a proto-error space.

Note 2

-optimality is strictly stronger than -optimality. We exploit this strength in Theorem 2 below, which is the main motivation for introducing -optimality at this point.

Theorem 1 (general existence theorem)

Any reflective system has a -optimal predictor system.


We also prove a more restricted existence theorem which allows deterministic advice.

Definition 9

Consider a set and a collection of probability measures. Denote . Denote .

Fix a set and a collection . A -reflective system is a pair where is as above and is a collection of functions s.t. there are collections , , and probability measures satisfying

Note 3

The condition on is a kind of Hölder condition, uniform over .
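For reference, a Hölder condition with constant $M$ and exponent $\alpha > 0$, uniform over a parameter $t$, has the general shape (the symbols here are generic and not tied to the notation above):

$$\forall t, x, y: \; |f_t(x) - f_t(y)| \leq M \, d(x, y)^{\alpha}$$

where $M$ and $\alpha$ do not depend on $t$.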

Proposition 6

Fix a set and a collection . For any -reflective system and , is continuous with respect to the product topology on .

Definition 10

Fix a set and a collection .

Given , , we let stand for the evaluation of program on inputs without a time limit (we assume that on the output tape the machine head moves right iff it produces a symbol in and cannot be moved left). We also extend to be defined on in the obvious way.

We define by .

Consider a -reflective system and a collection of -predictors . Given , is defined as follows

The expectation value is well-defined due to Proposition 6.
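The output-tape convention in Definition 10 ensures that partial outputs are never revised, so the result of an unbounded run is the monotone limit of its finite prefixes. A toy Python model of this convention (our own names, purely illustrative):

```python
from itertools import count, islice

def machine():
    """A non-halting machine, emitting one write-once output symbol per step."""
    for i in count():
        yield str(i % 2)

def ev_prefix(m, steps):
    """The first `steps` symbols of m's (possibly infinite) output."""
    return "".join(islice(m(), steps))

# Prefixes are monotone: a longer run never changes earlier symbols.
assert ev_prefix(machine, 8)[:4] == ev_prefix(machine, 4)
```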

Definition 11

Fix a set , a collection of proto-error spaces of rank 2, a collection and a -reflective system. A collection of -predictors is called an -optimal predictor system for when for any there is s.t. is an -optimal predictor for .

Theorem 2

Consider a finite set and a collection . Any -reflective system has an -optimal predictor system.

Appendix A

Definition A.1

, a proto-error space of rank , is called ample when there is a polynomial s.t. .


Fix , a proto-error space of rank 2.

Theorem A.1

Assume is ample. Consider a distributional estimation problem, an -optimal predictor for and , . Define by

Define

Assume that either have a number of digits logarithmically bounded in or produces outputs with a number of digits logarithmically bounded in . Then, .

Theorem A.2

Consider a word ensemble and s.t. . Suppose is an -optimal predictor for and is an -optimal predictor for . Then, is an -optimal predictor for .

Theorem A.3

Consider a word ensemble and s.t. . Suppose is an -optimal predictor for and is an -optimal predictor for . Then, is an -optimal predictor for .

Definition A.2

Given a proto-error space of rank , the associated error space is .

Theorem A.4

Consider , distributional estimation problems with respective -optimal predictors and . Assume is -sampleable and is -generatable. Then, is an -optimal predictor for .

Theorem A.5

Consider a word ensemble, and . Assume is an -optimal predictor for and is an -optimal predictor for . Then is an -optimal predictor for .

Definition A.3

We define the stabilizer of , denoted , to be the set of functions s.t. for any we have .

Theorem A.6

Fix a polynomial s.t. . Consider a word ensemble, and . Assume . Assume is an -optimal predictor for and is an -optimal predictor for . Define by

Then, is an -optimal predictor for .

Definition A.4

Consider a word ensemble, , -predictors.

We say is -similar to relative to (denoted ) when .

Let be an error space.

We say is -similar to relative to (denoted ) when .

Theorem A.7 (uniqueness theorem)

Consider a distributional estimation problem, an -optimal predictor for and an -predictor.

If is a -optimal predictor for then .

Conversely, if then is a -optimal predictor for .

Definition A.5

is called stable when for any non-constant polynomial there is s.t. for any , the function is in .

Proposition A.1

is stable.

Theorem A.8

Assume is ample and stable. Consider , distributional estimation problems, a -pseudo-invertible reduction of to and an -optimal predictor for . Define by . Then, is an -optimal predictor for .

Theorem A.9

Assume is ample and stable. Consider , distributional estimation problems, a -pseudo-invertible weak reduction of to and an -optimal predictor for . Choose a polynomial with and define by

Here, the ellipses signify that each term corresponds to an independent sampling of . Then, is an -optimal predictor for .

Theorem A.10

Consider a distributional estimation problem, and a weak -generator for . Then, is an -optimal predictor for .

Appendix B

Definition B.1

Given , a function is called -moderate when

(i) is non-decreasing in arguments to .

(ii) For any collection of polynomials ,


Lemmas B.1 and B.2 below are given only for future reference (and as an aid in spelling out the proofs of other Theorems in Appendix A).

Lemma B.1

Fix a distributional estimation problem and a -predictor. Then, is -optimal iff there is a -moderate function s.t. for any ,

Proof of Lemma B.1

Define

Lemma B.2

Assume is ample. Fix a distributional estimation problem and a corresponding -optimal predictor. Consider a -predictor, , a -valued -bischeme. Assume and are s.t. and . Then there is s.t.

Proof of Lemma B.2

Suppose is a polynomial s.t. . Given , define to be rounded within error . Thus, the number of digits in is logarithmic in and . Consider the -predictor defined by

satisfies bounds on runtime and advice size uniform in . Therefore, Lemma B.1 implies that there is s.t.

Lemma B.3 (orthogonality lemma)

Consider a distributional estimation problem and an -optimal predictor for . Then there are and an -moderate function s.t. for any ,

Conversely, consider and a -valued -bischeme. Suppose that for any -valued -bischeme we have .

Define to be s.t. computing is equivalent to computing rounded to digits after the binary point, where . Then, is an -optimal predictor for .
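The rounding used here (and in the proof of Lemma B.2 below) is rounding to the nearest multiple of $2^{-k}$ for an appropriate number of digits $k$; as a short Python sketch with names of our own choosing:

```python
def round_binary(x, digits):
    """Round x to `digits` binary digits after the binary point."""
    scale = 2 ** digits
    return round(x * scale) / scale

assert round_binary(1 / 3, 4) == 5 / 16  # 0.0101 in binary
```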


We omit the proofs of Lemma B.3 and Lemma B.4 below, since they are closely analogous to earlier proofs.

Lemma B.4

Consider a distributional estimation problem, , -predictors. Suppose a polynomial, and are s.t.

Then s.t.

Proof of Theorem A.4

Denote . We have

Therefore, for any -valued -bischeme

By Lemma B.3, it is sufficient to show an appropriate bound for each of the terms on the right hand side. Suppose is a -generator for . For the first term, we have

where doesn’t depend on . Applying Lemma B.3 for , we get

where for some that doesn’t depend on .

Suppose is a -sampler for . For the second term, we have

where doesn’t depend on . Applying Lemma B.3 for , we get

where for some that doesn’t depend on . Again, we obtain the required bound.

Proof of Proposition A.1

Consider a non-constant polynomial and . Define . To get the desired condition for and , consider any s.t. for sufficiently large we have . For any we have

In particular

Appendix C

Proof of Proposition 1

To check condition (i), consider .

If , hence and .

If , hence and .

Conditions (ii) and (iii) are obvious.

Proof of Proposition 2

Consider . We need to show that i.e. that . But hence .

Proof of Proposition 3

Follows immediately from Lemma B.1.

Proof of Proposition 4

Suppose is a -optimal predictor. Set . Replacing by we get the desired -optimal predictor.

Proposition C.1

For any ,

In particular, this implies so is ample.

Proof of Proposition C.1

Denote . Consider , . We have

Proof of Proposition 5

The only part that is not entirely obvious is condition (iii) for , which follows from Proposition C.1 (since ).

Construction C.1

Fix a reflective system.

For any , denote the set of the first words in lexicographic order. Denote the space of probability distributions on . Denote . Denote . Let be the locally convex topological vector space , where the topology is the product topology.

is a compact (by Tychonoff’s theorem) convex subset of . Each can be regarded as a probability measure on .
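For reference, the standard facts being used, in generic notation: for a finite set $S$,

$$\Delta(S) := \left\{ p \in [0,1]^S \,\middle|\, \sum_{s \in S} p(s) = 1 \right\}$$

is a compact convex subset of $\mathbb{R}^S$, and any product $\prod_i \Delta(S_i)$ is convex and, by Tychonoff's theorem, compact in the product topology.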

Given , we define by .

Given and , define as follows

Define by

Proposition C.2

is a Kakutani map.

Proof of Proposition C.2

is continuous in the second argument and is compact for any . Therefore, for any , is non-empty by the extreme value theorem. It is compact by Tychonoff’s theorem and it is obviously convex.

Given finite, define by

is continuous since is continuous. Regarding the s as a net ordered by set inclusion, uniformly converges to ; therefore is continuous.

Given , , define by

is closed since is continuous. Therefore is also closed.

Proof of Theorem 1

Using Proposition C.2, we apply the Kakutani-Glicksberg-Fan theorem to get . Define .
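For reference, the fixed point theorem invoked here can be stated as follows (standard formulation, with generic symbols not tied to the notation above): if $K$ is a nonempty compact convex subset of a locally convex Hausdorff topological vector space and $F: K \to 2^K$ is upper semicontinuous with nonempty compact convex values (i.e. a Kakutani map), then $F$ has a fixed point:

$$\exists x^* \in K: \; x^* \in F(x^*)$$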

Consider , a -predictor. Choose a polynomial and s.t. and

By definition of , we have

Here we implicitly used the natural injection .

Denoting , the left hand side satisfies

Similarly, for the right hand side we have

Combining the two together, we get

Applying Lemma B.4 we conclude that there is s.t.

Proof of Proposition 6

We need to show that for every , , and , there is a finite set and s.t. for every with we have .

Choose s.t. , finite and finite. Define . We get

By choosing with sufficiently small and sufficiently small, we get the desired condition.

Proposition C.3

Fix a finite set and a collection . Consider a -reflective system and two collections of -predictors and . Assume that for some , . Then

Proof of Proposition C.3

Using the similarity of and , there are s.t.

hence . This implies and

Definition C.1

Fix a set and a collection . Given a -reflective system, the associated reflective system is defined by

is continuous thanks to Proposition 6 since is continuous.

Proof of Theorem 2

Fix a finite set and a collection . Consider a -reflective system. By Theorem 1, there is an -optimal predictor system for . For each we can use Proposition 4 to choose , an -optimal predictor for . By Theorem A.7, we have . By Proposition C.3, this implies . This means is an -optimal predictor for and is an -optimal predictor system for .