Minimax forecasting
%&latex
% operators that are separated from the operand by a space
% operators that require brackets
% operators that require parentheses
% Paper specific
This post continues the research programme of attacking the grain of truth problem by departure from the Bayesian paradigm. In the previous post, I suggested using Savage’s minimax regret decision rule, but here I fall back to the simple minimax decision rule. This is because the mathematics is considerably simpler, and minimax should be sufficient to get IUD play in general games and Nash equilibrium in zero-sum two-player games. I hope to build on these results to get analogous results for minimax regret in the future.
We consider “semi-Bayesian” agents following the minimax expected utility decision rule, in oblivious environments with full monitoring (a setting that we will refer to as “forecasting”). This setting is considered in order to avoid the need to enforce exploration, as a preparation for analysis of general environments. We show that such agents satisfy a certain asymptotic optimality theorem. Intuitively, this theorem means that whenever the environment satisfies an incomplete model that is included in the prior, the agent will eventually learn this model i.e. extract at least as much utility as can be guaranteed for this model.
Appendix A contains the proofs of all results. Appendix B contains some theorems from the literature together with a corollary used in Appendix A.
##Notation
Given a topological space, will denote the space of Borel probability measures on . We regard it as a topological space using the weak topology. Given , is defined by . Given , stands for the total variation distance between and . We denote the set of non-empty convex compact subsets of .
Given a finite set, denotes the set of finite strings in alphabet , i.e. . denotes the set of infinite strings in alphabet . Given , is the length of string. Given , is the -th symbol in . Given , is the prefix of of length . Given , the notations , , and mean ” is a proper prefix of ”, ” is a prefix of ” and their negations. is the empty string and .
##Elementary properties of minimax
Fix and compact Polish spaces, representing the agent’s pure policies and representing the pure environments. Let be a continuous utility function.
#Proposition 1
Consider . There exists
Moreover, let be the closure of the convex hull of . Then, the same satisfies
We will say that such a is a minimax policy for .
#Proposition 2
Consider and . Then, there is s.t. any minimax policy for is a minimax policy for .
In particular, Proposition 2 implies that a minimax policy for is the optimal policy for some .
Now we ask what policy is implemented by “subagents” created by a minimax policy. Suppose , where represents the pure policies of the subagent. Assume that there is a Borel set (the condition for the creation of the subagent), a finite set (the actions taken before the creation of the subagent), a Borel measurable mapping and continuous functions and s.t.
Consider and minimax for . Denote the projection mappings. Define by
Define by
If for some we have , then can be arbitrary.
#Proposition 3
It should also be possible to make do without and the associated decomposition of by instead decomposing into and a Markov kernel from to .
##Asymptotic optimality
Fix finite sets (actions) and (observations). We now assume that and with the product topology. We fix a time discount function satisfying and a bounded reward function. Given and , we define by
The utility function is then given by
We will need a notation for a combination of two policies where the agent switches from one to another when observing some . Given and , we define by
Consider any and . We define as follows. For any Borel measurable :
It is easy to see the above indeed defines a probability measure.
Note that changing the policy after a certain event cannot change utility more than times the probability of the event times the corresponding time discount factor. That is, we have the following:
#Proposition 4
We are now ready to formulate the optimality theorem. Consider and . Denote . We think of as the prior, as one of the models in the prior (assigned probability ) and as the convex combination of all other models. Let be a minimax policy for . Define by
That is, is the expected utility that can be guaranteed by changing the policy following , assuming that the true environment is in . We claim that if the true environment is in , then after sufficient time will achieve at least as much utility as can be guaranteed for (guaranteed under the constraint of following until the point of making the current observations). That is, can only be greater than by a quantity that goes to 0 with probability 1, even after normalizing according to Proposition~5:
#Theorem 1
##Appendix A
We will repeatedly use the standard fact that, given a compact Polish space, is also a compact Polish space (it is metrizable by the Levy-Prokhorov metric, compact as a consequence of the Banach-Alaoglu theorem and the fact that any probability measure on a Polish space is Radon, and separability is also not hard to prove).
#Proposition A.1
Consider compact Polish spaces, continuous, and a sequence . Define by and by . Then, converges to uniformly.
#Proof of Proposition A.1
Assume to the contrary that there is and a subsequence of s.t.
This implies we can choose s.t. . Let be an increasing sequence s.t. for some . We get , which is a contradiction because and .
#Proposition A.2
Consider compact Polish spaces and continuous. Define by
Then, is continuous.
#Proof of Proposition A.2
Consider any , and sequences , . We have
On the right hand side, the first term goes to 0 by Proposition A.1 and the second term goes to 0 by definition of the weak* topology.
#Proposition A.3
Consider any compact Polish and continuous. Define by
Then, is continuous.
#Proof of Proposition A.3
Follows by applying Fubini’s theorem and using Proposition A.2 twice.
#Proposition A.4
Consider any compact Polish, and continuous. Define by . Then, is continuous.
#Proof of Proposition A.4
Consider any and a sequence
By Proposition A.1, the right hand converges to 0, therefore the left hand side also converges to 0.
In the following, we will use the notation , defined by . By Proposition A.3, is continuous.
#Proof of Proposition 1
Define by . By Proposition A.4, is continuous and therefore, exists.
is continuous and affine in the 2nd argument (i.e. it sends convex linear combinations to convex linear combinations), and therefore , implying the second part of the proposition.
#Proof of Proposition 2
Define by . Denote . By Proposition A.4, is continuous and therefore, there exists
Choose , s.t. .
and are compact convex sets in the spaces of finite signed Borel measures on and respectively. Therefore, we can apply the minimax theorem to get
Denote the above number . We have
On the other hand
Combining the inequalities, we get the desired result.
#Definition A.1
In the setting of Proposition 3, consider any and . We define as follows. For any Borel measurable :
It is easy to see the above indeed defines a probability measure.
#Proposition A.5
Consider any and and Borel measurable and bounded. Then:
#Proof of Proposition A.5
#Proposition A.6
Given , and
#Proof of Proposition A.6
Applying Proposition A.5:
#Proposition A.7
Consider any . Define and by
If for some we have , then can be arbitrary.
Then,
#Proof of Proposition A.7
Applying Proposition A.6, we get the desired result.
#Proof of Proposition 3
Consider any . By definition of
Applying Proposition A.7 to the right hand side
Applying Proposition A.6 to both sides, we get the desired result.
Given any , we denote and .
#Proof of Proposition 4
We have
Denote , . We have
The desired result now follows from .
#Definition A.2
A splitting of is s.t. , is Borel, is a finite set, is Borel measurable, and are continuous and
#Proposition A.8
Consider , and , and denote . Suppose that is a splitting of . Assume . Fix . Let satisfy
Denote . Then, for any s.t.
#Proof of Proposition A.8
Let satisfy
Denote . Consider any and denote . We have
Applying Proposition A.6 to the right hand side
Also, we have
Applying Proposition A.6 again
Combining the inequalities
The above inequality holds for any , therefore
By definition of , it follows that
Denote and . We have
Applying Proposition A.6 to the right hand side
On the other hand, Proposition A.6 implies that
Substituting in the inequality, we get
Observing that we can combine the inequalities, we get
If , we get , as desired. If , we divide both sides by and get
#Proposition A.9
In the setting of Proposition A.8, assuming but sans the assumption :
#Proof of Proposition A.9
We use the same notation as in the proof of Proposition A.8. As before, we have
Combining, we get
#Definition A.3
Denote , . Given any , the splitting of at is defined as follows.
Given and , define by . Given , define by . Define by
Given , and , define by
Define by
It is easy to see that this is indeed a splitting.
#Proof of Theorem 1
Fix . Let be as in Proposition 2. Define by
By Corollary B, .
Consider any . By definition of , for any we have and . Therefore, we can apply Proposition A.8 to the splitting of at , with . This gives us
Here, .
By definition of , this implies
Now, consider any . By definition of , for any we have . Therefore, we can apply Proposition A.9 to the splitting of at , with . This gives us
By definition of , this also implies
##Appendix B
The following theorem is adapted from Blackwell and Dubbins (1962).
#Theorem B.1
Consider any . Assume that is absolutely continuous with respect to . Then:
#Theorem B.2
Consider any . Define by
Define . Then, for any Borel :
#Proof of Theorem B.2
This is almost a special case of Theorem 5.3.3 in Durret (2010), except that we don’t assume is locally absolutely continuous w.r.t. (i.e. we don’t assume that whenever ). Careful reading of the proof shows that in our case it doesn’t matter: ( is Durret’s notation) is no longer a -martingale but is still a -supermartingale, and the identity still holds if the fraction is understood to stand for whenever .
#Corollary B
Consider any . Define
Then, .
#Proof of Corollary B
Assume that . Then, we can define by
Obviously, is absolutely continuous w.r.t. , therefore we can use Theorem B.1 to conclude
By Theorem B.2, is dominated by and in particular is absolutely continuous w.r.t. . Therefore, again by Theorem B.1:
Using the triangle inequality for , we get
By definition of and , this means that
This holds even without the assumption , since if then both sides vanish.
is -supermatingale, therefore (by Doob’s convergence theorem) its limit exists -almost surely. If the limit exists for some , then it vanishes iff . It follows that
By Theorem B.2
Applying Theorem B.2 to the right hand side:
- The Learning-Theoretic AI Alignment Research Agenda by 4 Jul 2018 9:53 UTC; 92 points) (
- A measure-theoretic generalization of logical induction by 18 Jan 2017 13:56 UTC; 6 points) (
- 1 Sep 2020 12:52 UTC; 5 points) 's comment on Introduction To The Infra-Bayesianism Sequence by (
- Towards learning incomplete models using inner prediction markets by 8 Jan 2017 13:37 UTC; 3 points) (
- Subagent perfect minimax by 6 Jan 2017 13:47 UTC; 1 point) (
- Minimax and dynamic (in)consistency by 11 Dec 2016 10:42 UTC; 0 points) (
Thanks! :-)
Would you mind putting the break *** closer to the top of the post? Cheers!