The idea is an elaboration of a comment I made previously.
Motivation: Improving the theoretical understanding of AGI by facilitating synthesis between algorithmic information theory and statistical learning theory.
Topic: Fix some reasonable encoding of communicating MDPs, and use this encoding to define $\zeta_{CMDP}$: the Solomonoff-type prior over communicating MDPs. That is, the probability of a communicating MDP $H$ is proportional to $2^{-K(H)}$, where $K(H)$ is the length of the shortest program producing the encoding of $H$.
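Since $K$ is uncomputable, any concrete illustration has to substitute a proxy. The toy Python sketch below uses the raw encoding length of a few invented MDP encodings in place of $K(H)$ and normalizes over a finite hypothesis set; all names and bit strings are hypothetical, chosen only to show the $2^{-\text{length}}$ shape of the prior:

```python
# Toy sketch of a Solomonoff-style prior over a finite hypothesis set.
# K(H) is uncomputable, so we substitute the length of a fixed encoding
# enc(H) as a crude proxy; the hypotheses and encodings are invented.

enc = {  # hypothetical bit-string encodings of some communicating MDPs
    "2-state chain": "0110",
    "3-state cycle": "011010",
    "4-state grid":  "01101001",
}

weights = {H: 2.0 ** -len(code) for H, code in enc.items()}
Z = sum(weights.values())                       # normalizing constant
prior = {H: w / Z for H, w in weights.items()}  # zeta_CMDP-style prior

# Shorter encodings get exponentially more prior mass.
for H, p in sorted(prior.items(), key=lambda kv: -kv[1]):
    print(f"{H}: {p:.4f}")
```

Note that each extra encoding bit halves the weight, which is the "falls off very slowly" behavior only in the sense that infinitely many hypotheses share any given complexity level in the real construction.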
Consider CMDP-AIXI: the Bayes-optimal agent for $\zeta_{CMDP}$. Morally speaking, we would like to prove that CMDP-AIXI (or any other policy) has a frequentist (i.e., per-hypothesis-$H$) non-anytime regret bound of the form $O(n^{\alpha} \zeta_{CMDP}(H)^{-\beta} \tau(H)^{\gamma})$, where $n$ is the time horizon[1], $\tau(H)$ is a parameter such as MDP diameter, bias span or mixing time, $\alpha \in (0,1)$, and $\beta, \gamma \in (0,\infty)$ (this time $\gamma$ is just a constant, not a time discount). However, this precise result is probably impossible, because the Solomonoff prior falls off very slowly. Warm-up: Prove this!
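For concreteness, the regret appearing in this bound can be written out in the standard finite-horizon form (matching the footnote's choice of unnormalized regret with a step-function time discount; this spelled-out definition is my reading, not stated in the original comment):

```latex
\mathrm{Reg}_n(\pi, H) \;:=\; \max_{\pi^*} \, \mathbb{E}^{H, \pi^*}\!\left[\sum_{t=1}^{n} r_t\right] \;-\; \mathbb{E}^{H, \pi}\!\left[\sum_{t=1}^{n} r_t\right]
```

Here $\mathbb{E}^{H,\pi}$ denotes expectation over the interaction of policy $\pi$ with environment $H$, and $r_t$ is the reward at time $t$.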
Next, we need the concept of a “sophisticated core”, inspired by algorithmic statistics. Given a bit string $x$, we consider the Kolmogorov complexity $K(x)$ of $x$. Then we consider pairs $(Q, y)$ where $Q$ is a program that halts on all inputs, $y$ is a bit string, $Q(y) = x$, and $|Q| + |y| \leq K(x) + O(1)$. Finally, we minimize over $|Q|$. The minimal $|Q|$ is called the sophistication of $x$. For our problem, we are interested in the minimal $Q$ itself: I call it the “sophisticated core” of $x$ and denote it $SC(x)$.
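The definition can be illustrated with a deliberately finite stand-in for the universal machine. True $K$ and sophistication are uncomputable, so the program table, the assigned code lengths, and the slack constant below are all invented purely for illustration:

```python
# Toy illustration of "sophistication" and the "sophisticated core" SC(x).
# We replace the universal machine by a hypothetical finite table of
# total programs, each with an assigned code length |Q|.

from itertools import product

# Each "program" maps an input bit string to an output bit string (total).
PROGRAMS = {
    "id":      (2, lambda y: y),             # |Q| = 2, identity
    "zeros":   (3, lambda y: "0" * len(y)),  # |Q| = 3, all-zeros output
    "repeat2": (4, lambda y: y + y),         # |Q| = 4, doubling
}

def inputs_up_to(n):
    """All bit strings of length <= n (including the empty string)."""
    for k in range(n + 1):
        for bits in product("01", repeat=k):
            yield "".join(bits)

def K(x, max_input=8):
    """Toy Kolmogorov complexity: min |Q| + |y| over pairs with Q(y) = x."""
    best = None
    for name, (length, f) in PROGRAMS.items():
        for y in inputs_up_to(max_input):
            if f(y) == x:
                cost = length + len(y)
                best = cost if best is None else min(best, cost)
    return best

def sophisticated_core(x, slack=0, max_input=8):
    """Min-|Q| program Q with Q(y) = x and |Q| + |y| <= K(x) + slack."""
    k = K(x, max_input)
    best = None
    for name, (length, f) in PROGRAMS.items():
        for y in inputs_up_to(max_input):
            if f(y) == x and length + len(y) <= k + slack:
                if best is None or length < best[1]:
                    best = (name, length)
    return best

print(K("0101"))                  # -> 6 (e.g. id on "0101", or repeat2 on "01")
print(sophisticated_core("0101")) # -> ('id', 2)
```

In the real definition the $O(1)$ slack matters a great deal (here it is just the `slack` parameter), and sophistication is sensitive to how it is chosen.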
To any halting program $Q$ we can associate the environment $\mu_Q := \mathbb{E}_{H \sim \zeta_{CMDP}}[H \mid SC(H) = Q]$. We also define the prior $\xi$ by $\xi(Q) := \Pr_{H \sim \zeta_{CMDP}}[SC(H) = Q]$. $\zeta_{CMDP}$ and $\xi$ are “equivalent” in the sense that $\mathbb{E}_{Q \sim \xi}[\mu_Q] = \mathbb{E}_{H \sim \zeta_{CMDP}}[H]$. However, they are not equivalent for the purpose of regret bounds.
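This coarse-graining of the prior by the core can be sketched on a finite toy example; the hypothesis weights and core labels below are hypothetical stand-ins for $\zeta_{CMDP}$ and $SC$:

```python
# Toy illustration of coarse-graining a prior by sophisticated core.
# zeta[H] stands in for zeta_CMDP and core[H] for SC(H); both tables
# are invented for illustration.

from collections import defaultdict

zeta = {"H1": 0.5, "H2": 0.25, "H3": 0.125, "H4": 0.125}  # prior over hypotheses
core = {"H1": "Q1", "H2": "Q1", "H3": "Q2", "H4": "Q2"}   # SC(H) labels

# xi(Q) = Pr_{H ~ zeta}[SC(H) = Q]
xi = defaultdict(float)
for H, p in zeta.items():
    xi[core[H]] += p

# mu_Q = E_{H ~ zeta}[H | SC(H) = Q]: conditional weights over hypotheses
mu = {Q: {H: zeta[H] / xi[Q] for H in zeta if core[H] == Q} for Q in xi}

print(dict(xi))   # -> {'Q1': 0.75, 'Q2': 0.25}
print(mu["Q2"])   # -> {'H3': 0.5, 'H4': 0.5}
```

The mixture identity $\mathbb{E}_{Q \sim \xi}[\mu_Q] = \mathbb{E}_{H \sim \zeta}[H]$ holds here by construction: multiplying the conditional weights back by $\xi(Q)$ recovers `zeta` exactly, yet $\xi(Q)$ can be much larger than any individual $\zeta(H)$, which is the point for regret bounds.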
Challenge: Investigate the conjecture that there is an ($n$-dependent) policy satisfying the regret bound $O(n^{\alpha} \xi(Q)^{-\beta} \mathbb{E}[\tau(\mu_Q)]^{\gamma})$ for every $\mu_Q$, or something similar.
Strategy: See the theoretical research part of my other answer.
I am using unnormalized regret and step-function time discount here to make the notation more standard, even though usually I prefer normalized regret and geometric time discount.