A major impediment in applying RL theory to any realistic scenario is that even the control problem[1] is, in general, intractable when the state space is exponentially large. Real-life agents probably overcome this problem by exploiting some special properties of real-life environments. Here are two strong candidates for such properties:
In real life, processes can often be modeled as composed of independent, co-existing parts. For example, if I need to decide on my exercise routine for the next month and also on my research goals for the next month, the two can be optimized more or less independently.
In real life, planning can often be decomposed across timescales, such that you don’t need to make short-timescale plans for steps that only happen later on the long timescale. For example, if I’m in the process of planning a trip to Paris, I might need to worry about (i) booking a hotel and tickets (long timescale), (ii) navigating the website I’m using to find a flight (medium timescale) and (iii) moving my finger towards the correct key for entering some specific text into a field (short timescale). But I don’t need to worry about walking down the escalator in the airport at this moment.
Here’s an attempt to formalize these properties.
We will define a certain formal language for describing environments. These environments are going to be certain asymptotic regions in the space of MDPs.
Each term $t$ has a type, which consists of a tuple of inputs $(x_1, x_2 \ldots x_n)$ and a single output $y$. Each input $x_i$ is associated with an HV-polytope[2] $P(x_i)$. The output is associated with an H-polytope[3] $Q(y)$. The inputs represent action spaces (to get a discrete action set, we use the simplex of probability distributions on this set). The output represents the space of admissible equilibria.
The atomic terms are finite communicating[4] MDPs, in which each state $s \in S$ is associated with a particular input $\iota_s$ and a transition kernel $T_s : P(\iota_s) \to \Delta S$, which is required to be an affine mapping. For an atomic term, $Q(y)$ is the polytope of stationary state-action distributions. Notice that this polytope is efficiently computable (see the sketch below).
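To make this concrete, here is a minimal sketch (my own illustration, in Python, specialized to finite action sets, so that each $P(\iota_s)$ is a simplex and the kernel takes the familiar form $T(s' \mid s, a)$) of the H-representation of the polytope of stationary state-action distributions: $\mu \ge 0$, $\sum_{s,a} \mu(s,a) = 1$, and $\sum_a \mu(s,a) = \sum_{s',a'} T(s \mid s', a')\, \mu(s', a')$ for every state $s$. Its size is polynomial in the size of the MDP, which is the sense in which $Q(y)$ is efficiently computable.

```python
import numpy as np

def stationary_polytope(T):
    """Return (A_eq, b_eq) such that the polytope of stationary state-action
    distributions is {mu : A_eq @ mu = b_eq, mu >= 0}.

    T: array of shape (S, A, S), where T[s, a, s2] = P(s2 | s, a).
    The variable mu is flattened as mu[s * A + a].
    """
    S, A, _ = T.shape
    n = S * A
    flow = np.zeros((S, n))  # flow conservation: outflow of s equals inflow into s
    for s in range(S):
        flow[s, s * A : (s + 1) * A] += 1.0         # outflow: sum_a mu(s, a)
        for s2 in range(S):
            for a in range(A):
                flow[s, s2 * A + a] -= T[s2, a, s]  # inflow: T(s | s2, a) * mu(s2, a)
    norm = np.ones((1, n))                          # total probability mass is one
    A_eq = np.vstack([flow, norm])
    b_eq = np.concatenate([np.zeros(S), [1.0]])
    return A_eq, b_eq
```

Average-reward control of a single atomic term then reduces to maximizing a linear functional $r^\top \mu$ over this polytope, i.e. to one LP.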
Given two terms $t_1 : (x_1 \ldots x_n) \to y$ and $t_2 : (x'_1 \ldots x'_m) \to y'$, we can construct a new term $t_1 \times t_2 : (x_1 \ldots x_n, x'_1 \ldots x'_m) \to y \times y'$. We set $Q(y \times y') := Q(y) \times Q(y')$. This represents a process made of two independent parts.
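Concretely, the H-description of $Q(y) \times Q(y')$ is just the two constraint systems stacked block-diagonally; a one-line sketch:

```python
import numpy as np
from scipy.linalg import block_diag

def product_polytope(A1, b1, A2, b2):
    """H-representation of Q1 x Q2: {(q1, q2) : A1 @ q1 <= b1, A2 @ q2 <= b2}."""
    return block_diag(A1, A2), np.concatenate([b1, b2])
```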
Given a term $t : (x_1 \ldots x_n) \to y$, $n$ terms $\{u_k : (x'_{k1} \ldots x'_{km_k}) \to y'_k\}_{1 \le k \le n}$ and surjective affine mappings $\{f_k : Q(y'_k) \to P(x_k)\}$, we can construct a new term $t_f[u] : (x'_{ij}) \to y$. This represents an environment governed by $t$ on long timescales and by $u$ on short timescales. Notice that it’s possible to efficiently verify that each $f_k$ is a surjection, which is why we use HV-polytopes for inputs[5].
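The verification could plausibly go as follows (a sketch under my own assumptions, not necessarily the intended procedure). The image $f_k(Q(y'_k))$ is convex, so $f_k$ is onto $P(x_k)$ iff (a) every inequality in the H-part of $P(x_k)$’s description holds on the image, one LP per inequality, and (b) every vertex in the V-part has a preimage in $Q(y'_k)$, one LP feasibility check per vertex. Both halves of the HV-description get used:

```python
import numpy as np
from scipy.optimize import linprog

def is_surjective(F, f0, A_Q, b_Q, A_P, b_P, verts_P, tol=1e-9):
    """Check that the affine map q -> F @ q + f0 maps the (bounded) H-polytope
    Q = {q : A_Q @ q <= b_Q} onto the HV-polytope
    P = {p : A_P @ p <= b_P} = conv(verts_P).
    """
    # (a) f(Q) inside P: for each inequality a.p <= b of P, maximize
    #     a.(F q + f0) over Q and compare with b.
    for a, b in zip(A_P, b_P):
        res = linprog(-(a @ F), A_ub=A_Q, b_ub=b_Q, bounds=(None, None))
        if a @ f0 - res.fun > b + tol:
            return False
    # (b) P inside f(Q): by convexity of the image, it suffices that each vertex
    #     v of P has a preimage, i.e. {A_Q q <= b_Q, F q = v - f0} is feasible.
    for v in verts_P:
        res = linprog(np.zeros(F.shape[1]), A_ub=A_Q, b_ub=b_Q,
                      A_eq=F, b_eq=v - f0, bounds=(None, None))
        if not res.success:
            return False
    return True
```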
It might be useful to think of $t_1 \times t_2$ as vertical composition and $t_f[u]$ as horizontal composition, in the category-theoretic sense.
In order to assign semantics to this language, we need to define the environment associated with each term $t : (x_1 \ldots x_n) \to y$. We will do so by assigning $t$ a state space $S(t)$, each state $s \in S(t)$ an input $\iota(s)$ (which determines the action space at this state) and a transition kernel. This is done recursively:
For the atomic terms, it is straightforward.
For $t_1 \times t_2$:
$S(t_1 \times t_2) := S(t_1) \times S(t_2) \times \{1, 2\}$. Here, the last factor represents which subenvironment is active. This is needed because we want the two subenvironments to be asynchronous, i.e. their time dynamics don’t have to be in lockstep.
$\iota(s_1, s_2, i) := \iota(s_i)$
The transition kernel at $(s_1, s_2, i)$ is defined by updating $s_i$ according to the transition kernel of $t_i$ and then changing $i$ according to some arbitrary probabilistic rule, as long as this rule switches the active subenvironment sufficiently often (sketched below). The degrees of freedom here are one reason we get an asymptotic region in MDP-space rather than a specific MDP.
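A minimal sketch of one step of the $t_1 \times t_2$ environment (hypothetical interfaces; the i.i.d. coin flip is a placeholder for the arbitrary switching rule):

```python
import random

def product_step(state, action, kernels, switch_prob=0.5, rng=random):
    """One step of the t1 x t2 environment.

    state:   (s1, s2, i), with i in {0, 1} marking the active subenvironment;
    action:  an action for the active subenvironment, a point of P(iota(s_i));
    kernels: (k0, k1), where kernels[i](s, a) samples the next state of t_i.
    """
    s = [state[0], state[1]]
    i = state[2]
    s[i] = kernels[i](s[i], action)  # only the active subenvironment moves
    if rng.random() < switch_prob:   # any rule that switches often enough works
        i = 1 - i
    return (s[0], s[1], i)
```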
For $t_f[u]$:
$S(t_f[u]) := \bigcup_{s \in S(t)} S(u_{\iota(s)})$, where we abuse notation to identify the input $\iota(s)$ with its index inside the tuple.
$\iota$ is extended from $u$ in the obvious way.
Given $s \in S(t)$ and $s' \in S(u_{\iota(s)})$, the $t_f[u]$-transition kernel at $s'$ is defined as follows: (i) with high probability, $s'$ is updated according to the transition kernel of $u_{\iota(s)}$; (ii) with low probability, $s$ is updated according to the transition kernel of $t$, where the action is determined by the empirical frequency of state-action pairs since the last type II transition (it is easy to see that $Q(y)$ is always a polytope in an appropriately defined space of state-action distributions).
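A minimal sketch of one such step (hypothetical interfaces; $\epsilon$ makes the “low probability” explicit, and the re-initialization of the fast state after a type II transition is a user-supplied `reset`, since the construction leaves such degrees of freedom open):

```python
import random
from collections import Counter

def tf_step(state, a_fast, kernels_u, kernel_t, f, iota, reset, eps=0.01, rng=random):
    """One step of the t_f[u] environment.

    state:     (s, s_fast, counts): the slow state of t, the fast state of
               u_{iota(s)}, and a Counter of fast state-action pairs seen
               since the last type II transition;
    a_fast:    the agent's action in the fast subenvironment;
    kernels_u: kernels_u[k](s2, a) samples the next state of u_k;
    kernel_t:  kernel_t(s, p) samples the next slow state, p in P(iota(s));
    f:         f[k] maps an empirical state-action distribution of u_k to P(x_k);
    iota:      iota(s) is the input index associated with slow state s;
    reset:     reset(k) samples a fresh fast state for u_k (hypothetical).
    """
    s, s_fast, counts = state
    k = iota(s)
    if rng.random() >= eps or not counts:
        # Type I (high probability): the fast subenvironment moves.
        counts[(s_fast, a_fast)] += 1
        s_fast = kernels_u[k](s_fast, a_fast)
    else:
        # Type II (low probability): the slow state moves; its action is the
        # image under f_k of the empirical fast state-action frequency.
        total = sum(counts.values())
        freq = {sa: c / total for sa, c in counts.items()}
        s = kernel_t(s, f[k](freq))
        s_fast, counts = reset(iota(s)), Counter()
    return (s, s_fast, counts)
```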
The upshot is that, given a list of term definitions (which has a structure similar to a directed acyclic graph, since the definition of each term can refer to previously defined terms), we get an environment that can have an exponentially large number of states, but whose control problem can be solved in time polynomial in the size of the description, given some assumptions about the reward function. Specifically, we “decorate” our terms with reward functions in the following way:
For atomic terms, we just specify the reward function in the straightforward way.
For $t_1 \times t_2$, we specify some $c_1, c_2 \ge 0$. The reward is then a linear combination of the individual rewards with these coefficients (and doesn’t depend on which subenvironment is active).
For a term of the form $t_f[u]$, we require that $r'_k(p) = \max_{q \in f_k^{-1}(p)} r_{u_k}(q)$ for some affine $r'_k : P(x_k) \to \mathbb{R}$ which is part of the decoration. This can be validated efficiently (here it’s important again that the input is an HV-polytope; a sketch follows below). In addition, we specify some $c, c' \ge 0$, and the reward is a linear combination with these coefficients of the $t$-reward and the $u$-reward.
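One plausible way to validate this condition (my own reasoning, not necessarily the intended procedure): the fiber-maximum $g(p) := \max_{q \in f_k^{-1}(p)} r_{u_k}(q)$ is concave and piecewise-linear, so an affine $r'_k$ satisfying $r'_k \circ f_k \ge r_{u_k}$ on all of $Q(y'_k)$ (checkable with one LP) that is tight at every vertex of $P(x_k)$ (one LP per vertex, using the V-part of the description) must equal $g$ everywhere: writing $p = \sum_i \lambda_i v_i$, concavity gives $g(p) \ge \sum_i \lambda_i g(v_i) = r'_k(p) \ge g(p)$. A sketch, assuming $r_{u_k}$ is presented as a linear functional $c_u$ on the state-action distribution space and $f_k(q) = F q + f_0$:

```python
import numpy as np
from scipy.optimize import linprog

def validate_reward(F, f0, c_r, d_r, c_u, A_Q, b_Q, verts_P, tol=1e-9):
    """Check that r'(p) = c_r @ p + d_r equals max{c_u @ q : q in Q, F @ q + f0 = p}
    on P = f(Q), where Q = {q : A_Q @ q <= b_Q}; assumes f is onto P.
    """
    # (a) r'(f(q)) >= r_u(q) on all of Q: maximize the violation with one LP.
    c = c_u - F.T @ c_r
    res = linprog(-c, A_ub=A_Q, b_ub=b_Q, bounds=(None, None))
    if -res.fun > c_r @ f0 + d_r + tol:
        return False
    # (b) Tightness at each vertex v of P: the max of r_u over the fiber
    #     f^{-1}(v) must reach r'(v); with (a) and concavity, this suffices.
    for v in verts_P:
        res = linprog(-c_u, A_ub=A_Q, b_ub=b_Q, A_eq=F, b_eq=v - f0,
                      bounds=(None, None))
        if -res.fun < c_r @ v + d_r - tol:
            return False
    return True
```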
For timescale decomposition, this planning algorithm can be regarded as a formalization of instrumental goals.
An important open problem is understanding the sample complexity of learning hypothesis classes made of such environments: first in the computationally unbounded case, and then with polynomial-time learning algorithms.
[1] “Control” means finding the optimal policy given a known transition kernel and reward function.
[2] An HV-polytope is a polytope described by a list of inequalities and a list of vertices (notice that it’s possible to efficiently validate such a description).
[3] An H-polytope is a polytope described by a list of inequalities.
[4] Maybe we can drop the requirement that the MDP is communicating and use the polytope of reachable stationary state-action distributions for $Q(y)$.
[5] According to Tiwary 2008, projection of H-polytopes is NP-hard even in the output-sensitive sense, but for non-degenerate projection directions it is output-sensitive polynomial time. In particular, this means we should be able to efficiently verify surjectivity in the non-degenerate case even with H-polytopes on the inputs. However, the proof given there seems poorly written, and the paper is not peer-reviewed AFAICT.