I am not sure I understand your use of C(U) in the third from last paragraph where you define goal directed intelligence. As you define C it is a complexity measure over programs P. I assume this was a typo and you mean K(U)? Or am I misunderstanding the definition of either U or C?
I’m imagining that we have a program P that outputs (i) a time discount parameter γ∈Q∩[0,1), (ii) a circuit for the transition kernel of an automaton T:S×A×O→S and (iii) a circuit for a reward function r:S→Q (and, ii+iii are allowed to have a shared component to save computation time complexity). The utility function is U:(A×O)ω→R defined by
Okay, I think this makes sense. The idea is trying to re-interpret the various functions in the utility function as a single function and asking about the notion of complexity on that function which combines the complexity of producing a circuit which computes that function and the complexity of the circuit itself.
But just to check: is T over S×A×O→S? I thought T in utility functions only depended on states and actions S×A→S?
Maybe I am confused by what you mean by S. I thought it was the state space, but that isn’t consistent with r in your post which was defined over A×O→Q? As a follow up: defining r as depending on actions and observations instead of actions and states (which e.g. the definition in POMDP on Wikipedia) seems like it changes things. So I’m not sure if you intended the rewards to correspond with the observations or ‘underlying’ states.
One more question, this one about the priors: what are they a prior over exactly? I will use the letters/terms from https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process to try to be explicit. Is the prior capturing the “set of conditional observation probabilities” (O on Wikipedia)? Or is it capturing the “set of conditional transition probabilities between states” (T on Wikipedia)? Or is it capturing a distribution over all possible T and O? Or are you imaging that T is defined with U (and is non-random) and O is defined within the prior? I ask because the term DKL(ζ0||ζ) will be positive infinity if ζ is zero for any value where ζ0 is non-zero. Which makes the interpretation that it is either O or T directly pretty strange (for example, in the case where there are two states s1 and s2 and two obersvations o1 and o2 an O where P(si|oi)=1 and P(si|oj)=0 if i≠j would have a KL divergence of infinity from the ζ0 if ζ0 had non-zero probability on P(s1|o2)). So, I assume this is a prior over what the conditional observation matrices might be. I am assuming that your comment above implies that T is defined in the utility function U instead, and is deterministic?
Maybe I am confused by what you mean by S. I thought it was the state space, but that isn’t consistent with r in your post which was defined over A×O→Q?
I’m not entirely sure what you mean by the state space.S is a state space associated specifically with the utility function. It has nothing to do with the state space of the environment. The reward function in the OP is (A×O)∗→R, not A×O→R. I slightly abused notation by defining r:S→Q in the parent comment. Let’s say it’s r′:S→Q and r is defined by using T to translate the history to the (last) state and then applying r′.
One more question, this one about the priors: what are they a prior over exactly? …I ask because the term DKL(ζ0||ζ) will be positive infinity if ζ is zero for any value where ζ0 is non-zero.
The prior is just an environment i.e. a partial mapping ζ:(A×O)∗→ΔO defined on every history to which it doesn’t itself assign probability 0. The expression DKL(ξ||ζ) means that we consider all possible ways to choose a Polish space X, probability distributions μ,ν∈ΔX and a mapping f:X×(A×O)∗→ΔO s.t.ζ=Eμ[f] and ξ=Eν[f] (where the expected value is defined using the Bayes law and not pointwise, see also the definition of “instrumental states” here), and take the minimum over all of them of DKL(ν||μ).
I am not sure I understand your use of C(U) in the third from last paragraph where you define goal directed intelligence. As you define C it is a complexity measure over programs P. I assume this was a typo and you mean K(U)? Or am I misunderstanding the definition of either U or C?
This is not a typo.
I’m imagining that we have a program P that outputs (i) a time discount parameter γ∈Q∩[0,1), (ii) a circuit for the transition kernel of an automaton T:S×A×O→S and (iii) a circuit for a reward function r:S→Q (and, ii+iii are allowed to have a shared component to save computation time complexity). The utility function is U:(A×O)ω→R defined by
U(x):=(1−γ)∞∑n=0γnr(sxn)
where sx∈Sω is defined recursively by
sxn+1=T(s,xn)
Okay, I think this makes sense. The idea is trying to re-interpret the various functions in the utility function as a single function and asking about the notion of complexity on that function which combines the complexity of producing a circuit which computes that function and the complexity of the circuit itself.
But just to check: is T over S×A×O→S? I thought T in utility functions only depended on states and actions S×A→S?
Maybe I am confused by what you mean by S. I thought it was the state space, but that isn’t consistent with r in your post which was defined over A×O→Q? As a follow up: defining r as depending on actions and observations instead of actions and states (which e.g. the definition in POMDP on Wikipedia) seems like it changes things. So I’m not sure if you intended the rewards to correspond with the observations or ‘underlying’ states.
One more question, this one about the priors: what are they a prior over exactly? I will use the letters/terms from https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process to try to be explicit. Is the prior capturing the “set of conditional observation probabilities” (O on Wikipedia)? Or is it capturing the “set of conditional transition probabilities between states” (T on Wikipedia)? Or is it capturing a distribution over all possible T and O? Or are you imaging that T is defined with U (and is non-random) and O is defined within the prior?
I ask because the term DKL(ζ0||ζ) will be positive infinity if ζ is zero for any value where ζ0 is non-zero. Which makes the interpretation that it is either O or T directly pretty strange (for example, in the case where there are two states s1 and s2 and two obersvations o1 and o2 an O where P(si|oi)=1 and P(si|oj)=0 if i≠j would have a KL divergence of infinity from the ζ0 if ζ0 had non-zero probability on P(s1|o2)). So, I assume this is a prior over what the conditional observation matrices might be. I am assuming that your comment above implies that T is defined in the utility function U instead, and is deterministic?
I’m not entirely sure what you mean by the state space.S is a state space associated specifically with the utility function. It has nothing to do with the state space of the environment. The reward function in the OP is (A×O)∗→R, not A×O→R. I slightly abused notation by defining r:S→Q in the parent comment. Let’s say it’s r′:S→Q and r is defined by using T to translate the history to the (last) state and then applying r′.
The prior is just an environment i.e. a partial mapping ζ:(A×O)∗→ΔO defined on every history to which it doesn’t itself assign probability 0. The expression DKL(ξ||ζ) means that we consider all possible ways to choose a Polish space X, probability distributions μ,ν∈ΔX and a mapping f:X×(A×O)∗→ΔO s.t.ζ=Eμ[f] and ξ=Eν[f] (where the expected value is defined using the Bayes law and not pointwise, see also the definition of “instrumental states” here), and take the minimum over all of them of DKL(ν||μ).