This idea was inspired by a correspondence with Adam Shimi.
It seems very interesting and important to understand to what extent a purely “behaviorist” view of goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed, and what its goals are, without any additional information?
Consider a general reinforcement learning setting: we have a set of actions $A$ and a set of observations $O$; a policy is a mapping $\pi: (A \times O)^* \to \Delta A$, a reward function is a mapping $r: (A \times O)^* \to [0,1]$, and the utility function is a time-discounted sum of rewards. (Alternatively, we could use instrumental reward functions.)
The simplest attempt at defining “goal-directed intelligence” is to require that the policy $\pi$ in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows $\pi$, or the prior can believe that behavior deviating from $\pi$ leads to some terrible outcome.
The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are “contrived”. However, description complexity is only naturally well-defined up to an additive constant. So, if we want a crisp concept, we need to consider an asymptotic regime in which the complexity of something goes to infinity. Indeed, it seems natural to ask that the complexity of the policy be much higher than the complexity of the prior and the reward function: in this case we can say that the “intentional stance” is an efficient description. However, this doesn’t work with description complexity alone: the description “optimal policy for $U$ and $\zeta$” has length $K(U) + K(\zeta) + O(1)$ (where $K(x)$ stands for the description complexity of $x$).
To salvage this idea, we need to take not only description complexity but also computational complexity into account. [EDIT: I was wrong, and we can get a well-defined concept in the unbounded setting too, see child comment. The bounded concept is still interesting.] For the intentional stance to be non-vacuous, we need to demand that the policy does some “hard work” in order to be optimal. Let’s make this formal. Consider any function of the type $f: \Sigma^* \to \Delta\Xi$, where $\Sigma$ and $\Xi$ are some finite alphabets. We can try to represent it by a probabilistic automaton $T: S \times \Sigma \to \Delta(S \times \Xi)$, where $S$ is the finite state space and $T$ is the transition kernel, and we feed symbols into the automaton one by one. Moreover, $T$ can be represented as a Boolean circuit $R$, and this circuit can be the output of some program $P$ executed by some fixed universal Turing machine. We can associate with this object five complexity parameters:
The description complexity, which is the length of $P$.
The computation time complexity, which is the size of $R$.
The computation space complexity, which is the maximum of the depth of $R$ and $\log|S|$.
The precomputation time complexity, which is the time it takes $P$ to run.
The precomputation space complexity, which is the space $P$ needs to run.
It is then natural to form a single complexity measure by applying a logarithm to the times and taking a linear combination of all five (we apply a logarithm so that a brute-force search over $n$ bits is roughly equivalent to hard-coding $n$ bits). The coefficients in this combination represent the “prices” of the various resources (but we should probably fix the price of description complexity to be 1). Of course, not all coefficients need be non-vanishing; it’s just that I prefer to keep maximal generality for now. We will denote this complexity measure $C$.
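For concreteness, here is a minimal sketch (in Python) of how such a measure could be assembled from the five parameters; the function name, parameter names and default prices are mine, chosen only for illustration.

```python
import math
from typing import Optional

def combined_complexity(
    desc_len: int,                  # length of the program P (description complexity)
    circuit_size: int,              # size of the circuit R (computation time complexity)
    circuit_space: int,             # max(depth of R, log|S|) (computation space complexity)
    precomp_time: int,              # time it takes P to run (precomputation time complexity)
    precomp_space: int,             # space P needs to run (precomputation space complexity)
    prices: Optional[dict] = None,  # "prices" of the resources; description is fixed at 1
) -> float:
    """A linear combination of the five parameters, with logarithms applied to the times."""
    p = {"time": 1.0, "space": 1.0, "precomp_time": 1.0, "precomp_space": 1.0}
    if prices:
        p.update(prices)
    return (
        desc_len
        + p["time"] * math.log2(max(circuit_size, 1))
        + p["space"] * circuit_space
        + p["precomp_time"] * math.log2(max(precomp_time, 1))
        + p["precomp_space"] * precomp_space
    )
```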
We can use such automata to represent policies, finite POMDP environments, and reward functions (not every policy or reward function, of course, but any that can be computed on a machine with finite space). In the case of policies, the computation time/space complexity can be regarded as the time/space cost of applying the “trained” algorithm, whereas the precomputation time/space complexity can be regarded as the time/space cost of training. If we wish, we can also think of the Boolean circuit as a recurrent neural network.
We can also use $C$ to define a prior $\zeta_0$, by ranging over programs $P$ that output a valid POMDP and assigning probability proportional to $2^{-C}$ to each instance. (Assuming that the environment has a finite state space might seem restrictive, but it becomes quite reasonable if we use a quasi-Bayesian setting with quasi-POMDPs that are not meant to be complete descriptions of the environment; for now we won’t go into details about this.)
Now, return to our policy $\pi$. Given $g > 0$, we say that “$\pi$ has goal-directed intelligence (at least) $g$” when there is a suitable prior $\zeta$ and utility function $U$ s.t. for any policy $\pi'$, if $E_{\zeta\pi'}[U] \geq E_{\zeta\pi}[U]$ then $C(\pi') \geq D_{KL}(\zeta_0 \| \zeta) + C(U) + g$. When $g = +\infty$ (i.e. no finite automaton can match the expected utility of $\pi$; in particular, this implies $\pi$ is optimal, since any policy can be approximated by a finite automaton), we say that $\pi$ is “perfectly goal-directed”. Here, $D_{KL}(\zeta_0 \| \zeta)$ serves as a way to measure the complexity of $\zeta$, which also ensures $\zeta$ is non-dogmatic in some rather strong sense.
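Spelled out as a single condition (this is just a restatement of the definition above): there exist $\zeta$ and $U$ such that, for every policy $\pi'$,

$$E_{\zeta\pi'}[U] \geq E_{\zeta\pi}[U] \;\Longrightarrow\; C(\pi') \geq D_{KL}(\zeta_0 \| \zeta) + C(U) + g.$$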
[EDIT: if we fix $U$ and $\zeta$ then $g$ is essentially the same as Yudkowsky’s definition of optimization power, if we regard the policy as the “outcome” and use $2^{-C}$ as our measure on the space of outcomes.]
With this definition we cannot “cheat” by encoding the policy into the prior or into the utility function, since doing so would leave no complexity gap between the policy and the prior or utility function. Therefore this notion seems like a non-trivial requirement on the policy. On the other hand, this requirement does hold sometimes, because solving the optimization problem can be much more computationally costly than just evaluating the utility function or sampling the prior.
I am not sure I understand your use of C(U) in the third-from-last paragraph, where you define goal-directed intelligence. As you define C, it is a complexity measure over programs P. I assume this was a typo and you mean K(U)? Or am I misunderstanding the definition of either U or C?
This is not a typo.
I’m imagining that we have a program $P$ that outputs (i) a time discount parameter $\gamma \in \mathbb{Q} \cap [0,1)$, (ii) a circuit for the transition kernel of an automaton $T: S \times A \times O \to S$, and (iii) a circuit for a reward function $r: S \to \mathbb{Q}$ (and ii+iii are allowed to share a component, to save computation time complexity). The utility function $U: (A \times O)^\omega \to \mathbb{R}$ is defined by
$$U(x) := (1 - \gamma) \sum_{n=0}^{\infty} \gamma^n r(s^x_n)$$

where $s^x \in S^\omega$ is defined recursively by

$$s^x_{n+1} = T(s^x_n, x_n)$$
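For concreteness, here is a minimal sketch of evaluating a truncated version of this sum, assuming the two circuits are given as ordinary Python callables; all names here are illustrative.

```python
from typing import Callable, Iterable, Tuple

State = int
ActObs = Tuple[str, str]  # an (action, observation) pair

def truncated_utility(
    gamma: float,                         # time discount parameter in [0, 1)
    T: Callable[[State, ActObs], State],  # transition kernel S x A x O -> S
    r: Callable[[State], float],          # reward function r': S -> Q
    x: Iterable[ActObs],                  # a finite prefix of the history
    s0: State = 0,                        # initial automaton state
) -> float:
    """Approximates U(x) = (1 - gamma) * sum_n gamma^n * r(s^x_n) on a finite prefix."""
    s, total, weight = s0, 0.0, 1.0
    for x_n in x:
        total += weight * r(s)            # collect reward at the current state s^x_n
        s = T(s, x_n)                     # s^x_{n+1} = T(s^x_n, x_n)
        weight *= gamma
    return (1.0 - gamma) * total
```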
Okay, I think this makes sense. The idea is to reinterpret the various functions inside the utility function as a single function, and to ask about a notion of complexity for that function which combines the complexity of producing a circuit that computes it with the complexity of the circuit itself.
But just to check: is $T$ of type $S \times A \times O \to S$? I thought $T$ in utility functions depended only on states and actions, $S \times A \to S$?
Maybe I am confused by what you mean by $S$. I thought it was the state space, but that isn’t consistent with $r$ in your post, which was defined over $A \times O \to \mathbb{Q}$? As a follow-up: defining $r$ as depending on actions and observations instead of actions and states (which is how e.g. the POMDP definition on Wikipedia does it) seems like it changes things. So I’m not sure whether you intended the rewards to correspond to the observations or to the ‘underlying’ states.
One more question, this one about the priors: what are they a prior over, exactly? I will use the letters/terms from https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process to try to be explicit. Is the prior capturing the “set of conditional observation probabilities” ($O$ on Wikipedia)? Or is it capturing the “set of conditional transition probabilities between states” ($T$ on Wikipedia)? Or is it capturing a distribution over all possible $T$ and $O$? Or are you imagining that $T$ is defined with $U$ (and is non-random) and $O$ is defined within the prior?
I ask because the term $D_{KL}(\zeta_0 \| \zeta)$ will be positive infinity if $\zeta$ is zero for any value where $\zeta_0$ is non-zero. That makes the interpretation that it is either $O$ or $T$ directly pretty strange (for example, in the case where there are two states $s_1$ and $s_2$ and two observations $o_1$ and $o_2$, an $O$ with $P(s_i|o_i) = 1$ and $P(s_i|o_j) = 0$ for $i \neq j$ would have a KL divergence of infinity from $\zeta_0$ if $\zeta_0$ had non-zero probability on $P(s_1|o_2)$). So, I assume this is a prior over what the conditional observation matrices might be. I am assuming that your comment above implies that $T$ is defined in the utility function $U$ instead, and is deterministic?
I’m not entirely sure what you mean by the state space. $S$ is a state space associated specifically with the utility function. It has nothing to do with the state space of the environment. The reward function in the OP is $(A \times O)^* \to \mathbb{R}$, not $A \times O \to \mathbb{R}$. I slightly abused notation by defining $r: S \to \mathbb{Q}$ in the parent comment. Let’s say it’s $r': S \to \mathbb{Q}$, and $r$ is defined by using $T$ to translate the history to the (last) state and then applying $r'$.
The prior is just an environment, i.e. a partial mapping $\zeta: (A \times O)^* \to \Delta O$ defined on every history to which it doesn’t itself assign probability 0. The expression $D_{KL}(\xi \| \zeta)$ means that we consider all possible ways to choose a Polish space $X$, probability distributions $\mu, \nu \in \Delta X$ and a mapping $f: X \times (A \times O)^* \to \Delta O$ s.t. $\zeta = E_\mu[f]$ and $\xi = E_\nu[f]$ (where the expected value is defined using Bayes’ law and not pointwise; see also the definition of “instrumental states” here), and take the minimum over all of them of $D_{KL}(\nu \| \mu)$.
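In symbols (just restating the above):

$$D_{KL}(\xi \| \zeta) := \min \big\{ D_{KL}(\nu \| \mu) \;:\; X \text{ Polish},\ \mu, \nu \in \Delta X,\ f: X \times (A \times O)^* \to \Delta O,\ \zeta = E_\mu[f],\ \xi = E_\nu[f] \big\}.$$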
Actually, contrary to what I claimed before, we don’t need computational complexity bounds for this definition to make sense. This is because the Solomonoff prior is made of computable hypotheses but is uncomputable itself.
Given $g > 0$, we say that “$\pi$ has (unbounded) goal-directed intelligence (at least) $g$” when there is a prior $\zeta$ and utility function $U$ s.t. for any policy $\pi'$, if $E_{\zeta\pi'}[U] \geq E_{\zeta\pi}[U]$ then $K(\pi') \geq D_{KL}(\zeta_0 \| \zeta) + K(U) + g$. Here, $\zeta_0$ is the Solomonoff prior and $K$ is Kolmogorov complexity. When $g = +\infty$ (i.e. no computable policy can match the expected utility of $\pi$; in particular, this implies $\pi$ is optimal, since any policy can be approximated by a computable policy), we say that $\pi$ is “perfectly (unbounded) goal-directed”.
Compare this notion to the Legg-Hutter intelligence measure. The LH measure depends on the choice of UTM in radical ways. In fact, for some UTMs, AIXI (which maximizes the LH measure) becomes computable or even really stupid. For example, it can keep taking the same action forever out of fear that taking any other action leads to an inescapable “hell” state. On the other hand, goal-directed intelligence differs only by $O(1)$ between UTMs, just like Kolmogorov complexity. A perfectly unbounded goal-directed policy has to be uncomputable, and which policies qualify as such doesn’t depend on the UTM at all.
I think it’s also possible to prove that intelligence is rare, in the sense that, for any computable stochastic policy, if we regard it as a probability measure over deterministic policies, then for any $\epsilon > 0$ there is a $g$ s.t. the probability of getting intelligence at least $g$ is smaller than $\epsilon$.
Also interesting is that, for bounded goal-directed intelligence, increasing the prices can only decrease intelligence by $O(1)$, and a policy that is perfectly goal-directed w.r.t. lower prices is also perfectly goal-directed w.r.t. higher prices (I think). In particular, a perfectly unbounded goal-directed policy is perfectly goal-directed for any price vector. Informally speaking, an agent that is very smart relative to a context with cheap computational resources is still very smart relative to a context where they are expensive, which makes intuitive sense.
If we choose just one computational resource, we can speak of the minimal price for which a given policy is perfectly goal-directed, which is another way to measure intelligence with a more restricted domain. Curiously, our bounded Solomonoff-like prior has the shape of a Maxwell-Boltzmann distribution in which the prices are thermodynamic parameters. Perhaps we can regard the minimal price as the point of a phase transition.
Some problems to work on regarding goal-directed intelligence. Conjecture 5 is especially important for deconfusing basic questions in alignment, as it stands in opposition to Stuart Armstrong’s thesis about the impossibility of deducing preferences from behavior alone.
1. Conjecture. Informally: It is unlikely to produce intelligence by chance. Formally: Denote by $\Pi$ the space of deterministic policies, and consider some $\mu \in \Delta\Pi$. Suppose $\mu$ is equivalent to a stochastic policy $\pi^*$. Then, $E_{\pi \sim \mu}[g(\pi)] = O(C(\pi^*))$.
2. Find an “intelligence hierarchy theorem”. That is, find an increasing sequence $\{g_n\}$ s.t. for every $n$, there is a policy with goal-directed intelligence in $(g_n, g_{n+1})$ (no more and no less).
3. What is the computational complexity of evaluating $g$, given (i) oracle access to the policy or (ii) a description of the policy as a program or automaton?
4. What is the computational complexity of producing a policy with given $g$?
5. Conjecture. Informally: Intelligent agents have well-defined priors and utility functions. Formally: For every $(U, \zeta)$ with $C(U) < \infty$ and $D_{KL}(\zeta_0 \| \zeta) < \infty$, and every $\epsilon > 0$, there exists $g \in (0, \infty)$ s.t. for every policy $\pi$ with intelligence at least $g$ w.r.t. $(U, \zeta)$, and every $(\tilde{U}, \tilde{\zeta})$ s.t. $\pi$ has intelligence at least $g$ w.r.t. them, any optimal policies $\pi^*, \tilde{\pi}^*$ for $(U, \zeta)$ and $(\tilde{U}, \tilde{\zeta})$ respectively satisfy $E_{\zeta\tilde{\pi}^*}[U] \geq E_{\zeta\pi^*}[U] - \epsilon$.
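To make the quantifier structure of Conjecture 5 explicit (this is just the statement above, rewritten):

$$\forall (U, \zeta)\ \forall \epsilon > 0\ \exists g \in (0, \infty)\ \forall \pi\ \forall (\tilde{U}, \tilde{\zeta}): \quad E_{\zeta\tilde{\pi}^*}[U] \geq E_{\zeta\pi^*}[U] - \epsilon,$$

where $(U, \zeta)$ ranges over pairs with $C(U) < \infty$ and $D_{KL}(\zeta_0 \| \zeta) < \infty$, $\pi$ is required to have intelligence at least $g$ w.r.t. both $(U, \zeta)$ and $(\tilde{U}, \tilde{\zeta})$, and $\pi^*, \tilde{\pi}^*$ are optimal for $(U, \zeta)$ and $(\tilde{U}, \tilde{\zeta})$ respectively.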
re: #5, that doesn’t seem to claim that we can infer $U$ given the agent’s actions, which is what the impossibility of deducing preferences is actually claiming. That is, assuming 5, we still cannot show that there isn’t some $U_1 \neq U_2$ such that $\pi^*(U_1, \zeta) = \pi^*(U_2, \zeta)$.
(And as pointed out elsewhere, it isn’t Stuart’s thesis; it’s a well-known and basic result in the decision theory / economics / philosophy literature.)
You misunderstand the intent. We’re talking about inverse reinforcement learning. The goal is not necessarily inferring the unknown $U$, but producing some behavior that optimizes the unknown $U$. Of course, if the policy you’re observing is optimal, then it’s trivial to do so by following the same policy. But using my approach, we might be able to extend it into results like “the policy you’re observing is optimal w.r.t. a certain computational complexity, and your goal is to produce an optimal policy w.r.t. a higher computational complexity.”
(Btw I think the formal statement I gave for 5 is false, but there might be an alternative version that works.)
I am referring to this and related work by Armstrong.
Apologies, I haven’t taken the time to understand all of this yet, but I have a basic question you might have an answer to...
We know how to map (deterministic) policies to reward functions using the construction at the bottom of page 6 of the reward modelling agenda (https://arxiv.org/abs/1811.07871v1): the agent is rewarded only if it has so far done exactly what the policy would do. I think of this as a wrapper function (https://en.wikipedia.org/wiki/Wrapper_function).
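Here is a minimal sketch of that wrapper construction as I understand it, for a deterministic policy given as a Python callable (the names are mine, not from the paper):

```python
from typing import Callable, List, Tuple

Action = str
Observation = str
History = List[Tuple[Action, Observation]]

def wrapper_reward(policy: Callable[[History], Action]) -> Callable[[History], float]:
    """Reward 1 iff every action in the history so far is exactly what `policy`
    would have chosen at that point, else 0."""
    def r(history: History) -> float:
        for n, (action, _obs) in enumerate(history):
            if action != policy(history[:n]):
                return 0.0
        return 1.0
    return r
```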
It seems like this means that, for any policy, we can represent it as optimizing a reward function, with only the minimal overhead of the wrapper in description/computational complexity.
So...
Do you think this analysis is correct? Or what is it missing? (Maybe the assumption that the policy is deterministic is significant? I think this turns out to be the case for Orseau et al.’s “Agents and Devices” approach: https://arxiv.org/abs/1805.12387.)
Are you trying to get around this somehow? Or are you fine with this minimal overhead being used to distinguish goal-directed from non-goal-directed policies?
My framework discards such contrived reward functions because it penalizes the complexity of the reward function. In the construction you describe, we have $C(U) \approx C(\pi)$. This corresponds to $g \approx 0$ (no/low intelligence). On the other hand, policies with $g \gg 0$ (high intelligence) have the property that $C(\pi) \gg C(U)$ for the $U$ which “justifies” this $g$. In other words, your “minimal” overhead is very large from my point of view: to be acceptable, the “overhead” should be substantially negative.
I think the construction gives us $C(\pi) \leq C(U) + e$ for a small constant $e$ (representing the wrapper). It seems like any compression you can apply to the reward function can be translated to the policy via the wrapper. So then you would never have $C(\pi) \gg C(U)$. What am I missing/misunderstanding?
For the contrived reward function you suggested, we would never have $C(\pi) \gg C(U)$. But for other reward functions, it is possible that $C(\pi) \gg C(U)$. That is exactly why this framework rejects the contrived reward function in favor of those other reward functions, and also why it considers some policies unintelligent (despite the availability of the contrived reward function) and other policies intelligent.