From the perspective of full agency (i.e., the negation of partial agency), a system which needs a protected epistemic layer sounds suspiciously like a system that can’t tile. You look at the world, and you say: “how can I maximize utility?” You look at your beliefs, and you say: “how can I maximize accuracy?” That’s not a consequentialist agent; that’s two different consequentialist agents!
For reinforcement learning with incomplete/fuzzy hypotheses, this separation doesn’t exist, because the update rule for fuzzy beliefs depends on the utility function and in some sense even on the actual policy.
How does that work?
Actually, I was somewhat confused about what the right update rule for fuzzy beliefs is when I wrote that comment, but I think I have it figured out now.
First, some background on fuzzy beliefs:
Let $E$ be the space of environments (defined as the space of instrumental states in Definition 9 here). A fuzzy belief is a concave function $\phi : E \to [0,1]$ s.t. $\sup \phi = 1$. We can think of it as the membership function of a fuzzy set. For an incomplete model $\Phi \subseteq E$, the corresponding $\phi$ is the concave hull of the characteristic function of $\Phi$ (i.e. the minimal concave $\phi$ s.t. $\phi \geq \chi_\Phi$).
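To make the concave hull concrete (this unpacking is mine, using the standard formula for the concave envelope): $\phi(\mu)$ is the largest total weight that elements of $\Phi$ can carry in a convex decomposition of $\mu$,
$$\phi(\mu) = \sup\left\{\sum_i \lambda_i \chi_\Phi(\mu_i) \;:\; \lambda_i \ge 0,\ \sum_i \lambda_i = 1,\ \sum_i \lambda_i \mu_i = \mu\right\}$$
In particular $\phi \equiv 1$ on $\operatorname{conv}\Phi$, and e.g. if $\mu = \tfrac{3}{4}\mu_1 + \tfrac{1}{4}\rho$ with $\mu_1 \in \Phi$ and $\rho$ arbitrary, then $\phi(\mu) \ge \tfrac{3}{4}$.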
Let $\gamma$ be the geometric discount parameter and $U(\gamma) := (1-\gamma)\sum_{n=0}^{\infty} \gamma^n r_n$ the utility function. Given a policy $\pi$ (EDIT: in general, we allow our policies to depend explicitly on $\gamma$), the value of $\pi$ at $\phi$ is defined by
$$V^\pi(\phi,\gamma) := 1 + \inf_{\mu \in E}\left(\mathbb{E}_{\mu\pi}[U(\gamma)] - \phi(\mu)\right)$$
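As a sanity check (this unpacking is mine, and assumes rewards are normalized so that $U(\gamma) \in [0,1]$): when $\phi$ is the concave hull of $\chi_\Phi$ for an incomplete model $\Phi$, this reduces to the usual worst-case value. For $\mu \in \Phi$ the bracket equals $\mathbb{E}_{\mu\pi}[U(\gamma)] - 1$, so $V^\pi(\phi,\gamma) \le \inf_{\mu \in \Phi} \mathbb{E}_{\mu\pi}[U(\gamma)]$; conversely, decomposing any $\mu$ as $\lambda \mu' + (1-\lambda)\rho$ with $\mu' \in \operatorname{conv}\Phi$ and $\lambda = \phi(\mu)$ gives $\mathbb{E}_{\mu\pi}[U(\gamma)] - \phi(\mu) \ge \lambda\left(\inf_{\mu'' \in \Phi} \mathbb{E}_{\mu''\pi}[U(\gamma)] - 1\right) \ge \inf_{\mu'' \in \Phi} \mathbb{E}_{\mu''\pi}[U(\gamma)] - 1$. Hence
$$V^\pi(\phi,\gamma) = \inf_{\mu \in \Phi} \mathbb{E}_{\mu\pi}[U(\gamma)]$$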
The optimal policy and the optimal value for $\phi$ are defined by
$$\pi^*_{\phi,\gamma} := \arg\max_\pi V^\pi(\phi,\gamma), \qquad V(\phi,\gamma) := \max_\pi V^\pi(\phi,\gamma)$$
Given a policy $\pi$, the regret of $\pi$ at $\phi$ is defined by
$$\mathrm{Rg}^\pi(\phi,\gamma) := V(\phi,\gamma) - V^\pi(\phi,\gamma)$$
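Here is a minimal computational sketch of these definitions, assuming everything is finite: a handful of candidate environments with membership values $\phi(\mu)$, a couple of policies, and the expected utilities $\mathbb{E}_{\mu\pi}[U(\gamma)]$ supplied as a lookup table (all names and numbers are made up for illustration, and the infimum over $E$ is restricted to the listed candidates).

```python
# Toy sketch of V^pi(phi, gamma), the optimal value and regret, for a finite case.
# Assumptions (mine, for illustration): finitely many candidate environments,
# finitely many policies, and E_{mu pi}[U(gamma)] given as a lookup table.

# membership function phi on the candidate environments (sup phi = 1)
phi = {"mu1": 1.0, "mu2": 1.0, "mu3": 0.5}

# expected utility E_{mu pi}[U(gamma)] for each (policy, environment) pair
expected_U = {
    "pi_a": {"mu1": 0.8, "mu2": 0.6, "mu3": 0.3},
    "pi_b": {"mu1": 0.5, "mu2": 0.9, "mu3": 0.7},
}

def value(policy):
    """V^pi(phi, gamma) = 1 + inf_mu ( E_{mu pi}[U(gamma)] - phi(mu) )."""
    return 1 + min(expected_U[policy][mu] - phi[mu] for mu in phi)

# V(phi, gamma) = max_pi V^pi(phi, gamma), attained by the optimal policy
optimal_policy = max(expected_U, key=value)
optimal_value = value(optimal_policy)

def regret(policy):
    """Rg^pi(phi, gamma) = V(phi, gamma) - V^pi(phi, gamma)."""
    return optimal_value - value(policy)

for pi in expected_U:
    print(pi, value(pi), regret(pi))
```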
$\pi$ is said to learn $\phi$ when it is asymptotically optimal for $\phi$ as $\gamma \to 1$, that is
$$\lim_{\gamma \to 1} \mathrm{Rg}^\pi(\phi,\gamma) = 0$$
Given a probability measure $\zeta$ over the space of fuzzy hypotheses, the Bayesian regret of $\pi$ at $\zeta$ is defined by
$$\mathrm{BRg}^\pi(\zeta,\gamma) := \mathbb{E}_{\phi \sim \zeta}\left[\mathrm{Rg}^\pi(\phi,\gamma)\right]$$
$\pi$ is said to learn $\zeta$ when
$$\lim_{\gamma \to 1} \mathrm{BRg}^\pi(\zeta,\gamma) = 0$$
If such a $\pi$ exists, $\zeta$ is said to be learnable. Analogously to Bayesian RL, $\zeta$ is learnable if and only if it is learned by a specific policy $\pi^*_\zeta$ (the Bayes-optimal policy). To define it, we define the fuzzy belief $\phi_\zeta$ by
$$\phi_\zeta(\mu) := \sup_{\substack{\sigma : \operatorname{supp}\zeta \to E \\ \mathbb{E}_{\phi\sim\zeta}[\sigma(\phi)] = \mu}} \mathbb{E}_{\phi\sim\zeta}\left[\phi(\sigma(\phi))\right]$$
We now define $\pi^*_\zeta := \pi^*_{\phi_\zeta}$.
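A quick sanity check (mine): if $\zeta$ is concentrated on a single fuzzy belief $\phi_0$, then a selection $\sigma$ just picks one environment, the constraint forces $\sigma(\phi_0) = \mu$, and we recover
$$\phi_\zeta = \phi_0, \qquad \pi^*_\zeta = \pi^*_{\phi_0}$$
as expected.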
Now, updating (EDIT: the original definition was needlessly complicated; simplified):
Consider a history $h \in (A \times O)^*$ or $h \in (A \times O)^* \times A$. Here $A$ is the set of actions and $O$ is the set of observations. Define $\mu^*_\phi$ by
$$\mu^*_\phi := \arg\max_{\mu \in E} \min_\pi \left(\phi(\mu) - \mathbb{E}_{\mu\pi}[U]\right)$$
Let $E'$ be the space of “environments starting from $h$”. That is, if $h \in (A \times O)^*$ then $E' = E$, and if $h \in (A \times O)^* \times A$ then $E'$ is slightly different because the history now begins with an observation instead of with an action.
For any $\mu \in E$, $\nu \in E'$ we define $[\nu]^h_\mu \in E$ by
$$[\nu]^h_\mu(o \mid h') := \begin{cases} \nu(o \mid h'') & \text{if } h' = hh'' \\ \mu(o \mid h') & \text{otherwise} \end{cases}$$
Then, the updated fuzzy belief is
$$(\phi \mid h)(\nu) := \phi\!\left([\nu]^h_{\mu^*_\phi}\right) + \text{constant}$$
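Finally, a small code sketch of the splice $[\nu]^h_\mu$ and of the resulting update, under my own toy representation of an environment as a function from histories (tuples of actions and observations) to distributions over observations; $\mu^*_\phi$ and the additive constant are taken as given here.

```python
# Toy sketch of the splice [nu]^h_mu and of the update (phi | h).
# Assumptions (mine, for illustration): an environment is a function mapping a
# history (a tuple of actions/observations) to a distribution over observations;
# mu_star stands for mu^*_phi, computed elsewhere; the constant is left abstract.

def splice(nu, h, mu):
    """Return [nu]^h_mu: behaves like nu on histories extending h, like mu elsewhere."""
    def spliced(history):
        if history[:len(h)] == h:          # history has the form h + h''
            return nu(history[len(h):])    # defer to nu on the continuation h''
        return mu(history)                 # otherwise defer to mu
    return spliced

def update(phi, h, mu_star, constant=0.0):
    """Return the updated fuzzy belief (phi | h)(nu) = phi([nu]^h_{mu^*_phi}) + constant."""
    def phi_given_h(nu):
        return phi(splice(nu, h, mu_star)) + constant
    return phi_given_h
```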