If I’m interpreting the algorithm on page 7 correctly, it looks like if there’s a trap that the demonstrator falls into with probability 10^{-9}, there’s no limit on the probability that the agent falls into the trap?
Also, and maybe relatedly, do the demonstrator-models in the paper have to be deterministic?
Yes to the first question. In the DLRL paper I assume the advisor takes unsafe actions with probability exactly 0. However, it is straightforward to generalize the result s.t. the advisor can take unsafe actions with probability δ≪ϵ, where ϵ is the lower bound on the probability of taking an optimal action (Definition 8). Moreover, in DLIRL (which, I believe, is closer to your setting) I use a “soft” assumption (see Definition 3 there) that doesn’t require any probability to vanish entirely.
No to the second question. In neither setting is the advisor assumed to be deterministic.
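For concreteness, the kind of relaxed condition described above can be sketched roughly as follows; this is only a paraphrase for illustration, not the exact statement of Definition 8 in the paper, and the notation (π^{adv} for the advisor policy, the split into “optimal” vs. “unsafe” actions) is introduced here for the sketch.

```latex
% Rough sketch of the relaxed advisor condition (illustration only, not the
% exact Definition 8 of the DLRL paper; pi^{adv} denotes the advisor policy,
% h a history, a an action).
\[
\begin{aligned}
\pi^{\mathrm{adv}}(a \mid h) &\ge \epsilon \quad \text{for at least one optimal action } a,\\
\pi^{\mathrm{adv}}(a \mid h) &\le \delta \quad \text{for every unsafe action } a,\\
\delta &\ll \epsilon \quad \text{(the original result takes } \delta = 0\text{)}.
\end{aligned}
\]
```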
Okay, then I assume the agent’s models of the advisor are not exclusively deterministic either?
What I care most about is the ratio of probabilities that the advisor vs. the agent takes the unsafe action, where we, as programmers, don’t know (so the agent doesn’t get told at the beginning) any bounds on what these advisor-probabilities are. Can this modification be recast to have that property? Or does it already?
Of course. I assume realizability, so one of the hypotheses is the true advisor behavior, which is stochastic.
In order to achieve the optimal regret bound, you do need to know the values of δ and ϵ. In DLIRL, you need to know β. However, AFAIU your algorithm also depends on some parameter (α)? In principle, if you don’t know anything about the parameters, you can set them to be some function of the time discount s.t. as γ→1 the bound becomes true and the regret still goes to 0. In DLRL, this requires ω(1−γ) ≤ ϵ(γ) ≤ o(1); in DLIRL, ω((1−γ)^{2/3}) ≤ β(γ)^{-1} ≤ o(1). However, then you only know that regret vanishes at a certain asymptotic rate, without having a quantitative bound.
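As a quick sanity check on these asymptotic conditions, here is one concrete parameter schedule that satisfies them; the specific exponents are chosen purely for illustration and are not choices made in the papers.

```latex
% Illustrative schedules satisfying the stated asymptotic conditions
% (example exponents only, not taken from the DLRL/DLIRL papers).
\[
\epsilon(\gamma) := (1-\gamma)^{1/2}, \qquad \beta(\gamma)^{-1} := (1-\gamma)^{1/3}.
\]
% Check, as gamma -> 1:
\[
\frac{1-\gamma}{\epsilon(\gamma)} = (1-\gamma)^{1/2} \to 0
\;\Rightarrow\; \epsilon(\gamma) = \omega(1-\gamma),
\qquad
\epsilon(\gamma) \to 0 \;\Rightarrow\; \epsilon(\gamma) = o(1),
\]
\[
\frac{(1-\gamma)^{2/3}}{\beta(\gamma)^{-1}} = (1-\gamma)^{1/3} \to 0
\;\Rightarrow\; \beta(\gamma)^{-1} = \omega\!\left((1-\gamma)^{2/3}\right),
\qquad
\beta(\gamma)^{-1} \to 0 \;\Rightarrow\; \beta(\gamma)^{-1} = o(1).
\]
```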
That makes sense.
Alpha only needs to be set based on a guess about what the prior on the truth is. It doesn’t need to be set based on guesses about possibly countably many traps of varying advisor-probability.
I’m not sure I understand whether you were saying that the ratio of probabilities that the advisor vs. the agent takes an unsafe action can indeed be bounded in DL(I)RL.
Hmm, yes, I think the difference comes from imitation vs. RL. In your setting, you only care about producing a good imitation of the advisor. On the other hand, in my settings, I want to achieve near-optimal performance (which the advisor doesn’t achieve). So I need stronger assumptions.
Well, in DLIRL the probability that the advisor takes an unsafe action on any given round is bounded by roughly e^{-β}, whereas the probability that the agent takes an unsafe action over a duration of (1−γ)^{-1} is bounded by roughly β^{-1}(1−γ)^{-2/3}, so it’s not a ratio but there is some relationship. I’m sure you can derive some relationship in DLRL too, but I haven’t studied it (like I said, I only worked out the details when the advisor never takes unsafe actions).
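To see why this is “some relationship” rather than a fixed ratio, here is a small numerical illustration that simply plugs sample values of β and γ into the two rough bounds quoted above; the sample values are arbitrary and only meant to show how differently the two quantities scale.

```python
import math

# Plug sample values into the two rough bounds quoted above (illustration only):
#   advisor, per round:                  ~ exp(-beta)
#   agent, over a horizon of (1-gamma)^-1: ~ beta^-1 * (1-gamma)^(-2/3)
def advisor_bound(beta):
    return math.exp(-beta)

def agent_bound(beta, gamma):
    return (1.0 / beta) * (1.0 - gamma) ** (-2.0 / 3.0)

for beta, gamma in [(50, 0.99), (200, 0.99), (1000, 0.999)]:
    adv = advisor_bound(beta)
    agt = agent_bound(beta, gamma)
    # The advisor's bound shrinks exponentially in beta while the agent's shrinks
    # only polynomially, so their ratio blows up even though both improve with beta.
    print(f"beta={beta:5d} gamma={gamma}  advisor~{adv:.3e}  agent~{agt:.3e}")
```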
Neat, makes sense.