Okay, then I assume the agent’s models of the advisor are not exclusively deterministic either?
Of course. I assume realizability, so one of the hypotheses is the true advisor behavior, which is stochastic.
What I care most about is the ratio of probabilities that the advisor vs. agent takes the unsafe action, where we as programmers don’t know (so the agent doesn’t get told at the beginning) any bounds on what these advisor-probabilities are. Can this modification be recast to have that property? Or does it already?
In order to achieve the optimal regret bound, you do need to know the values of δ and ϵ. In DLIRL, you need to know β. However, AFAIU your algorithm also depends on some parameter (α)? In principle, if you don’t know anything about the parameters, you can set them to be some function of the time discount s.t. as γ→1 the bound becomes true and the regret still goes to 0. In DLRL, this requires ω(1−γ) ≤ ϵ(γ) ≤ o(1); in DLIRL, ω((1−γ)^{2/3}) ≤ β(γ)^{−1} ≤ o(1). However, then you only know that regret vanishes at a certain asymptotic rate, without having a quantitative bound.
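[To make that last option concrete: one admissible choice of schedules, an illustrative pick on my part rather than one proposed above, is

\[
\epsilon(\gamma) = (1-\gamma)^{1/2}, \qquad \beta(\gamma) = (1-\gamma)^{-1/3},
\]

which satisfies ω(1−γ) ≤ ϵ(γ) ≤ o(1) and ω((1−γ)^{2/3}) ≤ β(γ)^{−1} ≤ o(1) as γ→1, since (1−γ)^{1/2}/(1−γ) → ∞ and (1−γ)^{1/3}/(1−γ)^{2/3} → ∞ while both quantities themselves go to 0.]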
Alpha only needs to be set based on a guess about what the prior on the truth is. It doesn’t need to be set based on guesses about possibly countably many traps of varying advisor-probability.
I’m not sure I understand whether you were saying the ratio of probabilities that the advisor vs. agent takes an unsafe action can indeed be bounded in DL(I)RL.
Hmm, yes, I think the difference comes from imitation vs. RL. In your setting, you only care about producing a good imitation of the advisor. On the other hand, in my settings I want to achieve near-optimal performance (which the advisor doesn’t achieve). So I need stronger assumptions.
Well, in DLIRL the probability that the advisor takes an unsafe action on any given round is bounded by roughly e^{−β}, whereas the probability that the agent takes an unsafe action over a duration of (1−γ)^{−1} is bounded by roughly β^{−1}(1−γ)^{−2/3}, so it’s not a ratio but there is some relationship. I’m sure you can derive some relationship in DLRL too, but I haven’t studied it (like I said, I only worked out the details when the advisor never takes unsafe actions).
That makes sense.
Neat, makes sense.
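[A back-of-the-envelope reading of the DLIRL relationship above, comparing the two quoted upper bounds directly; the arithmetic is mine, and a ratio of upper bounds is not a bound on the actual ratio of probabilities that was asked about. Over a duration of (1−γ)^{−1} rounds, a union bound gives the advisor roughly e^{−β}(1−γ)^{−1} total probability of taking an unsafe action, versus roughly β^{−1}(1−γ)^{−2/3} for the agent, so the two bounds differ by a factor of

\[
\frac{\beta^{-1}(1-\gamma)^{-2/3}}{e^{-\beta}(1-\gamma)^{-1}}
= \frac{e^{\beta}}{\beta}\,(1-\gamma)^{1/3}.
\]

In other words, the agent’s bound trades the advisor’s exponential dependence on β for a polynomial one while gaining a (1−γ)^{1/3} factor, and which bound is smaller depends roughly on how β compares to ln(1/(1−γ)).]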