Alpha only needs to be set based on a guess about what the prior on the truth is. It doesn’t need to be set based on guesses about possibly countably many traps of varying advisor-probability.
I’m not sure I understand whether you were saying that the ratio of the probabilities that the advisor vs. the agent takes an unsafe action can indeed be bounded in DL(I)RL.
Alpha only needs to be set based on a guess about what the prior on the truth is. It doesn’t need to be set based on guesses about possibly countably many traps of varying advisor-probability.
Hmm, yes, I think the difference comes from imitation vs. RL. In your setting, you only care about producing a good imitation of the advisor. In my settings, on the other hand, I want to achieve near-optimal performance (which the advisor doesn’t achieve), so I need stronger assumptions.
I’m not sure I understand whether you were saying that the ratio of the probabilities that the advisor vs. the agent takes an unsafe action can indeed be bounded in DL(I)RL.
Well, in DLIRL the probability that the advisor takes an unsafe action on any given round is bounded by roughly $e^{-\beta}$, whereas the probability that the agent takes an unsafe action over a duration of $(1-\gamma)^{-1}$ is bounded by roughly $\beta^{-1}(1-\gamma)^{-\frac{2}{3}}$, so it’s not a ratio, but there is some relationship. I’m sure you can derive some relationship in DLRL too, but I haven’t studied it (like I said, I only worked out the details when the advisor never takes unsafe actions).
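(As a rough illustration of that relationship, not a statement of the actual theorem: taking the $-\tfrac{2}{3}$ exponent above at face value and union-bounding the advisor’s per-round $e^{-\beta}$ over the same $(1-\gamma)^{-1}$ rounds, the two bounds compare as

$$
\frac{\beta^{-1}(1-\gamma)^{-\frac{2}{3}}}{(1-\gamma)^{-1}\,e^{-\beta}} \;\approx\; \frac{e^{\beta}}{\beta}\,(1-\gamma)^{\frac{1}{3}},
$$

so at this crude level the “ratio” of the agent’s bound to the advisor’s scales like $e^{\beta}\beta^{-1}(1-\gamma)^{1/3}$.)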
That makes sense.