> The way to avoid outputting predictions that may have been corrupted by a mesa-optimizer is to ask for help when plausible stochastic models disagree about probabilities.
This is exactly the method used in my paper about delegative RL and also an earlier essay that doesn’t assume finite MDPs or access to rewards but has stronger assumptions about the advisor (so it’s essentially doing imitation learning + denoising). I pointed out the connection to mesa-optimizers (which I called “daemons”) in another essay.
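For concreteness, here is a minimal sketch of the "ask for help when plausible models disagree" rule described in the quoted passage. This is not the algorithm from the paper under discussion or from the DL(I)RL papers: the posterior update is left out, and the plausibility threshold `alpha`, the total-variation test, and the function name are all illustrative assumptions.

```python
import numpy as np

def act_or_defer(posterior, model_preds, alpha=0.1, tol=0.05):
    """Toy version of 'act only when plausible models agree, otherwise ask for help'.

    posterior:   posterior weights over candidate demonstrator models, shape (n_models,)
    model_preds: each model's predicted action distribution for the current
                 state, shape (n_models, n_actions)
    alpha:       a model counts as 'plausible' if its weight is at least
                 alpha times the largest weight (illustrative threshold)
    tol:         maximum total-variation disagreement tolerated among
                 plausible models before deferring to the demonstrator
    """
    posterior = np.asarray(posterior, dtype=float)
    model_preds = np.asarray(model_preds, dtype=float)

    plausible_mask = posterior >= alpha * posterior.max()
    plausible = model_preds[plausible_mask]

    # Worst-case pairwise total-variation distance among the plausible models.
    disagreement = max(
        0.5 * np.abs(p - q).sum() for p in plausible for q in plausible
    )
    if disagreement > tol:
        return "defer"  # plausible models disagree: ask the demonstrator to act

    # Otherwise act on the posterior mixture of the plausible models' predictions.
    weights = posterior[plausible_mask]
    mixture = weights @ plausible / weights.sum()
    return np.random.choice(len(mixture), p=mixture)
```

For example, two equally weighted models predicting [0.9, 0.1] and [0.1, 0.9] have a total-variation distance of 0.8 and trigger deferral, while near-identical predictions let the agent act on its own.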
If I’m interpreting the algorithm on page 7 correctly, it looks like, if there’s a trap that the demonstrator falls into with probability $10^{-9}$, there’s no limit on the probability that the agent falls into the trap?
Also, and maybe relatedly, do the demonstrator-models in the paper have to be deterministic?
Yes to the first question. In the DLRL paper I assume the advisor takes unsafe actions with probability exactly 0. However, it is straightforward to generalize the result s.t. the advisor can take unsafe actions with probability $\delta \ll \epsilon$, where $\epsilon$ is the lower bound on the probability of taking an optimal action (Definition 8). Moreover, in DLIRL (which, I believe, is closer to your setting) I use a “soft” assumption (see Definition 3 there) that doesn’t require any probability to vanish entirely.
No to the second question. In neither setting is the advisor assumed to be deterministic.
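One hedged way to spell the generalization out in symbols (a paraphrase rather than the paper’s actual Definition 8; the notation $\pi^{\mathrm{adv}}$ for the advisor’s action distribution and $A_{\mathrm{unsafe}}(h)$, $A_{\mathrm{opt}}(h)$ for the unsafe and optimal actions after history $h$ is introduced here just for illustration):

$$\pi^{\mathrm{adv}}\big(A_{\mathrm{unsafe}}(h) \mid h\big) \le \delta
\qquad\text{and}\qquad
\pi^{\mathrm{adv}}\big(A_{\mathrm{opt}}(h) \mid h\big) \ge \epsilon,
\qquad \text{with } \delta \ll \epsilon,$$

for every history $h$. Under this reading, an advisor that falls into the $10^{-9}$ trap from the earlier question is admissible as long as $\delta = 10^{-9}$ is much smaller than its guaranteed probability $\epsilon$ of playing optimally, say $\epsilon = 0.01$.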
> In neither setting is the advisor assumed to be deterministic.
Okay, then I assume the agent’s models of the advisor are not exclusively deterministic either?
> However, it is straightforward to generalize the result s.t. the advisor can take unsafe actions with probability $\delta \ll \epsilon$, where $\epsilon$ is the lower bound on the probability of taking an optimal action
What I care most about is the ratio of probabilities that the advisor vs. agent takes the unsafe action, where we as programmers don’t know (so the agent doesn’t get told at the beginning) any bounds on what these advisor-probabilities are. Can this modification be recast to have that property? Or does it already?
> Okay, then I assume the agent’s models of the advisor are not exclusively deterministic either?
Of course. I assume realizability, so one of the hypotheses is the true advisor behavior, which is stochastic.
> What I care most about is the ratio of probabilities that the advisor vs. agent takes the unsafe action, where we as programmers don’t know (so the agent doesn’t get told at the beginning) any bounds on what these advisor-probabilities are. Can this modification be recast to have that property? Or does it already?
In order to achieve the optimal regret bound, you do need to know the values of $\delta$ and $\epsilon$. In DLIRL, you need to know $\beta$. However, AFAIU your algorithm also depends on some parameter ($\alpha$)? In principle, if you don’t know anything about the parameters, you can set them to be some function of the time discount s.t. as $\gamma \to 1$ the bound becomes true and the regret still goes to 0. In DLRL, this requires $\omega(1-\gamma) \leq \epsilon(\gamma) \leq o(1)$; in DLIRL, $\omega\big((1-\gamma)^{2/3}\big) \leq \beta(\gamma)^{-1} \leq o(1)$. However, then you only know that the regret vanishes at a certain asymptotic rate, without having a quantitative bound.
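As one concrete schedule satisfying these conditions (an illustrative choice, not taken from either paper):

$$\epsilon(\gamma) = \sqrt{1-\gamma}, \qquad \beta(\gamma) = (1-\gamma)^{-1/3},$$

which sit strictly inside the required ranges: $\sqrt{1-\gamma}$ is both $\omega(1-\gamma)$ and $o(1)$, and $\beta(\gamma)^{-1} = (1-\gamma)^{1/3}$ is both $\omega\big((1-\gamma)^{2/3}\big)$ and $o(1)$.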
That makes sense.

Alpha only needs to be set based on a guess about what the prior on the truth is. It doesn’t need to be set based on guesses about possibly countably many traps of varying advisor-probability.
I’m not sure I understand whether you were saying that the ratio of probabilities that the advisor vs. agent takes an unsafe action can indeed be bounded in DL(I)RL.
> Alpha only needs to be set based on a guess about what the prior on the truth is. It doesn’t need to be set based on guesses about possibly countably many traps of varying advisor-probability.
Hmm, yes, I think the difference comes from imitation vs. RL. In your setting, you only care about producing a good imitation of the advisor. On the other hand, in my settings I want to achieve near-optimal performance (which the advisor doesn’t achieve). So I need stronger assumptions.
> I’m not sure I understand whether you were saying that the ratio of probabilities that the advisor vs. agent takes an unsafe action can indeed be bounded in DL(I)RL.
Well, in DLIRL the probability that the advisor takes an unsafe action on any given round is bounded by roughly $e^{-\beta}$, whereas the probability that the agent takes an unsafe action over a duration of $(1-\gamma)^{-1}$ is bounded by roughly $\beta^{-1}(1-\gamma)^{-2/3}$, so it’s not a ratio, but there is some relationship. I’m sure you can derive some relationship in DLRL too, but I haven’t studied it (like I said, I only worked out the details when the advisor never takes unsafe actions).
Neat, makes sense.
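To get a rough sense of scale, here is a tiny sketch that plugs illustrative numbers into the two bounds quoted in the last reply; the hidden constants behind “roughly” are ignored, and the particular values of $\beta$ and $\gamma$ are made up for the example.

```python
import math

# Illustrative numbers only: a fairly rational advisor and a long horizon.
beta = 300.0                      # advisor rationality parameter (hypothetical)
gamma = 0.999                     # time discount
horizon = 1.0 / (1.0 - gamma)     # effective duration of about 1000 rounds

# Advisor: unsafe-action probability on any given round, roughly e^(-beta).
advisor_per_round = math.exp(-beta)

# Agent: unsafe-action probability over a duration of (1 - gamma)^(-1),
# roughly beta^(-1) * (1 - gamma)^(-2/3) as quoted in the reply above.
agent_over_horizon = (1.0 / beta) * (1.0 - gamma) ** (-2.0 / 3.0)

print(f"horizon                    ~ {horizon:.0f} rounds")
print(f"advisor, per round         ~ {advisor_per_round:.2e}")
# Naive union bound over the horizon, just for comparison.
print(f"advisor, over the horizon  ~ {horizon * advisor_per_round:.2e}")
print(f"agent, over the horizon    ~ {agent_over_horizon:.2f}")
```

With these made-up numbers the advisor’s per-round bound is around $5 \times 10^{-131}$, while the agent’s whole-horizon bound is around $0.33$; the two are not directly comparable as a single ratio, which matches the caveat in the reply above.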