(This entire comment is setting aside embedded agency concerns, except for mesa optimization)
You seem to be equivocating between two notions of exploration. Consider an agent that is trained via RL to do well on a distribution of environments, p(E). Then there are two kinds of exploration:
Across-episode exploration: Exploration across training trajectories, where the RL algorithm collects trajectories going to various different parts of the state space in the environment, in order to figure out where the reward is.
Within-episode exploration: Exploration within a single trajectory, where you try to identify which particular E has been sampled, so that you can tailor your trajectory to that E.
In across-episode exploration, the exploration is being done by some human-designed algorithm. (I would claim this of RND, Boltzmann exploration, ϵ-greedy and entropy bonuses.) I agree that these work because you want to tailor your exploration based on the value of information, but the agent isn’t evaluating the value of information and deciding where to explore, the human-designed algorithm is doing that. So mesa optimization is not going to affect this.
In within-episode exploration, the exploration is being done directly by the policy, and so it is reasonable to talk about how a mesa optimizer would do such exploration.
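To make the distinction concrete, here is a minimal sketch (Python pseudocode with a Gym-style environment interface; the policy API is an assumption of mine, not any particular library): the ϵ-greedy branch is across-episode exploration chosen by the human-written training loop, while any within-episode exploration has to be implemented by the learned policy itself, e.g. via recurrent state that tracks what it has figured out about the sampled E so far.

```python
import random

def collect_trajectory(env, policy, epsilon=0.1):
    """Across-episode exploration: this human-written loop decides when to
    take a random action (epsilon-greedy). The policy is never consulted
    about where exploring would be most informative."""
    obs, done = env.reset(), False
    hidden = policy.initial_state()          # recurrent state, if any (assumed API)
    trajectory = []
    while not done:
        if random.random() < epsilon:        # exploration chosen by *this code*
            action = env.action_space.sample()
        else:
            # Within-episode exploration, if it happens at all, happens here:
            # the learned policy can use its recurrent state to gather
            # information about which E it is in and act on that.
            action, hidden = policy.act(obs, hidden)
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
    return trajectory
```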
With that in mind, some thoughts:
With more modern approaches, however—especially policy gradient approaches like PPO that aren’t amenable to something like Boltzmann exploration—the exploration is instead entirely learned, encouraged by some sort of extra term in the loss to implicitly encourage exploratory behavior.
This initially sounds like you are talking about across-episode exploration, but then the phrase “entirely learned” makes me think you are talking about things within the domain of a mesa optimizer, i.e. within-episode exploration. Right now this is just semantics, but I think it plays into my confusion below.
Making ϵ-greedy exploration safe is in some sense quite easy, since the way it explores is totally random.
Usually (in ML) “safe exploration” means “the agent doesn’t make a mistake, even by accident”; ϵ-greedy exploration wouldn’t be safe in that sense, since it can fall into traps. I’m assuming that by “safe exploration” you mean “when the agent explores, it is not trying to deceive us / hurt us / etc”.
exploration should arise naturally as an instrumental goal of pursuing the given reward function—though current RL methods aren’t quite good enough to get that yet, those methods which are closer to it are starting to perform better.
Since by default policies can’t affect across-episode exploration, I assume you’re talking about within-episode exploration. But this happens all the time with current RL methods, e.g. one consequence of domain randomization was that OpenAI Five would go explore what Roshan’s health was for the current Dota game. In general, it’ll happen with any POMDP with a random initial state. We even have examples of this where you are (kind of) exploring the objective: Learning to Interactively Learn and Assist.
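As a toy illustration of that last point (my own construction, not an example from the post): in a POMDP where a reward-relevant variable is resampled each episode, the return-maximizing behavior is to spend a step observing it before acting, so within-episode exploration falls out of ordinary reward maximization with no extra machinery.

```python
import random

# Toy POMDP: each episode a hidden bit is sampled uniformly. The agent may
# first PEEK (small cost) to observe the bit, then must GUESS it (reward 1
# if correct, 0 otherwise). Peeking is within-episode exploration, and it is
# instrumentally useful for the ordinary reward.
PEEK_COST = 0.1

def run_episode(policy):
    hidden_bit = random.randint(0, 1)
    total_reward = 0.0
    if policy == "peek_then_guess":
        observation = hidden_bit          # pay to observe the hidden state
        total_reward -= PEEK_COST
        guess = observation               # now guaranteed correct
    else:                                 # "guess_blindly"
        guess = random.randint(0, 1)      # 50% chance of being right
    total_reward += 1.0 if guess == hidden_bit else 0.0
    return total_reward

# Expected returns: ~0.9 for peeking vs ~0.5 for guessing blindly, so the
# return-maximizing policy explores within the episode.
for policy in ("peek_then_guess", "guess_blindly"):
    avg = sum(run_episode(policy) for _ in range(10_000)) / 10_000
    print(policy, round(avg, 2))
```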
instrumental exploration gives us capability exploration but not objective exploration.
As you mention later, you would get objective exploration if the agent had uncertainty over the objective.
An agent is cooperation corrigible if it optimizes under uncertainty over what goal you might want it to have.
This sounds to me like reward uncertainty, assistance games / CIRL, and more generally Stuart Russell’s agenda, except applied to mesa optimization now. Should I take away something other than “we should have our mesa optimizers behave like the AIs in assistance games”? I feel like you are trying to say something else but I don’t know what.
Finally, what does this tell us about safe exploration and how to think about current safe exploration research? Current safe exploration research tends to focus on the avoidance of traps in the environment.
I thought we were talking about “the agent doesn’t try to deceive us / hurt us by exploring”, which wouldn’t tell us anything about the problem of “the agent doesn’t make an accidental mistake”.
(Aside: these problems should not both be called safe exploration; they seem ~unrelated to me.)
What about objective exploration—how do we do it properly?
The same way as capability exploration; based on value of information (VoI). (I assume you have a well-specified distribution over objectives; if you don’t, then there is no proper way to do it, in the same way there’s no proper way to do capability exploration without a prior over what you might see when you take the new action.)
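A minimal sketch of what "based on VoI" means here (the two-objective setup and all the numbers are invented for illustration): with an explicit prior over objectives, the value of an objective-exploration step, such as a clarifying query to the overseer, is the expected improvement in the subsequent decision, and you should take that step exactly when the improvement exceeds its cost.

```python
# Value of information for objective exploration, under a made-up prior.
# Two candidate objectives the overseer might want; two available plans.
prior = {"obj_A": 0.6, "obj_B": 0.4}
plan_value = {                      # value of each plan under each objective
    "plan_1": {"obj_A": 10.0, "obj_B": 0.0},
    "plan_2": {"obj_A": 4.0, "obj_B": 4.0},
}

def expected_value(plan, beliefs):
    return sum(p * plan_value[plan][obj] for obj, p in beliefs.items())

# Value of acting now under the prior (pick the best plan in expectation).
value_act_now = max(expected_value(p, prior) for p in plan_value)

# Value after a (hypothetical) query that reveals the true objective:
# for each possible answer, pick the best plan for that objective.
value_after_query = sum(
    prob * max(plan_value[p][obj] for p in plan_value)
    for obj, prob in prior.items()
)

voi = value_after_query - value_act_now
query_cost = 1.0                    # e.g. the overseer's time, also made up
print(f"VoI = {voi:.1f}; ask the question iff VoI > cost ({query_cost})")
```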
And do we need measures to put a damper on objective exploration as well?
You only need to put dampers on exploration if you’re concerned that the agent cannot make proper VoI calculations for optimal exploration. (Alternatively, you can remove exploration altogether if you can provide the information that would be gained via exploration some other way, e.g. from human input; this allows you to avoid the otherwise-unavoidable regret incurred through exploration.)
My perspective is that Safety Gym and things like it are proposing that we specify objectives via rewards + constraints, because incorporating the constraints into the reward function is difficult (it requires tuning a hyperparameter that specifies the tradeoff between obtaining reward and avoiding constraint violations). Separately, they also propose that we measure reward / constraint violations throughout training, as a way to assess regret (rather than just test-time performance). The algorithms used are not putting dampers on exploration; they are trying to get the agent to do better exploration (e.g. if you crashed into the wall and saw that that violated a constraint, don’t crash into the wall again just because you forgot about that experience).
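To be concrete about the "rewards + constraints" framing, here is a generic Lagrangian-style constrained-RL sketch (my own illustration, not Safety Gym's actual API or any specific published algorithm): the designer specifies a cost limit directly, and the reward/cost trade-off coefficient becomes a multiplier adapted during training rather than a hand-tuned weight.

```python
# Generic Lagrangian sketch of "rewards + constraints" (illustrative only).
cost_limit = 25.0   # d: allowed expected constraint cost per episode
lam_lr = 0.01       # step size for the multiplier update

def lagrangian_objective(episode_reward, episode_cost, lam):
    """What the policy update actually maximizes."""
    return episode_reward - lam * episode_cost

def update_multiplier(lam, episode_cost):
    """Dual ascent: raise lam while the constraint is violated, lower it otherwise."""
    return max(0.0, lam + lam_lr * (episode_cost - cost_limit))

# Separately, the benchmark-style metric: constraint cost summed over *all* of
# training (not just at test time), so the exploration process itself is part
# of what gets scored.
```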
And what about cooperation corrigibility—is the “right” way to put a damper on exploration through constraints or through uncertainty?
If you have the right uncertainty, then acting optimally to maximize that is the “right” thing to do.
I completely agree with the distinction between across-episode vs. within-episode exploration, and I agree I should have been clearer about that. Mostly I want to talk about across-episode exploration here, though when I was writing this post I was mostly motivated by the online learning case, where the distinction is somewhat blurred, since in that setting the deployment policy does in fact need to balance within-episode and across-episode exploration.
Usually (in ML) “safe exploration” means “the agent doesn’t make a mistake, even by accident”; ϵ-greedy exploration wouldn’t be safe in that sense, since it can fall into traps. I’m assuming that by “safe exploration” you mean “when the agent explores, it is not trying to deceive us / hurt us / etc”.
Agreed. My point is that “If you assume that the policy without exploration is safe, then for ϵ-greedy exploration to be safe on average, it just needs to be the case that the environment is safe on average, which is just a standard engineering question.” That is, even though it seems like it’s hard for ϵ-greedy exploration to be safe, it’s actually quite easy for it to be safe on average—you just need to be in a safe environment. That’s not true for learned exploration, though.
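As a back-of-the-envelope version of "safe on average" (the numbers below are made up for illustration): the expected harm under ϵ-greedy is just a weighted average of the base policy's harm and the average harm of a random action, so it is controlled by a property of the environment rather than by anything the learned policy decides.

```python
# Expected harm under epsilon-greedy, assuming the policy without exploration
# is safe. The numbers are illustrative only.
epsilon = 0.05
harm_of_policy = 0.0              # assumed-safe base policy
avg_harm_of_random_action = 0.2   # a property of the environment, not the agent

expected_harm = (1 - epsilon) * harm_of_policy + epsilon * avg_harm_of_random_action
print(expected_harm)              # small whenever the environment is safe on average
```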
Since by default policies can’t affect across-episode exploration, I assume you’re talking about within-episode exploration. But this happens all the time with current RL methods
Yeah, I agree that was confusing—I’ll rephrase it. The point I was trying to make was that across-episode exploration should arise naturally, since an agent with a fixed objective should want to be modified to better pursue that objective, but not want to be modified to pursue a different objective.
This sounds to me like reward uncertainty, assistance games / CIRL, and more generally Stuart Russell’s agenda, except applied to mesa optimization now. Should I take away something other than “we should have our mesa optimizers behave like the AIs in assistance games”? I feel like you are trying to say something else but I don’t know what.
Agreed that there’s a similarity there—that’s the motivation for calling it “cooperative.” But I’m not trying to advocate for that agenda here—I’m just trying to better classify the different types of corrigibility and understand how they work. In fact, I think it’s quite plausible that you could get away without cooperative corrigibility, though I don’t really want to take a stand on that right now.
I thought we were talking about “the agent doesn’t try to deceive us / hurt us by exploring”, which wouldn’t tell us anything about the problem of “the agent doesn’t make an accidental mistake”.
If your definition of “safe exploration” is “not making accidental mistakes” then I agree that what I’m pointing at doesn’t fall under that heading. What I’m trying to point at is that I think there are other problems we need to figure out about how models explore beyond just the “not making accidental mistakes” problem, though I have no strong feelings about whether or not to call those other problems “safe exploration” problems.
The same way as capability exploration; based on value of information (VoI). (I assume you have a well-specified distribution over objectives; if you don’t, then there is no proper way to do it, in the same way there’s no proper way to do capability exploration without a prior over what you might see when you take the new action.)
Agreed, though I don’t think that’s the end of the story. In particular, I don’t think it’s at all obvious what an agent that cares about the value of information that its actions produce relative to some objective distribution will look like, how you could get such an agent, or how you could verify when you had such an agent. And, even if you could do those things, it still seems pretty unclear to me what the right distribution over objectives should be and how you should learn it.
The algorithms used are not putting dampers on exploration; they are trying to get the agent to do better exploration (e.g. if you crashed into the wall and saw that that violated a constraint, don’t crash into the wall again just because you forgot about that experience).
Well, what does “better exploration” mean? Better across-episode exploration or better within-episode exploration? Better relative to the base objective or better relative to the mesa-objective? I think it tends to be “better within-episode exploration relative to the base objective,” which I would call putting a damper on instrumental exploration, since instrumental exploration does across-episode and within-episode exploration only for the mesa-objective, not the base objective.
If you have the right uncertainty, then acting optimally to maximize that is the “right” thing to do.
Sure, but as you note getting the right uncertainty could be quite difficult, so for practical purposes my question is still unanswered.
The point I was trying to make was that across-episode exploration should arise naturally
Are you saying that across-episode exploration should arise naturally when applying a deep RL algorithm? I disagree with that, at least in the episodic case; the deep RL algorithm optimizes within an episode, not across episodes. (With online learning, I think I still disagree but I’d want to specify an algorithm first.)
If for some reason you applied a planning algorithm that planned across episodes (quite a weird thing to do), then I suppose it would arise naturally; but that didn’t sound like what you were saying.
If your definition of “safe exploration” is “not making accidental mistakes” then I agree that what I’m pointing at doesn’t fall under that heading.
But in your post, you said:
Finally, what does this tell us about safe exploration and how to think about current safe exploration research? Current safe exploration research tends to focus on the avoidance of traps in the environment.
Isn’t that entire paragraph about the “not making accidental mistakes” line of research?
Well, what does “better exploration” mean? Better across-episode exploration or better within-episode exploration? Better relative to the base objective or better relative to the mesa-objective?
I was talking about Safety Gym and algorithms meant for it here. Safety Gym explicitly measures the total number of constraint violations across all of training; this seems pretty clearly about across-episode exploration (since it’s across all training) relative to the base objective (the constraint specification is in the base objective; also there just aren’t any mesa-objectives because the policies are not mesa optimizers).
putting a damper on instrumental exploration, which does across-episode and within-episode exploration only for the mesa-objective
I continue to be confused about how instrumental / learned exploration happens across episodes.
I am also confused about the model here—is the idea that if you do better exploration for the base objective, then the mesa optimizer doesn’t need to do exploration for the mesa objective? If so, why is that true, and even if it is true, why does it matter, since presumably the mesa optimizer then already knows the information it would have gotten via exploration?
I think I’d benefit a lot from a concrete example (i.e. pick an environment and an algorithm; talk about what happens in the limit of lots of compute / data, feel free to assume that a mesa optimizer is created).