The point I was trying to make was that across-episode exploration should arise naturally
Are you saying that across-episode exploration should arise naturally when applying a deep RL algorithm? I disagree with that, at least in the episodic case; the deep RL algorithm optimizes within an episode, not across episodes. (With online learning, I think I still disagree but I’d want to specify an algorithm first.)
If for some reason you applied a planning algorithm that planned across episodes (quite a weird thing to do), then I suppose it would arise naturally; but that didn't sound like what you were saying.
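To make the within-episode point concrete, here's a rough sketch of a generic episodic policy-gradient (REINFORCE-style) loop. The environment, interface, and hyperparameters are placeholders, not any particular codebase; the point is just that the only quantity the update ever sees is the return of a single episode, so nothing in the objective rewards giving up return in one episode to gain information for later episodes.

```python
# Minimal sketch of an episodic policy-gradient loop, assuming a hypothetical
# gym-style environment with integer observations, discrete actions, and the
# old (obs, reward, done, info) step signature. Each update is driven only by
# the return of a single episode, i.e. within-episode optimization.
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def run_episode(env, theta, rng):
    """Roll out one episode; return (log-prob gradient accumulator, episode return)."""
    obs, done = env.reset(), False
    grad, ep_return = np.zeros_like(theta), 0.0
    while not done:
        probs = softmax(theta[obs])              # per-state action logits
        a = rng.choice(len(probs), p=probs)      # sampling = within-episode exploration
        next_obs, r, done, _ = env.step(a)
        one_hot = np.eye(len(probs))[a]
        grad[obs] += one_hot - probs             # d/dtheta of log pi(a | obs)
        ep_return += r
        obs = next_obs
    return grad, ep_return

def train(env, n_states, n_actions, episodes=1000, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        grad, ep_return = run_episode(env, theta, rng)
        # The update signal is (this episode's return) * (log-prob gradient).
        # There is no term that values information paying off only in future episodes.
        theta += lr * ep_return * grad
    return theta
```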
If your definition of "safe exploration" is "not making accidental mistakes", then I agree that what I'm pointing at doesn't fall under that heading.
But in your post, you said:
Finally, what does this tell us about safe exploration and how to think about current safe exploration research? Current safe exploration research tends to focus on the avoidance of traps in the environment.
Isn’t that entire paragraph about the “not making accidental mistakes” line of research?
Well, what does “better exploration” mean? Better across-episode exploration or better within-episode exploration? Better relative to the base objective or better relative to the mesa-objective?
I was talking about Safety Gym and algorithms meant for it here. Safety Gym explicitly measures the total number of constraint violations across all of training; this seems pretty clearly about across-episode exploration (since it's measured across all of training) relative to the base objective (the constraint specification is part of the base objective; also, there just aren't any mesa-objectives, because the policies are not mesa-optimizers).
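For reference, the metric I have in mind looks roughly like the sketch below. It assumes the Safety Gym convention of reporting a per-step constraint cost in `info['cost']`; the environment id, episode count, and the `agent.act` / `agent.observe` interface are placeholders rather than a specific algorithm. The number being tracked is summed over every step of every training episode, which is why I call it an across-episode, base-objective notion of exploration.

```python
# Sketch of how a Safety Gym-style benchmark tallies constraint violations over
# *all* of training, not per episode. Assumes the Safety Gym convention that
# env.step returns a per-step constraint cost in info['cost']; the env id and
# the agent interface are placeholders.
import gym
import safety_gym  # noqa: F401  (registers the Safety Gym environments)

def train_and_count_costs(agent, env_id="Safexp-PointGoal1-v0", total_episodes=1000):
    env = gym.make(env_id)
    cumulative_cost = 0.0                      # summed over the whole of training
    for _ in range(total_episodes):
        obs, done = env.reset(), False
        while not done:
            action = agent.act(obs)            # placeholder agent interface
            obs, reward, done, info = env.step(action)
            cumulative_cost += info.get("cost", 0.0)
            agent.observe(obs, reward, done, info)
    # This is the quantity the benchmark cares about: every constraint violation
    # during training counts, so "exploring safely" has to hold across episodes.
    return cumulative_cost
```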
putting a damper on instrumental exploration, which does across-episode and within-episode exploration only for the mesa-objective
I continue to be confused about how instrumental / learned exploration happens across episodes.
I am also confused about the model here: is the idea that if you do better exploration for the base objective, then the mesa-optimizer doesn't need to do exploration for the mesa-objective? If so, why is that true? And even if it is true, why does it matter, since presumably the mesa-optimizer then already knows the information it would have gotten via exploration?
I think I'd benefit a lot from a concrete example (i.e., pick an environment and an algorithm, and talk about what happens in the limit of lots of compute / data; feel free to assume that a mesa-optimizer is created).