So to sum up so far, the basic idea is to shoot for a specific expected value of something by stochastically combining policies that have expected values above and below the target. The policies to be combined should be picked from some “mostly safe” distribution rather than from whatever policies are closest to the specific target, because the absolute closest policies might involve inner optimization for exactly that target, when we really want “do something reasonable that gets close to the target.”
And the “aspiration updating” thing is a way to track which policy you think you’re shooting for, in a way that you’re hoping generalizes decently to cases where you have limited planning ability?
Exactly! Thanks for providing this concise summary in your own words.
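To make that summary concrete, here is a minimal Python sketch of the mixing step, under the simplifying assumption that we already have a set of candidate policies with known expected values; the names `mix_for_target`, `candidates`, and the bracketing strategy are illustrative choices for this sketch, not the exact construction from the sequence.

```python
import random

def mix_for_target(candidates, target, rng=None):
    """Choose between two 'mostly safe' candidate policies whose expected
    values bracket `target`, with probabilities that make the mixture's
    expected value equal the target.

    `candidates` is a list of (policy, expected_value) pairs, assumed to
    have been drawn from some mostly-safe distribution beforehand.
    """
    rng = rng or random.Random()
    below = [(p, v) for p, v in candidates if v <= target]
    above = [(p, v) for p, v in candidates if v >= target]
    if not below or not above:
        raise ValueError("target is not bracketed by the candidate policies")
    # Take one candidate from each side of the target rather than insisting
    # on the pair that is closest to it, since (as noted above) the closest
    # policies might involve inner optimization for exactly that target.
    pol_lo, v_lo = rng.choice(below)
    pol_hi, v_hi = rng.choice(above)
    if v_hi == v_lo:  # some candidate hits the target exactly
        return pol_lo
    # Probability of the 'high' policy, chosen so that
    # p * v_hi + (1 - p) * v_lo == target.
    p = (target - v_lo) / (v_hi - v_lo)
    return pol_hi if rng.random() < p else pol_lo
```

In a sequential setting, a step like this would roughly be repeated at every state with an updated target, which is the kind of bookkeeping the “aspiration updating” question above refers to.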
In the next post we generalize the target from a single point to an interval, which gives us even more freedom that we can use to increase safety further.
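As a rough illustration of why an interval target buys extra freedom (a sketch of the general idea, not the next post’s exact algorithm): if some mostly-safe candidate’s expected value already lands inside the interval, no mixing is needed at all, and the remaining slack can be spent on other criteria. This sketch reuses `mix_for_target` from above.

```python
import random

def choose_for_interval(candidates, lo, hi, rng=None):
    """Interval version: if some candidate's expected value already lies
    inside [lo, hi], use that policy directly; otherwise fall back to
    mixing across the interval (reuses mix_for_target from the sketch above).
    """
    rng = rng or random.Random()
    inside = [p for p, v in candidates if lo <= v <= hi]
    if inside:
        # The interval's slack means no mixing is needed here at all; the
        # remaining freedom could be used to break ties on other criteria.
        return rng.choice(inside)
    # No candidate lands inside the interval, so mix to a point within it.
    # (If all candidates lie on one side of the interval, mix_for_target
    # will report that the target cannot be bracketed.)
    return mix_for_target(candidates, 0.5 * (lo + hi), rng)
```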
In our current ongoing work, we generalize that further to the case of multiple evaluation metrics, in order to get closer to plausible real-world goals; see our teaser post.