Does the notation get flipped at some point? In the abstract you say
“prior policy π0”
and
“there are arbitrarily well-performing policies π”
But then later you say
“This strongly penalizes π0 taking actions the base policy never takes”
Which makes it sound like they’re switched.
I also notice that you call it “prior policy”, “base policy” and “reference policy” at different times; these all make sense, but it’d be a bit nicer if one phrase were used consistently.
The third one was a typo which I just fixed. I have also changed it to use “base policy” everywhere to be consistent, although this may change depending on what terminology is most common in an ML context, which I’m not sure of.