Good point, changed
"by setting the baseline to the last state in which a reward was achieved."
to
"by using the inaction baseline, and resetting its initial state to the current state whenever a reward is achieved."
Good point, changed
to