You mention converging to a deterministic policy is bad because of repetition, but did I miss you addressing that it’s also bad because we want diversity? (Edit: now that I reread that sentence, it makes no sense. Sorry!) In some sense we don’t want RL in the limit, we want something a little more aware that we want to sample from a distribution and get lots of different continuations that are all pretty good.
You mention converging to a deterministic policy is bad because of repetition, but did I miss you addressing that it’s also bad because we want diversity? (Edit: now that I reread that sentence, it makes no sense. Sorry!) In some sense we don’t want RL in the limit, we want something a little more aware that we want to sample from a distribution and get lots of different continuations that are all pretty good.