Jeremy Gillen comments on Soft optimization makes the value target bigger

Jeremy Gillen 5 Jan 2023 15:32 UTC
3 points
0
- On it always being a rescaled subset: Nice! This explains the results of my empirical experiments. Jessica made a similar argument for why quantilizers are optimal, but I hadn’t gotten around to trying to adapt it to this slightly different situation. It makes sense now that the maximin distribution is like quantilizing against the value lower bound, except that the value lower bound changes if you change the minimax distribution. This explains why some of the distributions are exactly quantilizers but some not, it depends on whether that value lower bound drops lower than the start of the policy distribution.
- On planning: Yeah it might be hard to factorize the final policy distribution. But I think it will be easy to approximately factorize the prior in lots of different ways. And I’m hopeful that we can prove that some approximate factorizations maintain the same q value, or maybe only have a small impact on the q value. Haven’t done any work on this yet.
  - If it turns out we need near-exact factorizations, we might still be able to use sampling techniques like rejection sampling to correct an approximate sampling distribution, because we have easy access to the correct density of samples that we have generated (just prior/q), we just need an approximate distribution to use for getting high value samples more often, which seems straightforward.