Here are the results of an abbreviated literature search for papers that bring quantile-case concepts into contact with contemporary RL and/or deep learning:
Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning. Christoph Dann, Tor Lattimore, Emma Brunskill. NIPS 2017.
Defines a concept of “Uniform-PAC bound”, which is roughly when p-quantile-case episodic regret scales polynomially in 1/(1−p).
Proves that a Uniform-PAC bound implies:
PAC bound
Uniform high-probability regret bound
Convergence to zero regret with high probability
Constructs an algorithm, UBEV, that has a Uniform-PAC bound
Empirically compares quite favorably to other algorithms with only PAC or regret bounds
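For my own reference, the Uniform-PAC definition is roughly the following (paraphrased from memory; the paper's notation may differ):

```latex
% Let \Delta_k be the regret of episode k: the gap between the optimal
% value and the value of the policy the algorithm plays in episode k.
% An algorithm is Uniform-PAC for confidence \delta if a SINGLE
% probability-(1-\delta) event covers all accuracy levels at once:
\Pr\Big[\, \forall \varepsilon > 0:\;
  \big|\{ k : \Delta_k > \varepsilon \}\big|
  \;\le\; F(\varepsilon, \delta) \,\Big] \;\ge\; 1 - \delta,
% where F is polynomial in 1/\varepsilon and \log(1/\delta).
```

The "for all ε inside one high-probability event" quantifier is what makes the implications work: fixing a single ε recovers a PAC bound, and integrating over the tail of ε yields a high-probability regret bound.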
Policy Certificates: Towards Accountable Reinforcement Learning. Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill. ICML 2019.
Defines an even stronger concept of “IPOC bound”, which implies Uniform-PAC, and also outputs a certified per-episode regret bound along with each proposed action.
Constructs an algorithm ORLC that has an IPOC-bound
Empirically compares favorably to UBEV
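As I understand the IPOC requirement (paraphrasing; check the paper for the exact statement): the algorithm outputs a certificate ε_k alongside each episode's policy, and with high probability every certificate is simultaneously valid and the certificates shrink uniformly:

```latex
% \Delta_k = regret of episode k, \varepsilon_k = certificate output
% before episode k. The IPOC condition, roughly:
\Pr\Big[\, \forall k:\ \Delta_k \le \varepsilon_k
  \ \ \text{and}\ \ \forall \varepsilon > 0:\
  \big|\{ k : \varepsilon_k > \varepsilon \}\big|
  \le F(\varepsilon, \delta) \,\Big] \ \ge\ 1 - \delta.
% On this event \varepsilon_k \ge \Delta_k, so the second clause
% implies the Uniform-PAC condition on the true regrets.
```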
Revisiting Generalization for Deep Learning: PAC-Bayes, Flat Minima, and Generative Models. Gintare Dziugaite. December 2018 PhD thesis under Zoubin Ghahramani.
Lipschitz Lifelong Reinforcement Learning. Erwan Lecarpentier, David Abel, Kavosh Asadi, et al. AAAI 2021.
Defines a pseudometric on the space of all MDPs
Proves that the mapping from an MDP to its optimal Q-function is (pseudo-)Lipschitz
Uses this to construct an algorithm LRMax that can transfer-learn from past similar MDPs while also being PAC-MDP
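The headline structural result, as best I can reconstruct it (the paper's actual pseudometric is defined more carefully, via a local dissimilarity between the two MDPs' models):

```latex
% For MDPs M, \bar{M} on shared state and action spaces, with d(\cdot,\cdot)
% the paper's pseudometric on the space of MDPs and K some constant,
\big| Q^*_{M}(s,a) - Q^*_{\bar{M}}(s,a) \big|
  \;\le\; K \, d\big(M, \bar{M}\big)
  \qquad \text{for all } (s,a),
% so optimal Q-values learned on a past task yield valid bounds on the
% optimal Q-values of any nearby new task, which is what LRMax exploits.
```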
Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation. Jiafan He, Dongruo Zhou, Quanquan Gu. NeurIPS 2021.
Constructs an algorithm FLUTE that has a Uniform-PAC bound with a certain linearity assumption on the structure of the MDP being learned.
Beyond No Regret: Instance-Dependent PAC RL. Andrew Wagenmaker, Max Simchowitz, Kevin Jamieson. August 2021 preprint.
Learning PAC-Bayes Priors for Probabilistic Neural Networks. María Pérez-Ortiz, Omar Rivasplata, Benjamin Guedj, et al. September 2021 preprint.
Tighter Risk Certificates for Neural Networks. María Pérez-Ortiz, Omar Rivasplata, John Shawe-Taylor, Csaba Szepesvári. ICML 2021.
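Both of these Pérez-Ortiz et al. papers build on the classic PAC-Bayes-kl bound (Seeger/Maurer form):

```latex
% Prior \pi fixed before seeing the n-sample S; losses in [0,1];
% \hat{L}_S(\rho), L_D(\rho): expected empirical / true loss under
% posterior \rho. With probability at least 1 - \delta over S,
% simultaneously for ALL posteriors \rho:
\mathrm{kl}\big( \hat{L}_S(\rho) \,\big\|\, L_D(\rho) \big)
  \;\le\; \frac{ \mathrm{KL}(\rho \,\|\, \pi)
           + \ln\tfrac{2\sqrt{n}}{\delta} }{ n },
% where kl is the binary relative entropy.
```

The main lever for tightening the certificate is making KL(ρ‖π) small; learning the prior on a held-out data split (as in the priors paper) is one way to do that.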
PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees. Jonas Rothfuss, Vincent Fortuin, Martin Josifoski, Andreas Krause. ICML 2021.
I would add: How useful is quantilization for specification gaming? Ryan Carey.