Somewhere between worst-case and average-case performance is quantile-case performance, known in SRE circles as percentile latency and widely measured empirically in practice (but rarely estimated in theory). Formally, optimizing p-quantile-case performance looks like $\max_\theta \sup\{k \mid \mathbb{E}_x[[f(\theta, x) > k]] > p\}$ (compare to my expressions below for other cases). My impression is that quantile-case is heavily underexplored in theoretical CS and also underused in ML, with the exceptions of PAC learning and VC theory.
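As a concrete (hypothetical) illustration of the difference between the three objectives, the sketch below estimates worst-case, average-case, and p-quantile-case performance of a toy scoring function by Monte Carlo over x and picks the best θ under each criterion; the function f and the distribution of x are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta, x):
    """Toy performance function (made up for this example); higher is better."""
    return theta * x - 0.5 * theta ** 2

def empirical_cases(theta, xs, p=0.95):
    """Monte Carlo estimates of worst-, average-, and p-quantile-case performance.
    The p-quantile case sup{k : P_x[f(theta, x) > k] > p} is the value of f that is
    exceeded with probability about p, i.e. the empirical (1 - p) quantile."""
    vals = f(theta, xs)
    return vals.min(), vals.mean(), np.quantile(vals, 1.0 - p)

xs = rng.normal(loc=2.0, scale=1.0, size=10_000)  # assumed input distribution
thetas = np.linspace(0.0, 3.0, 301)               # candidate parameters

scores = np.array([empirical_cases(t, xs) for t in thetas])
for i, name in enumerate(["worst-case", "average-case", "0.95-quantile-case"]):
    print(f"{name:>20}: best theta = {thetas[scores[:, i].argmax()]:.2f}")
```

On this toy problem the quantile-case optimum sits strictly between the worst-case optimum (maximally conservative) and the average-case optimum, which is the regime the rest of this post cares about.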
Here are the results of an abbreviated literature search for papers that bring quantile-case concepts into contact with contemporary RL and/or deep learning:
Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning. Christoph Dann, Tor Lattimore, Emma Brunskill. NIPS 2017.
Defines a concept of “Uniform-PAC bound”, which is roughly when p-quantile-case episodic regret scales polynomially in 1/(1−p) (a sketch of the formal definition appears below this entry).
Proves that a Uniform-PAC bound implies:
PAC bound
Uniform high-probability regret bound
Convergence to zero regret with high probability
Constructs an algorithm, UBEV, that has a Uniform-PAC bound
Empirically compares quite favorably to other algorithms with only PAC or regret bounds
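To spell out the link to the quantile-case framing above, here is my sketch of the Uniform-PAC condition (notation mine; see the paper for the exact statement). An algorithm is Uniform-PAC if, with probability at least $1-\delta$, simultaneously for every $\varepsilon > 0$, the number of episodes whose per-episode regret $\Delta_k$ exceeds $\varepsilon$ is bounded:

\[
\bigl|\{\, k : \Delta_k > \varepsilon \,\}\bigr| \;\le\; F\!\Bigl(\tfrac{1}{\varepsilon},\, \log\tfrac{1}{\delta}\Bigr),
\]

where $F$ is polynomial in its arguments (and in problem quantities such as the horizon and the sizes of the state and action spaces). Taking $\delta = 1-p$ turns this into a statement about the p-quantile of per-episode regret, with at worst polynomial dependence on $1/(1-p)$.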
Policy Certificates: Towards Accountable Reinforcement Learning. Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill. ICML 2019.
Defines an even stronger concept of “IPOC bound”, which implies Uniform-PAC and additionally requires the algorithm to output a certified per-episode regret bound alongside each proposed policy.
Constructs an algorithm ORLC that has an IPOC-bound
Empirically compares favorably to UBEV
Revisiting Generalization for Deep Learning: PAC-Bayes, Flat Minima, and Generative Models. Gintare Dziugaite. December 2018 PhD thesis under Zoubin Ghahramani.
Lipschitz Lifelong Reinforcement Learning. Erwan Lecarpentier, David Abel, Kavosh Asadi, et al. AAAI 2021.
Defines a pseudometric on the space of all MDPs
Proves that the mapping from an MDP to its optimal Q-function is (pseudo-)Lipschitz (a simplified numerical check appears below this entry)
Uses this to construct an algorithm LRMax that can transfer-learn from past similar MDPs while also being PAC-MDP
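To give a feel for the Lipschitz claim, here is a simplified special case (my illustration, not the paper's pseudometric or the LRMax algorithm): if two tabular MDPs share the same transitions and discount $\gamma$ and differ only in rewards, then $\|Q_1^* - Q_2^*\|_\infty \le \max_{s,a}|R_1(s,a) - R_2(s,a)|/(1-\gamma)$. The sketch below checks this numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 5, 3, 0.9

# Two MDPs that share a transition kernel P[s, a, s'] but have slightly different rewards.
P = rng.dirichlet(np.ones(S), size=(S, A))
R1 = rng.uniform(0.0, 1.0, size=(S, A))
R2 = R1 + rng.uniform(-0.05, 0.05, size=(S, A))

def optimal_q(P, R, gamma, iters=1000):
    """Optimal Q-function of a tabular MDP via value iteration on Q."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * (P @ Q.max(axis=1))  # Bellman optimality backup
    return Q

gap = np.abs(optimal_q(P, R1, gamma) - optimal_q(P, R2, gamma)).max()
bound = np.abs(R1 - R2).max() / (1.0 - gamma)
print(f"||Q1* - Q2*||_inf = {gap:.4f}  <=  bound {bound:.4f}")
```

The paper's pseudometric handles differences in transition dynamics as well, not just rewards, which is what makes the transfer result non-trivial.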
Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation. Jiafan He, Dongruo Zhou, Quanquan Gu. NeurIPS 2021.
Constructs an algorithm FLUTE that has a Uniform-PAC bound under a certain linearity assumption on the structure of the MDP being learned (one representative such assumption is sketched below).
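I did not check which exact linearity condition FLUTE assumes, but a representative assumption from this literature is the linear MDP condition, in which transitions and rewards are linear in a known $d$-dimensional feature map $\phi$:

\[
\mathbb{P}(s' \mid s, a) = \langle \phi(s, a), \mu(s') \rangle, \qquad r(s, a) = \langle \phi(s, a), \theta \rangle,
\]

with $\mu$ and $\theta$ unknown, so that sample complexity can scale with $d$ rather than with the number of states.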
Beyond No Regret: Instance-Dependent PAC RL. Andrew Wagenmaker, Max Simchowitz, Kevin Jamieson. August 2021 preprint.
Learning PAC-Bayes Priors for Probabilistic Neural Networks. María Pérez-Ortiz, Omar Rivasplata, Benjamin Guedj, et al. September 2021 preprint.
Tighter Risk Certificates for Neural Networks. María Pérez-Ortiz, Omar Rivasplata, John Shawe-Taylor, Csaba Szepesvári. ICML 2021.
PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees. Jonas Rothfuss, Vincent Fortuin, Martin Josifoski, Andreas Krause. ICML 2021.
I would also add How useful is quantilization for specification gaming? by Ryan Carey.