I have unpublished work on that, and there is a similar experiment (with myopic reinforcement learning) in our paper “Misleading meta-objectives and hidden incentives for distributional shift” ( https://sites.google.com/view/safeml-iclr2019/accepted-papers?authuser=0 ).
The environment used in the unpublished work is summarized here: https://docs.google.com/presentation/d/1K6Cblt_kSJBAkVtYRswDgNDvULlP5l7EH09ikP2hK3I/edit?usp=sharing