You might be interested in my work on learning from untrusted data (see also earlier work on aggregating unreliable human input). I think it is pretty relevant to what you discussed, although if you do not think it is, then I would also be pretty interested in understanding that.
Unrelated, but for quantilizers, isn’t the biggest issue going to be that if you need to make a sequence of decisions, the probabilities are going to accumulate and give exponential decay? I don’t see how to make a sequence of 100 decisions in a quantilizing way unless the base distribution of policies is very close to the target policy.
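To make the compounding concrete, here is a minimal sketch (the numbers are my own illustration, not from the comment): if a per-step quantilizer happens to match the target policy's action with probability a at each of n independent decisions, the chance of matching on every step is a**n, which collapses quickly even for a close to 1.

```python
def trajectory_match_prob(per_step_prob: float, n_steps: int) -> float:
    """Probability that all n_steps independently quantilized decisions
    match the target policy's action, assuming a fixed per-step match
    probability (a simplifying assumption for illustration)."""
    return per_step_prob ** n_steps

# Even a 90%-per-step match rate decays to roughly 2.7e-5 over 100 steps:
p = trajectory_match_prob(0.9, 100)
```

This is why the base distribution has to be very close to the target policy: the per-step match probability must be near 1 for the product over 100 decisions to stay non-negligible.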
Looks relevant at first glance! Untrusted data is one of the cases I wanted to think more about.