Warning: I haven’t read the paper so take this with a grain of salt
Here’s how it would go wrong if I understand it right: For exponentially discounted MDPs there’s something called an effective horizon. That means everything after that time is essentially ignored.
You pick a tiny ϵ > 0. Say (without loss of generality) that all utilities u_t ∈ [−1, 1]. Then there is a time t₀ with δ^t₀ < ϵ. So the discounted cumulative utility from anything after t₀ is bounded by c = ϵ/(1−δ) (which follows from summing the geometric series). That's an arbitrarily small constant.
We can now easily construct pairs of sequences for which LDU gives counterintuitive conclusions. E.g. a sequence s₁ which is maximally better than s₂ for every t > t₀ until the end of time, but ever so slightly worse (by more than c in discounted terms) for 0 ≤ t < t₀: the discounted criterion then prefers s₂.
So anything that happens after t₀ is essentially ignored; we have effectively made the problem finite.
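To make this concrete, here is a toy numerical sketch (the particular values of δ, ϵ, and the stream utilities are my own arbitrary choices, not from the paper):

```python
# Toy illustration: with a fixed discount factor delta, a stream that is
# maximally better forever after the effective horizon t0 can still lose
# to one that is only slightly better before t0.

def discounted_sum(utilities, delta):
    """Discounted cumulative utility of a finite prefix of a stream."""
    return sum(delta ** t * u for t, u in enumerate(utilities))

delta = 0.9
epsilon = 1e-4

# Effective horizon: the smallest t0 with delta**t0 < epsilon.
t0 = 0
while delta ** t0 >= epsilon:
    t0 += 1

T = 300  # finite truncation; the neglected tail is O(delta**T / (1 - delta))

# s1: utility 0 before t0, then maximally good (+1) forever after.
# s2: slightly better (+0.001) before t0, then maximally bad (-1) after.
s1 = [0.0] * t0 + [1.0] * (T - t0)
s2 = [0.001] * t0 + [-1.0] * (T - t0)

print(t0)                                                     # effective horizon
print(discounted_sum(s2, delta) > discounted_sum(s1, delta))  # s2 preferred
print(sum(s1) > sum(s2))                # yet s1 is far better undiscounted
```

With these numbers the tiny pre-horizon edge of s₂ outweighs the entire (maximally good) tail of s₁ under the fixed-δ criterion.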
Exponential discounting in MDPs is standard practice. I’m surprised that this is presented as a big advance in infinite ethics as people have certainly thought about this in economics, machine learning and ethics before.
Btw, your meta-MDP probably falls into the category of Bayes-Adaptive MDP (BAMDP) or Bayes-Adaptive partially observable MDP (BAPOMDP) with learned rewards.
Thanks for the response. EDIT: Adam pointed out to me that LDU does not suffer from dictatorship of the present as I originally stated below and as you argued above. What you are saying is true for a fixed discount factor, but in this case we take the limit as δ→1.
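A quick numerical sketch of why the limit matters (toy streams of my own choosing, not from the paper): a stream that loses under a small fixed δ can win once δ is close enough to 1, because its better tail eventually dominates.

```python
def discounted_sum(utilities, delta):
    """Discounted cumulative utility of a finite prefix of a stream."""
    return sum(delta ** t * u for t, u in enumerate(utilities))

t0, T = 88, 100_000  # horizon split and a long finite truncation
s1 = [0.0] * t0 + [1.0] * (T - t0)     # worse early, maximally good late
s2 = [0.001] * t0 + [-1.0] * (T - t0)  # slightly better early, maximally bad late

for delta in (0.9, 0.99, 0.999):
    print(delta, discounted_sum(s1, delta) > discounted_sum(s2, delta))
# At delta = 0.9 the small early advantage wins (s2 preferred); as delta -> 1
# the tail dominates and s1 is preferred, which is why taking the limit
# escapes the fixed-delta objection.
```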
The property you describe is known as "dictatorship of the present", and you can read more about it here. In order to get rid of this "dictatorship" you end up having to do things like reject stationarity, which is plausibly just as counterintuitive.
> I’m surprised that this is presented as a big advance in infinite ethics as people have certainly thought about this in economics, machine learning and ethics before.
Could you elaborate? The reason that I thought this was important was:
> Previous algorithms like the overtaking criterion had fairly “obvious” incomparable streams, with no real justification for why those streams would not be encountered by a decision-maker. LDU is not complete, but we at least have some reason to think that it may be all we “practically” need.
Are there other algorithms which you think are all we will “practically” need?