I’m not clear on whether it is rational or not. It seems like behavior we don’t want from a value learner, but I was curious how “inevitable” it is in attempts to mix updatelessness with value learning. (Perhaps it is a really simple point, but I still haven’t thought it entirely through.)
I have a recent result about value learning in UDT; it turns out to work very nicely and doesn’t suffer from the problem you describe.