I don’t need to calculate all that in order to make an expected-utility-maximizing lunch order. I just need to calculate the difference between the utility I expect if I order lamb Karahi and the utility I expect if I order a sisig burrito.
… and since my expectations for most of the world are the same under those two options, I should be able to calculate the difference lazily, without having to query most of my world model. Much like the message-passing update, I expect deltas to quickly fall off to zero as things propagate through the model.
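To make the lazy-difference idea concrete, here is a minimal sketch under strong simplifying assumptions: the world model is a small deterministic causal DAG rather than a probabilistic one, utility is a sum of per-node terms, and every name in it (`evaluate`, `lazy_utility_delta`, the toy nodes and their functions) is hypothetical and made up for illustration. Starting from the action node, it re-evaluates a node only if one of its parents changed, and stops propagating wherever the recomputed value matches the baseline.

```python
import heapq

def evaluate(order, parents, fn, action_node, action):
    """Full forward pass: compute every node in topological order."""
    values = {}
    for n in order:
        if n == action_node:
            values[n] = action
        else:
            values[n] = fn[n](*[values[p] for p in parents[n]])
    return values

def total_utility(values, utility):
    return sum(u(values[n]) for n, u in utility.items())

def lazy_utility_delta(order, parents, fn, utility, baseline, action_node, new_action):
    """Return (U(new_action) - U(baseline action), nodes re-evaluated)."""
    children = {n: [] for n in order}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)
    rank = {n: i for i, n in enumerate(order)}

    changed = {action_node: new_action}      # nodes whose value differs from baseline
    frontier = [(rank[c], c) for c in children[action_node]]
    heapq.heapify(frontier)                  # pop in topological order
    queued = set(children[action_node])
    evaluated = 0

    while frontier:
        _, n = heapq.heappop(frontier)
        evaluated += 1
        args = [changed.get(p, baseline[p]) for p in parents[n]]
        value = fn[n](*args)
        if value != baseline[n]:             # delta survives: keep propagating
            changed[n] = value
            for c in children[n]:
                if c not in queued:
                    queued.add(c)
                    heapq.heappush(frontier, (rank[c], c))
        # otherwise the delta is zero here and nothing downstream is touched

    delta = 0.0
    for n, v in changed.items():
        if n in utility:
            delta += utility[n](v) - utility[n](baseline[n])
    return delta, evaluated

# Toy world (all structure and numbers are invented for illustration).
order = ["lunch", "weather", "taste", "fullness", "energy", "traffic", "commute"]
parents = {
    "lunch": [], "weather": [],
    "taste": ["lunch"], "fullness": ["lunch"], "energy": ["fullness"],
    "traffic": ["weather"], "commute": ["traffic"],
}
fn = {
    "weather": lambda: "rain",
    "taste": lambda lunch: 8 if lunch == "karahi" else 6,
    "fullness": lambda lunch: 3,             # assumed: both meals equally filling
    "energy": lambda fullness: 10 - fullness,
    "traffic": lambda weather: 5 if weather == "rain" else 2,
    "commute": lambda traffic: -traffic,
}
utility = {"taste": float, "energy": float, "commute": float}

baseline = evaluate(order, parents, fn, "lunch", "burrito")
delta, evaluated = lazy_utility_delta(
    order, parents, fn, utility, baseline, "lunch", "karahi"
)

# Brute-force check: recompute the whole world for the other option.
full_delta = (
    total_utility(evaluate(order, parents, fn, "lunch", "karahi"), utility)
    - total_utility(baseline, utility)
)
print(delta, full_delta, evaluated, len(order))
```

In this toy, the weather, traffic, and commute nodes are never queried, and propagation stops at `fullness` because its value is the same under both orders. A probabilistic version would compare distributions rather than exact values, and would probably need a tolerance for deciding when a delta is "close enough to zero" to stop.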
This is an exciting observation. I wonder if you could empirically demonstrate that this works in a model-based RL setup, on a video game or something?