From my reading of quantilizers, they might still choose “near-optimal” actions, just only with a small probability. Whereas a system based on decision transformers (possibly combined with an LLM) could be designed so that we could simply tell it to “make me tea of this quantity and quality within this time and with this probability”, and it would attempt to do just that, without trying to make more or better tea, or to make it faster or with higher probability.
Yes, that is a thing you can do with decision transformers too. I was referring to a variant of the decision transformer (see link in the original shortform) where the AI samples the reward it’s aiming for.
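As a minimal sketch of the “just tell it the target” mode (the policy and environment interfaces here are made-up placeholders, not from the linked paper): every action is conditioned on a fixed, user-chosen return-to-go rather than on the best return the model can imagine.

```python
# Hypothetical sketch: a decision transformer conditioned on a *fixed*
# target return chosen by the user ("this much tea, no more").
# `policy` and `env` stand in for whatever model/environment is used;
# their interfaces are assumptions, not taken from the linked paper.

def rollout_with_fixed_target(policy, env, target_return, max_steps=1000):
    obs = env.reset()
    return_to_go = target_return      # user-specified, not model-sampled
    history = []                      # (return_to_go, obs, action) triples

    for _ in range(max_steps):
        action = policy.act(history, return_to_go, obs)
        obs, reward, done, info = env.step(action)
        return_to_go -= reward        # standard return-to-go bookkeeping
        history.append((return_to_go, obs, action))
        if done:
            break
    return history
```

The point being that the target is an input we pick, so nothing in the objective pushes it toward “more tea, faster”.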
You mean, in that you can simply prompt for a reasonable non-infinite performance and get said outcome?
Similar but not exactly.
I mean that you take some known distribution (the training distribution) as a starting point, but when sampling actions you do so from a shifted or truncated distribution, to favour higher-reward policies.
In the decision transformers I linked, the AI is playing a variety of different games, where the programmers might not know what a good future reward value would be. So they let the AI itself predict the future reward, but with the distribution shifted towards higher rewards.
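Roughly what I mean, as a sketch (the function names and the exact form of the shift are my own illustration, not necessarily what the linked paper does): the model predicts a distribution over future returns, and the target return is sampled from a tilted or truncated version of that prediction.

```python
# Illustrative sketch, not the linked paper's exact procedure: sample the
# return-to-go from the model's own predicted return distribution, but
# tilted toward higher returns (kappa) or truncated to the top-q mass.

import numpy as np

def sample_shifted_return(return_logits, return_bins, kappa=5.0, q=None):
    p = np.exp(return_logits - return_logits.max())
    p /= p.sum()                               # model's predicted return distribution

    if q is not None:
        # Truncation (quantilizer-flavoured): keep only the top-q probability mass.
        order = np.argsort(return_bins)[::-1]  # highest predicted returns first
        keep, mass = [], 0.0
        for i in order:
            keep.append(i)
            mass += p[i]
            if mass >= q:
                break
        shifted = np.zeros_like(p)
        shifted[keep] = p[keep]
    else:
        # Shift: exponentially up-weight higher returns, with no hard cutoff.
        shifted = p * np.exp(kappa * return_bins)

    shifted /= shifted.sum()
    return np.random.choice(return_bins, p=shifted)
```

The sampled target then conditions the action head exactly as in the fixed-target case above.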
I discussed this a bit more after posting the above comment, and there is something I want to add about the comparison.
With quantilizers, if you know the probability of DOOM under the base distribution, you get an upper bound on the probability of DOOM for the quantilizer. This is not the case for the type of probability shift used in the linked decision transformer.
DOOM = an unforeseen catastrophic outcome: one that would not be labelled as very bad by the AI’s reward function but is in reality VERY BAD.
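To spell out the asymmetry (my own back-of-the-envelope, not from the linked post): a q-quantilizer samples, proportionally to the base distribution, from the top q fraction of actions, so no event can have its probability inflated by more than a factor of 1/q. A shift toward higher predicted reward (taken here to be an exponential tilt, which is an assumption about its form) has no analogous distribution-free bound, and that matters precisely because DOOM is defined as something the reward model does not flag as bad.

```latex
% q-quantilizer: sample proportionally to the base distribution p,
% restricted to the top-q fraction of actions. For any event, e.g. DOOM:
\[
  P_{\text{quant}}(\mathrm{DOOM})
  = \frac{p(\mathrm{DOOM} \cap \text{top-}q)}{q}
  \le \frac{p(\mathrm{DOOM})}{q}.
\]
% Reward-shifted sampling (one plausible form of the shift, assumed here):
% weight each action by an exponential in its *predicted* reward R(a).
\[
  P_{\text{shift}}(a) \propto p(a)\, e^{\kappa R(a)}
  \quad\Longrightarrow\quad
  \frac{P_{\text{shift}}(a)}{p(a)}
  = \frac{e^{\kappa R(a)}}{\mathbb{E}_{p}\!\left[e^{\kappa R}\right]},
\]
% which can be made arbitrarily large for high-R(a) actions by increasing
% kappa; so if a DOOM action happens to score a high predicted reward,
% nothing caps how much it gets up-weighted relative to the base distribution.
```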