From my reading of quantilizers, they might still choose “near-optimal” actions, just only with a small probability. Whereas a system based on decision transformers (possibly combined with an LLM) could be designed so that we could simply tell it to “make me tea of this quantity and quality within this time and with this probability”, and it would attempt to do just that, without trying to make more or better tea, or to make it faster or with higher probability.
Yes, that is a thing you can do with decision transformers too. I was referring to a variant of the decision transformer (see link in the original shortform) where the AI samples the reward it’s aiming for.
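As a minimal sketch of the “just tell it the target” mode (the policy and environment interfaces here are made-up placeholders, not from the linked paper): every action is conditioned on a fixed, user-chosen return-to-go rather than on the best return the model can imagine.

```python
# Hypothetical sketch: a decision transformer conditioned on a *fixed*
# target return chosen by the user ("this much tea, no more").
# `policy` and `env` stand in for whatever model/environment is used;
# their interfaces are assumptions, not taken from the linked paper.

def rollout_with_fixed_target(policy, env, target_return, max_steps=1000):
    obs = env.reset()
    return_to_go = target_return      # user-specified, not model-sampled
    history = []                      # (return_to_go, obs, action) triples

    for _ in range(max_steps):
        action = policy.act(history, return_to_go, obs)
        obs, reward, done, info = env.step(action)
        return_to_go -= reward        # standard return-to-go bookkeeping
        history.append((return_to_go, obs, action))
        if done:
            break
    return history
```

The point being that the target is an input we pick, so nothing in the objective pushes it toward “more tea, faster”.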
You mean, in that you can simply prompt for a reasonable non-infinite performance and get said outcome?
Similar but not exactly.
I mean that you take some known distribution (the training distribution) as a starting point, but when sampling actions you do so from a shifted or truncated distribution, to favour higher-reward policies.
In the decision transformers I linked, the AI is playing a variety of different games, where the programmers might not know what a good future reward value would be. So they let the AI itself predict the future reward, but with the distribution shifted towards higher rewards.
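Roughly what I mean, as a sketch (the function names and the exact form of the shift are my own illustration, not necessarily what the linked paper does): the model predicts a distribution over future returns, and the target return is sampled from a tilted or truncated version of that prediction.

```python
# Illustrative sketch, not the linked paper's exact procedure: sample the
# return-to-go from the model's own predicted return distribution, but
# tilted toward higher returns (kappa) or truncated to the top-q mass.

import numpy as np

def sample_shifted_return(return_logits, return_bins, kappa=5.0, q=None):
    p = np.exp(return_logits - return_logits.max())
    p /= p.sum()                               # model's predicted return distribution

    if q is not None:
        # Truncation (quantilizer-flavoured): keep only the top-q probability mass.
        order = np.argsort(return_bins)[::-1]  # highest predicted returns first
        keep, mass = [], 0.0
        for i in order:
            keep.append(i)
            mass += p[i]
            if mass >= q:
                break
        shifted = np.zeros_like(p)
        shifted[keep] = p[keep]
    else:
        # Shift: exponentially up-weight higher returns, with no hard cutoff.
        shifted = p * np.exp(kappa * return_bins)

    shifted /= shifted.sum()
    return np.random.choice(return_bins, p=shifted)
```

The sampled target then conditions the action head exactly as in the fixed-target case above.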
I discussed this a bit more after posting the above comment, and there is something I want to add about the comparison.
With quantilizers, if you know the probability of DOOM under the base distribution, you get an upper bound on the probability of DOOM for the quantilizer. This is not the case for the type of probability shift used in the linked decision transformer.
DOOM = an unforeseen catastrophic outcome: one that would not be labelled as very bad by the AI’s reward function but is in reality VERY BAD.
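To spell out the asymmetry (my own back-of-the-envelope, not from the linked post): a q-quantilizer samples, proportionally to the base distribution, from the top q fraction of actions, so no event can have its probability inflated by more than a factor of 1/q. A shift toward higher predicted reward (taken here to be an exponential tilt, which is an assumption about its form) has no analogous distribution-free bound, and that matters precisely because DOOM is defined as something the reward model does not flag as bad.

```latex
% q-quantilizer: sample proportionally to the base distribution p,
% restricted to the top-q fraction of actions. For any event, e.g. DOOM:
\[
  P_{\text{quant}}(\mathrm{DOOM})
  = \frac{p(\mathrm{DOOM} \cap \text{top-}q)}{q}
  \le \frac{p(\mathrm{DOOM})}{q}.
\]
% Reward-shifted sampling (one plausible form of the shift, assumed here):
% weight each action by an exponential in its *predicted* reward R(a).
\[
  P_{\text{shift}}(a) \propto p(a)\, e^{\kappa R(a)}
  \quad\Longrightarrow\quad
  \frac{P_{\text{shift}}(a)}{p(a)}
  = \frac{e^{\kappa R(a)}}{\mathbb{E}_{p}\!\left[e^{\kappa R}\right]},
\]
% which can be made arbitrarily large for high-R(a) actions by increasing
% kappa; so if a DOOM action happens to score a high predicted reward,
% nothing caps how much it gets up-weighted relative to the base distribution.
```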