Most of this makes sense (or perhaps more accurately, sounds like it might be true, but there’s a good chance that if I reread the post and all the comments I’d object again / get confused somehow). One thing though:
Every piece of feedback gets put into the same big pool which helps define Hv, the initial (“human”) value function. [...]
Okay, I think with this elaboration I stand by what I originally said:
It seemed to me like since the first few bits of feedback determine how the system interprets all future feedback, it’s particularly important for those first few bits to be correct and not lock in e.g. a policy that ignores all future feedback.
Specifically, isn’t it the case that the first few bits of feedback determine D1, which might then lock in some bad way of interpreting feedback (whether existing or future feedback)?
Okay, I think with this elaboration I stand by what I originally said
You mean with respect to the system as described in the post (in which case I 100% agree), or the modified system which restarts training upon new feedback (which is what I was just describing)?
Because I think this claim is pretty solidly wrong as a description of the system that restarts.
Specifically, isn’t it the case that the first few bits of feedback determine D1, which might then lock in some bad way of interpreting feedback (whether existing or future feedback)?
All feedback so far determines the new D1 when the system restarts training.
(Again, I’m not saying it’s feasible to restart training all the time, I’m just using it as a proof-of-concept to show that we’re not fundamentally forced to make a trade-off between (a) order independence and (b) using the best model to interpret feedback.)
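To make that proof-of-concept a bit more concrete, here is a minimal sketch of the restart variant in Python. Everything in it is illustrative: FeedbackItem, train_from_scratch, and the averaging "training" step are hypothetical stand-ins for whatever actually fits Hv / D1 from the feedback pool, not the post's algorithm.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FeedbackItem:
    # Hypothetical structure: a situation plus a human rating of it.
    situation: str
    rating: float


ValueFunction = Callable[[str], float]


def train_from_scratch(pool: List[FeedbackItem]) -> ValueFunction:
    """Stand-in for learning Hv (and hence D1) from the whole feedback pool.

    Here it just averages ratings per situation; the real system would fit a
    model. The key property is that it only ever sees the pool as a whole.
    """
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for item in pool:
        totals[item.situation] = totals.get(item.situation, 0.0) + item.rating
        counts[item.situation] = counts.get(item.situation, 0) + 1
    return lambda s: totals[s] / counts[s] if s in counts else 0.0


def run(feedback_stream: List[FeedbackItem]) -> ValueFunction:
    pool: List[FeedbackItem] = []
    value_fn: ValueFunction = lambda s: 0.0
    for item in feedback_stream:
        pool.append(item)
        # "Restart training upon new feedback": recompute everything from the
        # full pool. Sorting the pool into a canonical order before training
        # makes the result depend only on *which* feedback has arrived so far,
        # not on the order it arrived in, so the first few bits of feedback
        # get no special role in fixing how later feedback is interpreted.
        canonical_pool = sorted(pool, key=lambda f: (f.situation, f.rating))
        value_fn = train_from_scratch(canonical_pool)
    return value_fn


if __name__ == "__main__":
    # Two arrival orders of the same feedback yield the same value function.
    a = FeedbackItem("tell the truth", 1.0)
    b = FeedbackItem("ignore future feedback", -1.0)
    v1 = run([a, b])
    v2 = run([b, a])
    assert v1("ignore future feedback") == v2("ignore future feedback") == -1.0
```

The point is just that because training only ever looks at the (canonically ordered) pool, the D1 produced after each restart depends on all feedback received so far, not on which bits happened to arrive first.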
I continue to not understand this, but it seems like such a simple question that there must be some deeper misunderstanding of the exact proposal we’re now debating. It seems not particularly worth it to track down this misunderstanding; I don’t think it will really teach us anything conceptually new.
(If I did want to find it, I would write out pseudocode for the new proposed system and then try to make a more precise claim in terms of the variables in the pseudocode.)
Fair.