Most of this makes sense (or perhaps more accurately, sounds like it might be true, but there’s a good chance that if I reread the post and all the comments I’d object again / get confused somehow). One thing though:
Every piece of feedback gets put into the same big pool which helps define Hv, the initial (“human”) value function. [...]
Okay, I think with this elaboration I stand by what I originally said:
It seemed to me like since the first few bits of feedback determine how the system interprets all future feedback, it’s particularly important for those first few bits to be correct and not lock in e.g. a policy that ignores all future feedback.
Specifically, isn’t it the case that the first few bits of feedback determine D1, which might then lock in some bad way of interpreting feedback (whether existing or future feedback)?
Okay, I think with this elaboration I stand by what I originally said
You mean with respect to the system as described in the post (in which case I 100% agree), or the modified system which restarts training upon new feedback (which is what I was just describing)?
Because I think this claim is pretty solidly wrong as a description of the system that restarts.
Specifically, isn’t it the case that the first few bits of feedback determine D1, which might then lock in some bad way of interpreting feedback (whether existing or future feedback)?
All feedback so far determines the new D1 when the system restarts training.
(Again, I’m not saying it’s feasible to restart training all the time, I’m just using it as a proof-of-concept to show that we’re not fundamentally forced to make a trade-off between (a) order independence and (b) using the best model to interpret feedback.)
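To make that proof-of-concept a bit more concrete, here is a minimal sketch of the restart variant in Python. Everything in it is illustrative: FeedbackItem, train_from_scratch, and the averaging "training" step are hypothetical stand-ins for whatever actually fits Hv / D1 from the feedback pool, not the post's algorithm.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FeedbackItem:
    # Hypothetical structure: a situation plus a human rating of it.
    situation: str
    rating: float


ValueFunction = Callable[[str], float]


def train_from_scratch(pool: List[FeedbackItem]) -> ValueFunction:
    """Stand-in for learning Hv (and hence D1) from the whole feedback pool.

    Here it just averages ratings per situation; the real system would fit a
    model. The key property is that it only ever sees the pool as a whole.
    """
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for item in pool:
        totals[item.situation] = totals.get(item.situation, 0.0) + item.rating
        counts[item.situation] = counts.get(item.situation, 0) + 1
    return lambda s: totals[s] / counts[s] if s in counts else 0.0


def run(feedback_stream: List[FeedbackItem]) -> ValueFunction:
    pool: List[FeedbackItem] = []
    value_fn: ValueFunction = lambda s: 0.0
    for item in feedback_stream:
        pool.append(item)
        # "Restart training upon new feedback": recompute everything from the
        # full pool. Sorting the pool into a canonical order before training
        # makes the result depend only on *which* feedback has arrived so far,
        # not on the order it arrived in, so the first few bits of feedback
        # get no special role in fixing how later feedback is interpreted.
        canonical_pool = sorted(pool, key=lambda f: (f.situation, f.rating))
        value_fn = train_from_scratch(canonical_pool)
    return value_fn


if __name__ == "__main__":
    # Two arrival orders of the same feedback yield the same value function.
    a = FeedbackItem("tell the truth", 1.0)
    b = FeedbackItem("ignore future feedback", -1.0)
    v1 = run([a, b])
    v2 = run([b, a])
    assert v1("ignore future feedback") == v2("ignore future feedback") == -1.0
```

The point is just that because training only ever looks at the (canonically ordered) pool, the D1 produced after each restart depends on all feedback received so far, not on which bits happened to arrive first.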
I continue to not understand this, but it seems like such a simple question that there must be some deeper misunderstanding of the exact proposal we’re now debating. It seems not particularly worth it to track down this misunderstanding; I don’t think it will really teach us anything conceptually new.
(If I did want to find it, I would write out pseudocode for the new proposed system and then try to make a more precise claim in terms of the variables in the pseudocode.)
Fair.