would probably eventually stop treating you as a source of new information once it had learned a lot from you, at which point it would stop being deferential.
It seems that it would remain deferential 1) when extrapolating to new situations, 2) if you add a term that decays the relevance of old information (pretty standard in RL), or 3) if you impose a minimum bound on its uncertainty.
In other words, it doesn’t seem like an unsolvable problem, just an open question. But every other alignment agenda also has numerous open questions, so why the hostility?
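To make (2) and (3) concrete, here’s a toy sketch of the idea, not anyone’s actual implementation: a Bayesian belief over a handful of candidate reward hypotheses, where old evidence is tempered toward the prior and no hypothesis is ever allowed to drop below a floor. The names (`DECAY`, `MIN_UNCERTAINTY`, `should_defer`) and the specific numbers are my own illustrative assumptions.

```python
import numpy as np

# Toy sketch: belief over K candidate reward hypotheses the human might hold.
# Two tweaks keep the agent deferential: (1) decay old evidence, (2) floor the uncertainty.

K = 3
belief = np.full(K, 1.0 / K)   # uniform prior over reward hypotheses
DECAY = 0.99                   # tempering exponent: <1 flattens the posterior, "forgetting" old data
MIN_UNCERTAINTY = 0.1          # no hypothesis ever falls below this probability

def update(belief, likelihoods):
    """One Bayesian update with forgetting and an uncertainty floor."""
    # Decay: temper the accumulated posterior back toward uniform.
    decayed = belief ** DECAY
    decayed /= decayed.sum()
    # Standard Bayes step with the new observation's likelihoods.
    posterior = decayed * likelihoods
    posterior /= posterior.sum()
    # Floor: keep residual probability on every hypothesis, then renormalise,
    # so the agent can never become (near-)certain about the human's values.
    posterior = np.maximum(posterior, MIN_UNCERTAINTY)
    return posterior / posterior.sum()

def should_defer(belief, threshold=0.9):
    """Defer to the human unless one hypothesis is nearly certain —
    which the floor above prevents from ever happening."""
    return belief.max() < threshold

# Even after many observations all favouring hypothesis 0, residual
# uncertainty remains and the agent keeps deferring.
for _ in range(1000):
    belief = update(belief, likelihoods=np.array([0.9, 0.05, 0.05]))
print(belief, should_defer(belief))  # max stays around ~0.83, so it still defers
```

The point of the toy is just that either mechanism on its own is enough to stop the posterior from collapsing, which is the failure mode behind “it stops being deferential once it has learned a lot from you.”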
Academia and LessWrong are two different communities, with different cultures and jargon. I think they can be overly skeptical of each other’s work at times.
It’s worth noting, though, that many of the nice deferential properties may also appear in other value-modelling techniques (like recursive reward modelling, from DeepMind).