Aren’t “modularly varying reward functions” exactly what D𝜋’s Self-Organised Neural Networks accomplish? Each example in the training data is a module of the reward function. By learning only on the training examples that are currently hardest for the network, we make those examples easier and thus implicitly swap them out of the “examples currently hardest for the network” set.
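To make the dynamic concrete, here is a minimal toy sketch (not D𝜋’s actual SONN implementation; the model and all names are illustrative assumptions): a logistic-regression learner that updates only on the k examples it currently gets most wrong, so the “hardest” subset it trains on rotates on its own.

```python
import numpy as np

# Toy illustration of "train only on the currently hardest examples".
# This is an assumed stand-in model, not D𝜋's SONN code.

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 16))         # synthetic inputs
w_true = rng.normal(size=16)
y = (X @ w_true > 0).astype(float)     # synthetic binary labels

w = np.zeros(16)                       # learner's weights

def per_example_loss(w, X, y):
    """Cross-entropy loss of a logistic model, one value per example."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    eps = 1e-12
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

for step in range(200):
    losses = per_example_loss(w, X, y)
    hard = np.argsort(losses)[-32:]    # the 32 currently-hardest examples
    Xh, yh = X[hard], y[hard]
    p = 1.0 / (1.0 + np.exp(-(Xh @ w)))
    grad = Xh.T @ (p - yh) / len(yh)   # logistic-loss gradient
    w -= 0.5 * grad                    # update only on the hard subset
    # These examples now score lower loss, so a different subset becomes
    # "hardest" next step: the effective reward module is swapped out
    # implicitly, with no explicit schedule.

print("mean loss:", per_example_loss(w, X, y).mean())
```

No explicit schedule ever swaps modules in or out, yet the subset defining each update keeps changing, which is the sense in which the selection itself does the modular variation.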