Oliver Sourbut comments on Reward is not the optimization target

Oliver Sourbut 30 Sep 2022 15:46 UTC
1 point
“smarter exploration strategies will depend on the agent’s value function”

I think this is plausible but overconfident.

FWIW I think with moderate confidence that smarter exploration strategies are fundamental to advanced agency: I think of things like play, ‘deliberate exploration’, experiment design, goal-backchaining and so on. This is mainly because epsilon-greedy exploration is scuppered for sparse rewards and real-world dynamics are super-duper highly-branching.
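To make the sparse-reward point a bit more concrete, here is a minimal sketch. It assumes a hypothetical 'combination lock' task (my own illustration, not something from the post): reward arrives only if the agent picks the single correct action at each of `depth` consecutive steps, and the agent explores uniformly at random. The names `success_probability` and `simulate` are made up for this example.

```python
import random


def success_probability(num_actions: int, depth: int) -> float:
    """Chance that uniformly random actions hit the one correct action
    at every one of `depth` consecutive steps."""
    return (1.0 / num_actions) ** depth


def simulate(num_actions: int, depth: int, episodes: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(episodes):
        # Action 0 is arbitrarily treated as the correct one at every step;
        # the episode counts as a success only if every step picks it.
        if all(rng.randrange(num_actions) == 0 for _ in range(depth)):
            successes += 1
    return successes / episodes


if __name__ == "__main__":
    for depth in (5, 10, 20):
        p = success_probability(num_actions=4, depth=depth)
        print(f"depth={depth}: p={p:.3e}, expected episodes to first reward={1 / p:.3e}")
```

The success chance shrinks geometrically with depth and branching factor, so under these assumptions undirected exploration almost never sees a reward to learn from.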

I also think we’ve barely scratched the surface of understanding exploration, though there are some interesting directions like EMPA [1], VariBAD [2], HER [3], and older stuff like pseudocount-based and prediction-error-based ‘curiosity’.
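For a flavour of the count-based idea, here is a rough tabular sketch. It is a simplification I am assuming for illustration: real pseudocount methods estimate counts from a density model rather than a lookup table, and the class name `CountBonus` and the `beta / sqrt(n)` form are illustrative choices rather than any specific paper's method.

```python
import math
from collections import defaultdict


class CountBonus:
    """Tabular exploration bonus that decays with state visit counts."""

    def __init__(self, beta: float = 1.0):
        self.beta = beta  # scale of the bonus (hypothetical hyperparameter)
        self.counts = defaultdict(int)  # visits per (hashable) state

    def bonus(self, state) -> float:
        # Record the visit, then return a bonus proportional to 1/sqrt(n),
        # so novel states are worth more than familiar ones.
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])
```

In use, the bonus would be added to the environment reward, e.g. `shaped = reward + explorer.bonus(state)`, so the agent gets a dense signal for visiting rarely seen states even when the task reward is zero.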

If humans (and/or supervised speedups of humans or similar) can provide dense signals, this claim is weaker, but I think the key problem for AGI learning is out-of-distribution (OOD) dense signals, and I don’t think humans are capable of providing safe/accurate OOD dense reward/value signals.
1. Tsividis et al., Human-Level Reinforcement Learning through Theory-Based Modeling, Exploration, and Planning
2. Zintgraf et al., VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning
3. Andrychowicz et al., Hindsight Experience Replay