smarter exploration strategies will depend on the agent’s value function
I think this is plausible but overconfident.
FWIW I think, with moderate confidence, that smarter exploration strategies are fundamental to advanced agency: I think of things like play, ‘deliberate exploration’, experiment design, goal-backchaining, and so on. This is mainly because epsilon-greedy exploration is scuppered by sparse rewards, and real-world dynamics are extremely highly branching.
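To put rough numbers on the sparse-reward point, here is a toy sketch. It assumes (my simplification, not a claim about any real environment) a tree-shaped environment with a fixed branching factor, exactly one rewarded leaf at a fixed depth, and an agent that picks actions uniformly at random. The function names are made up for illustration.

```python
from fractions import Fraction


def hit_probability(branching: int, depth: int) -> Fraction:
    """Chance that one uniformly random episode reaches the single rewarded leaf.

    Toy assumption: each of `depth` steps offers `branching` actions and
    exactly one action per step stays on the path to the reward.
    """
    return Fraction(1, branching ** depth)


def expected_episodes(branching: int, depth: int) -> Fraction:
    """Mean number of episodes until the first reward (geometric distribution)."""
    return 1 / hit_probability(branching, depth)


if __name__ == "__main__":
    for branching, depth in [(2, 10), (4, 10), (10, 10)]:
        print(branching, depth, expected_episodes(branching, depth))
```

Under these assumptions the expected number of episodes before the first reward is the branching factor raised to the depth, so undirected exploration stops being viable quickly as either grows.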
I also think we’ve barely scratched the surface of understanding exploration, though there are some interesting directions like EMPA[1], VariBAD[2], HER[3], and older stuff like pseudocount-based and prediction-error-based ‘curiosity’.
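For a flavour of the count-based family, here is a minimal tabular sketch. It uses exact visit counts with a bonus of beta / sqrt(N(s)), which is my simplifying assumption: the pseudocount methods proper estimate counts from a density model so they can work in large state spaces. The class name and `beta` parameter are made up for illustration.

```python
import math
from collections import Counter


class CountBonus:
    """Tabular novelty bonus that shrinks as a state is visited more often."""

    def __init__(self, beta: float = 1.0):
        self.beta = beta          # hypothetical scale for the bonus
        self.counts = Counter()   # exact visit counts per (hashable) state

    def bonus(self, state) -> float:
        """Record a visit to `state` and return beta / sqrt(visit count)."""
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])


if __name__ == "__main__":
    explorer = CountBonus(beta=1.0)
    for _ in range(4):
        print(explorer.bonus("s0"))
```

The bonus would be added to the environment reward, so rarely visited states look temporarily more attractive. Note this is still a long way from the ‘deliberate exploration’ I have in mind above.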
If humans (and/or supervised speedups of humans, or similar) can provide dense signals, this claim is weaker. But I think the key problem for AGI learning is getting dense signals out of distribution (OOD), and I don’t think humans are capable of providing safe and accurate OOD dense reward/value signals.
[1] Tsividis et al., Human-Level Reinforcement Learning through Theory-Based Modeling, Exploration, and Planning
[2] Zintgraf et al., VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning
[3] Andrychowicz et al., Hindsight Experience Replay