+1 on this comment; I feel pretty confused about the excerpt from Paul that Steve quoted above. And even without the agent deliberately deciding where to avoid exploring, incomplete exploration may lead to agents that learn non-reward goals before convergence. So if Paul's statement is intended to refer to optimal policies, I'd be curious why he thinks that's the most important case to focus on.