I’m not even sure whether you are now closer to or further from understanding what I meant.
:(
I can only assume that you are confused about why I would have set things up the way I did in the post if this was my point, since I didn’t end up talking much about directly learning the policies.
My assumption was that you were arguing for why learning policies directly (assuming we could do it) has advantages over the default approach of value learning + optimization. That framing seems to explain most of the post.
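For concreteness, here is a minimal sketch of the contrast as I read it. Everything here (the toy world, the tables, the function names) is my own illustration, not anything from the post: the difference is whether the learned object is a value estimate that behavior is then optimized against, or the behavior itself.

```python
# Hypothetical sketch of the two approaches in a tiny 1-D world
# (states 0..4). All names and numbers here are illustrative.

ACTIONS = ["left", "right"]

def transition(state: int, action: str) -> int:
    """Deterministic dynamics: move one step left or right, clipped to [0, 4]."""
    return max(0, min(4, state + (1 if action == "right" else -1)))

# (a) Value learning + optimization: a learned value estimate per state,
#     with behavior produced by optimizing against it at decision time.
value_estimate = {0: 0.0, 1: 0.2, 2: 0.5, 3: 0.8, 4: 1.0}

def act_via_value(state: int) -> str:
    # Pick the action whose successor state has the highest learned value.
    return max(ACTIONS, key=lambda a: value_estimate[transition(state, a)])

# (b) Direct policy learning: a learned state -> action map
#     (as in behavioral cloning), with no separate optimization step.
learned_policy = {0: "right", 1: "right", 2: "right", 3: "right", 4: "right"}

def act_via_policy(state: int) -> str:
    return learned_policy[state]

# In this toy case the two produce the same action, but only (a) involves
# an explicit optimization over the learned quantity.
assert act_via_value(2) == act_via_policy(2) == "right"
```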