On a cross-country train, so expect delays and brevity for the next several days. This comment is just learning resources; I'll reply to the other stuff later.
A good textbook, although very formal and slightly incomplete, is Sutton and Barto: http://incompleteideas.net/book/the-book-2nd.html . Fun fact: the first author is responsible for perhaps the most terrifying AI tweet of all time: https://twitter.com/RichardSSutton/status/1575619651563708418 . If you want something friendlier than that, I'm not entirely sure what the best resource is, but I can look around.
Another good resource is Steven Byrnes' LessWrong sequence on brain-like AGI; it sounds like you already know the neuroscience, but seeing it described by a computer scientist might help you get some grounding, since you'll see material you already know explained in RL terms.
Deep RL gets fairly technical pretty quickly; probably the most useful algorithms to understand are Q-learning and REINFORCE, because most modern work uses PPO, which is a couple of nice hacks on top of REINFORCE. One good way to tame the complexity is to recognize that, fundamentally, deep RL is about doing RL in a setting where the state space is too large to enumerate, so you must use a function approximator. So the two things you need to understand about any given algorithm are what it looks like on a small finite MDP (Markov decision process), and what the function approximator looks like. (This slightly glosses over continuous control problems, which are not reducible to a finite MDP, but I stand by it as a principle for learning.)
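To make the "small finite MDP first" principle concrete, here is a minimal sketch of tabular REINFORCE with a softmax policy on a made-up two-state MDP (the environment, hyperparameters, and episode length are all invented for illustration; a real deep RL version would swap the logit table for a neural network):

```python
# Minimal sketch of REINFORCE with a tabular softmax policy on a made-up
# two-state MDP (this toy environment is invented purely for illustration).
import numpy as np

n_states, n_actions = 2, 2
rng = np.random.default_rng(0)

# Toy dynamics: action 1 in state 0 moves to state 1 and pays off; everything
# else returns you to state 0 with no reward. Episodes last a fixed 10 steps.
def step(state, action):
    if state == 0 and action == 1:
        return 1, 1.0
    return 0, 0.0

logits = np.zeros((n_states, n_actions))  # policy parameters: one logit per state-action
alpha, gamma = 0.1, 0.99

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    # Sample one trajectory under the current policy.
    state, traj = 0, []
    for _ in range(10):
        probs = softmax(logits[state])
        action = rng.choice(n_actions, p=probs)
        next_state, reward = step(state, action)
        traj.append((state, action, reward))
        state = next_state

    # Compute discounted returns-to-go and take a gradient step on each
    # log pi(a|s), weighted by the return from that point onward.
    G = 0.0
    for state, action, reward in reversed(traj):
        G = reward + gamma * G
        probs = softmax(logits[state])
        grad_logp = -probs
        grad_logp[action] += 1.0          # gradient of log softmax(logits)[action]
        logits[state] += alpha * G * grad_logp

print(softmax(logits[0]))  # should put most of the probability on action 1 in state 0
```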
The Q function looks a lot like the circuitry of the basal ganglia (this is covered in more depth in Steven Byrnes' posts), although in reality the basal ganglia are considerably smarter, more like what are called generalized Q functions.
A good project (if you are a project-based learner) might be to implement a tabular Q-learner on the Taxi gym environment; this is quite straightforward, and is basically the same math as deep Q-networks, just in the finite MDP setting (there's a rough sketch below). It would also expose you to how punishingly complex it is to implement even simple RL algorithms in practice; for instance, I think optimistic initialization is crucial to good tabular Q-learning, and it can easily get left out of introductions.
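Here is a rough sketch of what that project could look like, assuming the gymnasium package (the maintained successor to gym) and its Taxi-v3 environment; the hyperparameters and the optimistic initial value of 20 are guesses for illustration, not tuned settings:

```python
# Rough sketch of a tabular Q-learner on Taxi, assuming the gymnasium package
# and its "Taxi-v3" environment; hyperparameters are illustrative, not tuned.
import numpy as np
import gymnasium as gym

env = gym.make("Taxi-v3")
n_states = env.observation_space.n
n_actions = env.action_space.n

# Optimistic initialization: start every Q value high, so the agent is
# driven to try state-action pairs it hasn't visited yet.
Q = np.full((n_states, n_actions), 20.0)

alpha, gamma, epsilon = 0.1, 0.99, 0.1
rng = np.random.default_rng(0)

for _ in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection on top of the optimistic Q table.
        if rng.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Standard Q-learning update: bootstrap from the best next action,
        # but only when the episode hasn't terminated.
        target = reward + gamma * (0.0 if terminated else np.max(Q[next_state]))
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

print("greedy action for state 0:", int(np.argmax(Q[0])))
```

(Deep Q-networks replace the table with a network and add some stabilizing machinery, but the bootstrapped target is the same.)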
One important distinction is between model-free and model-based RL. Everything listed above is model-free, while human and smarter-animal cognition seems to include substantial model-based components. In model-based RL, you try to represent the structure of the MDP itself rather than just learning how to navigate it. MuZero is a good state-of-the-art algorithm; the finite MDP version is basically a more complex version of Baum-Welch, together with dynamic programming to generate optimal trajectories once you know the MDP.
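As a sketch of just the dynamic-programming half of that picture, here is value iteration on a tiny hand-specified finite MDP (the three-state MDP below is made up for illustration, and the model-learning step, the Baum-Welch-like part, is assumed to have already happened):

```python
# Value iteration on a finite MDP whose transition and reward tables are
# already known; the tiny 3-state MDP below is invented for illustration.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.95

# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.9, 0.1, 0.0]
P[0, 1] = [0.1, 0.8, 0.1]
P[1, 0] = [0.0, 0.9, 0.1]
P[1, 1] = [0.0, 0.1, 0.9]
P[2, :] = [0.0, 0.0, 1.0]        # state 2 is absorbing
R = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

# Value iteration: repeatedly apply the Bellman optimality backup until the
# value function stops changing.
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * (P @ V)          # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)            # greedy policy; optimal trajectories just follow it
print("values:", V, "actions:", policy)
```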
A good LessWrong post to read is "Models Don't Get Reward". It points out a bunch of conceptual errors people make when they think of current RL too analogously to animals.
Thank you so much for writing this out! Will probably have a bunch of follow-up questions when I dig deeper; already very grateful.