In some cases, RL-based finetuning is the right tool for eliciting a particular behavior. Especially when the model is supposed to show agentic behavior, RL-based finetuning seems preferable over SFT.
Therefore, it is helpful to be familiar with applying RL to the type of model you’re evaluating, e.g. LLMs. Useful knowledge includes how to set up a pipeline for training well-working reward models for LLMs, or how to do “fake RL” that skips training a reward model and replaces it with, e.g., a well-prompted LLM acting as the judge (see the sketch below).
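As a rough illustration (not from any specific setup I’m recommending), here is a minimal sketch of the “fake RL” reward signal: a well-prompted judge LLM scores each completion, and that scalar score stands in for a trained reward model. The judge model name, the prompt wording, and the 0–10 scale are all placeholder assumptions.

```python
# Minimal sketch of a "fake RL" reward signal: instead of training a reward
# model, a well-prompted judge LLM assigns a scalar score to each completion.
# The judge model name and prompt wording below are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are grading a model's response for how well it exhibits the target "
    "behavior. Reply with a single number from 0 to 10 and nothing else."
)

def judge_reward(prompt: str, completion: str, judge_model: str = "gpt-4o-mini") -> float:
    """Score one (prompt, completion) pair with a prompted judge LLM."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Prompt:\n{prompt}\n\nResponse:\n{completion}"},
        ],
        temperature=0.0,
    )
    text = (response.choices[0].message.content or "").strip()
    try:
        return float(text) / 10.0  # normalize to [0, 1] for use as an RL reward
    except ValueError:
        return 0.0  # fall back if the judge doesn't return a bare number
```

Scores like this can then be fed to any policy-gradient trainer (e.g. PPO) wherever it expects the scalar output of a learned reward model.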
Deep RL is black magic and should be your last resort. But sometimes you have to use the dark arts. In which case, I’ll pass on a recommendation for this RL course from Roon of OpenAI and Twitter fame. Roon claimed that he couldn’t get deep RL to actually work until he watched this video/series. Then it started making sense.