"we’ll elide all of the subtle difficulties involved in actually getting RL to work in practice"
I haven’t properly internalized the rest of the post, but this confuses me because I thought this post was about the subtle difficulties.
The RL setup itself is straightforward, right? An MDP where S is the space of strings, A is the set of strings of fewer than n tokens, the transition is deterministic with s' = append(s, a), and reward is given to states with a stop token based on some ground-truth verifier like unit tests or formal verification.
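To make that setup concrete, here is a minimal Python sketch of the MDP as described. It is only an illustration under stated assumptions: the names `STOP`, `step`, and `toy_verifier` are hypothetical, states are represented as tuples of tokens rather than raw strings, an episode is treated as finished when the last token is the stop token, and the verifier is a toy stand-in for unit tests or formal verification. Nothing here is claimed to match what any lab actually does.

```python
from typing import Callable, Tuple

State = Tuple[str, ...]   # a string, represented as a tuple of tokens
Action = Tuple[str, ...]  # a chunk of fewer than n tokens

STOP = "<stop>"  # hypothetical stop token


def step(
    state: State,
    action: Action,
    n: int,
    verifier: Callable[[State], float],
) -> Tuple[State, float, bool]:
    """One deterministic transition: s' = append(s, a).

    Reward is zero everywhere except at states that end in the stop
    token, where the ground-truth verifier scores the whole string.
    """
    if not 0 < len(action) < n:
        raise ValueError("action must contain between 1 and n-1 tokens")
    next_state = state + action               # s' = append(s, a)
    done = next_state[-1] == STOP             # assumption: stop token ends the episode
    reward = verifier(next_state) if done else 0.0
    return next_state, reward, done


def toy_verifier(state: State) -> float:
    """Toy stand-in for unit tests: the final answer token must be '4'."""
    return 1.0 if len(state) >= 2 and state[-2] == "4" else 0.0


# Usage: answer the prompt, then emit the stop token.
s0: State = ("2+2=",)
s1, r1, d1 = step(s0, ("4",), n=3, verifier=toy_verifier)
s2, r2, d2 = step(s1, (STOP,), n=3, verifier=toy_verifier)
```

In this sketch the reward is sparse and outcome-based: intermediate steps get nothing, and only the verifier's judgment of the finished string matters.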
I agree that this is the most straightforward interpretation, but OpenAI have made no commitment to sticking to honest and straightforward interpretations. So I don’t think the RL setup is actually that straightforward.
If you want more technical detail, I recommend watching the Rush & Ritter talk (see also slides and bibliography). This post was meant as a high-level overview of the different compatible interpretations with some pointers to further reading/watching.