Abhimanyu Pallavi Sudhir comments on o1: A Technical Primer

Abhimanyu Pallavi Sudhir 10 Dec 2024 14:29 UTC
LW: 5 AF: 2

we’ll elide all of the subtle difficulties involved in actually getting RL to work in practice

I haven’t properly internalized the rest of the post, but this confuses me because I thought this post was about the subtle difficulties.

The RL setup itself is straightforward, right? An MDP where S is the space of strings, A is the set of strings < n tokens, P(s'|s,a)=append(s,a) and reward is given to states with a stop token based on some ground truth verifier like unit tests or formal verification.
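
A minimal sketch of that MDP in Python, to make the pieces concrete. It reads P(s'|s,a)=append(s,a) as a deterministic transition s' = append(s, a). The `STOP` token, the tuple-of-tokens representation, and the `verifier` callable are illustrative assumptions, not anything OpenAI has confirmed about o1.

```python
from typing import Callable, Tuple

# Assumption: a "string" is represented as a tuple of tokens.
State = Tuple[str, ...]
Action = Tuple[str, ...]

STOP = "<stop>"  # hypothetical stop token


def is_terminal(state: State) -> bool:
    """A state is terminal once it ends with the stop token."""
    return len(state) > 0 and state[-1] == STOP


def step(state: State, action: Action, n: int) -> State:
    """Deterministic transition: s' = append(s, a), with 0 < len(a) < n."""
    if is_terminal(state):
        raise ValueError("cannot act in a terminal state")
    if not 0 < len(action) < n:
        raise ValueError("action must contain between 1 and n - 1 tokens")
    return state + action


def reward(state: State, verifier: Callable[[State], bool]) -> float:
    """Reward is given only at terminal states, as judged by a verifier."""
    if not is_terminal(state):
        return 0.0
    # The verifier stands in for unit tests or a formal proof checker.
    return 1.0 if verifier(state[:-1]) else 0.0


def toy_verifier(tokens: State) -> bool:
    """Toy ground truth: accept if the last token before the stop is '4'."""
    return len(tokens) > 0 and tokens[-1] == "4"


s0: State = ("2+2=",)
s1 = step(s0, ("4", STOP), n=4)
print(s1, reward(s0, toy_verifier), reward(s1, toy_verifier))
```

Nothing here addresses credit assignment, exploration, or how the policy is actually trained; it only pins down the states, actions, transition, and sparse terminal reward described above.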
- Jesse Hoogland 11 Dec 2024 17:22 UTC
  LW: 2 AF: 1
  The RL setup itself is straightforward, right? An MDP where S is the space of strings, A is the set of strings < n tokens, P(s’|s,a)=append(s,a) and reward is given to states with a stop token based on some ground truth verifier like unit tests or formal verification.
  I agree that this is the most straightforward interpretation, but OpenAI have made no commitment to sticking to honest and straightforward interpretations. So I don’t think the RL setup is actually that straightforward.
  
  If you want more technical detail, I recommend watching the Rush & Ritter talk (see also slides and bibliography). This post was meant as a high-level overview of the different compatible interpretations with some pointers to further reading/watching.