Hm… It might be hard to distinguish between ‘it is devoting more capacity to implicitly planning rhymes better, and that is why it can choose a valid rhyme’ and ‘it is putting more weight on the “same” amount of rhyme-planning and merely suppressing valid non-rhyme completions (such as ending the poem and adding a text commentary on it, or starting a new poem, both common in base models), so it always chooses a valid rhyme’, particularly since it may be mode-collapsing onto the most confident rhymes, distorting the pseudo “log probs” even further. The RL model might be doing more planning internally but then picking only the single safest rhyme, so you can’t read anything off the logprobs, I don’t think. I’m also not sure you can infer any degree of planning by, say, giving it a half-written line and seeing how badly it screws up… And you can’t build a search tree to quantify it nicely as ‘how far do I need to expand the tree to get a valid rhyme’, because LM search trees are full of degeneracy and loops, and most of the tree is off-policy, so it would again be hard to tell what anything meant: the RL model is never used with tree search in any way, and anywhere beyond the argmax choice it is off-policy; it was never supposed to go there, and performance may be arbitrarily bad because it learned to choose while assuming it would always be on-policy. Hard.
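To make the confound concrete, here is a minimal sketch of the ambiguous logprob measurement, assuming Hugging Face `transformers` and two hypothetical checkpoint names (`base_model`, `rl_model`); the rhyme list, prompt, and first-token approximation are all illustrative assumptions, not a real protocol:

```python
# Sketch: compare how much next-token probability mass each model puts on a
# set of valid rhyme words. Per the argument above, a higher value for the
# RL model is ambiguous: it could mean better implicit planning, or merely
# suppressed non-rhyme continuations (ending the poem, commentary, etc.)
# plus mode-collapse onto the safest rhyme.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def rhyme_mass(model_name: str, prompt: str, rhymes: list[str]) -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    # Sum the probability of each rhyme's *first* token -- a crude proxy,
    # since multi-token rhymes and tokenizer quirks blur the estimate.
    first_ids = {tok(" " + w, add_special_tokens=False).input_ids[0]
                 for w in rhymes}
    return sum(probs[i].item() for i in first_ids)

prompt = "Roses are red, violets are blue,\nSugar is sweet, and so are"
rhymes = ["you", "too", "few", "new", "true"]  # candidate rhymes for "blue"
for name in ("base_model", "rl_model"):  # hypothetical checkpoint names
    print(name, rhyme_mass(name, prompt, rhymes))
```

Even with a perfect rhyme list, the two stories above predict the same direction of change in this number, which is the whole problem.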
This might be a good test-case or goal for interpretability research: “can you tell me if this model is doing more planning [of rhymes] than another similar model?”