I gave the model both the PGN and the FEN on every move with this in mind. Why do you think conditioning on high-level games would help? I can see why it would for the base models, but I expect that the RLHFed models would try to play the moves which maximize their chances of winning, with or without such prompting.
but I expect that the RLHFed models would try to play the moves which maximize their chances of winning
RLHF doesn’t maximize probability of winning; it optimizes a mix of token-level predictive loss (usually added either directly as a loss term or implicitly via the KL penalty) and rater approval, and god knows what else goes on these days in the ‘post-training’ phase, muddying the waters further. Not at all the same thing. (In the same way, an RLHF model might not optimize for correctness and instead be sycophantic: “Yes master, it is just as you say!”) It’s not at all obvious to me that RLHF should be expected to make the LLMs play their hardest (a rater might focus on punishing illegal moves, or rewarding good-but-not-better-than-me moves), or that the post-training would affect it much at all: how many chess games are really going into the RLHF or post-training, anyway? (As opposed to the pretraining PGNs.) It’s hardly an important or valuable task.
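For concreteness, here is a rough sketch of the kind of objective I have in mind, assuming the commonly published PPO-with-KL-penalty form plus an optional pretraining-mix term. The symbols are illustrative assumptions ($r_{\text{rater}}$ for a learned reward model, $\pi_{\text{ref}}$ for the pre-RLHF reference policy, $\beta$ and $\gamma$ for mixing weights), not a description of what any particular lab actually runs:

```latex
\max_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}
\left[
  r_{\text{rater}}(x, y)
  \;-\; \beta \,\log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)}
\right]
\;+\;
\gamma \,\mathbb{E}_{x \sim \mathcal{D}_{\text{pretrain}}}
\left[ \log \pi(x) \right]
```

The first term is a learned proxy for rater approval, the $\beta$ term is the KL penalty pulling the policy back toward the reference model’s token predictions, and the $\gamma$ term (where present) is an explicit predictive loss on pretraining text. No term in this sketch is $P(\text{win})$; winning only enters to whatever extent raters happen to reward it.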