Looking at your code—seems like there’s an option for next-token prediction in the initial finetuning stage, but no mention (that I can find) in the paper—am I correct in assuming the next-token prediction weight was set to 0? (apologies for bugging you on this stuff!)
That’s right. We initially thought it might be important so that the LLM “understood” the task better, but it didn’t matter much in the end. The main hyperparameters for our experiments are in train_ray.py, where you can see that we use a “token_loss_weight” of 0.
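For anyone skimming this thread: the weight just scales an auxiliary next-token prediction loss that gets added to the main finetuning objective, so setting it to 0 drops that term entirely. Here's a minimal illustrative sketch of that idea (the function name, argument shapes, and `combined_loss` helper are made up for this example, not copied from train_ray.py):

```python
import torch
import torch.nn.functional as F

def combined_loss(task_loss: torch.Tensor,
                  lm_logits: torch.Tensor,
                  target_ids: torch.Tensor,
                  token_loss_weight: float = 0.0) -> torch.Tensor:
    """Blend the main finetuning objective with an auxiliary next-token loss.

    With token_loss_weight=0, the auxiliary term vanishes and only the
    task loss drives training.
    """
    # Standard next-token prediction: predict token t+1 from positions up to t,
    # so shift logits and targets by one position relative to each other.
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        target_ids[:, 1:].reshape(-1),
    )
    return task_loss + token_loss_weight * lm_loss
```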
(Feel free to ask more questions!)