Interestingly, fine-tuning does better than the other methods on multi-answer, but not that well on multiply-divide. I would have expected the opposite given the training task. For example, the model could have guessed the probability just by looking at the number of digits involved in the operation.
It’s task-dependent.
Ok, so the model is trained on the add-subtract task with different subtasks (1 digit, 2 digits, …) and then evaluated on multi-answer and multiply-divide.
Hmm, I do not understand why.
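To make the digit-count idea concrete, here is a minimal sketch of what such a shortcut could look like: record how often answers were correct for each operand length on the training task, then reuse that rate as the confidence on a new problem. Everything here is an assumption for illustration (the function names, the choice of "digits in the larger operand" as the feature, and the made-up training records); it is not the method from the paper.

```python
from collections import defaultdict


def digit_count(a: int, b: int) -> int:
    """Digits in the larger operand (a hypothetical difficulty feature)."""
    return max(len(str(abs(a))), len(str(abs(b))))


def fit_digit_heuristic(examples):
    """Map each digit count to the empirical accuracy seen in training.

    `examples` is an iterable of (a, b, was_correct) tuples.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for a, b, was_correct in examples:
        d = digit_count(a, b)
        totals[d] += 1
        hits[d] += int(was_correct)
    return {d: hits[d] / totals[d] for d in totals}


def predict_confidence(table, a: int, b: int, default: float = 0.5) -> float:
    """Look up the stored accuracy; fall back to `default` for unseen sizes."""
    return table.get(digit_count(a, b), default)


# Made-up records for illustration only, not results from the paper.
train = [
    (3, 4, True), (7, 2, True),            # 1-digit operands
    (45, 31, True), (82, 19, False),       # 2-digit operands
    (512, 377, False), (640, 128, False),  # 3-digit operands
]
table = fit_digit_heuristic(train)
confidence = predict_confidence(table, 12, 9)
```

If the model had learned something like this on add-subtract, the same lookup would apply directly to multiply-divide problems, since they also have operands with a digit count. Multi-answer questions offer no such feature, which is why the reported result is surprising.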