Not a criticism, just a note about something I wish were easier to do. I’d love to see Brier score loss for each. Brier score loss requires knowing the probability the model assigns to every possible answer, so it only applies to multiple-choice questions. It’s hard to derive through APIs as currently designed.
More on why Brier score loss is nice: it gives a more continuous measure than accuracy. See https://arxiv.org/abs/2304.15004
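To make the "more continuous" point concrete, here is a minimal sketch. It assumes the multiclass form of the Brier score (sum of squared differences between the predicted distribution and the one-hot correct answer); the two probability vectors are made up for illustration and do not come from any real model.

```python
def brier_score(probs, correct_index):
    """Multiclass Brier score for one question (lower is better).

    Sums the squared error between the predicted probability of each option
    and the one-hot target (1 for the correct option, 0 for the others).
    """
    return sum(
        (p - (1.0 if i == correct_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


def is_correct(probs, correct_index):
    """Accuracy-style grading: only the highest-probability option counts."""
    top_choice = max(range(len(probs)), key=lambda i: probs[i])
    return top_choice == correct_index


# Hypothetical 4-option question whose correct answer is option 0.
weak_model = [0.30, 0.40, 0.20, 0.10]
better_model = [0.39, 0.40, 0.11, 0.10]

for name, probs in [("weak", weak_model), ("better", better_model)]:
    print(name, is_correct(probs, 0), round(brier_score(probs, 0), 4))
```

Both hypothetical models pick option 1 and so score the same on accuracy, but the second one puts more mass on the correct answer and gets a lower Brier score.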
Hello Nathan! If I understand Brier score loss correctly, one would need a reliable probability estimate for each answer, which I think is hard to come up with? For example, if I place a probability estimate of 0% on the model I trained mentioning ‘popcorn’, it feels to me that I am introducing more bias into how I measure the improvements. Or did I misunderstand this part?
I think there’s a misunderstanding. You are supposed to ask the model for its probability estimate, not give your own probability estimate. The Brier score loss is based on the question-answerer’s probabilities over the possible answers, not the question-grader’s probabilities.
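A rough sketch of what that could look like, assuming you can get a log-probability from the model for each answer option (which, as noted, many APIs do not expose directly). The function names and the numbers in `model_logprobs` are hypothetical; the grader supplies only the index of the correct answer, never a probability.

```python
import math


def option_probabilities(option_logprobs):
    """Renormalize the model's per-option log-probabilities so they sum to 1."""
    largest = max(option_logprobs)  # subtract the max for numerical stability
    weights = [math.exp(lp - largest) for lp in option_logprobs]
    total = sum(weights)
    return [w / total for w in weights]


def brier_from_logprobs(option_logprobs, correct_index):
    """Brier score computed from the model's own distribution over the options."""
    probs = option_probabilities(option_logprobs)
    return sum(
        (p - (1.0 if i == correct_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


# Hypothetical log-probabilities the model assigned to options A through D.
model_logprobs = [math.log(0.5), math.log(0.25), math.log(0.125), math.log(0.125)]

# The grader only says which option is correct (here option A, index 0).
score = brier_from_logprobs(model_logprobs, correct_index=0)
print(score)
```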
Will look into it. Thank you for the suggestion!