Worth pointing out: the original GPT-4 paper explicitly says that OpenAI found BIG-bench had contaminated their training data, and that they therefore excluded it as an evaluation. As far as I know, there was no similar disclosure for Claude or other models. See footnote 5 here: https://arxiv.org/abs/2303.08774v1