Would be interesting to see the transcripts for harder questions, altered enough to rule out their having appeared in the training set.
Edit: That is, it’s the originally composed (rather than remembered) transcripts themselves that I’m interested in seeing, not as a way of verifying the score. Like, with writing a quine the interesting thing is that there is no internal monologue to support the planning of how the code gets written. I wouldn’t be able to solve this problem without a bit of planning and prototyping, even if it takes place in my head, but GPT-4 just writes out the final result. For math, it’s also interesting how much of “showing the work” it takes to solve harder problems. Here’s an example of the kind of thing I mean.
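(For anyone unfamiliar: a quine is a program that prints its own source code, with no input. A minimal Python version, just to show the kind of problem being discussed — the trick is getting the string to describe itself:)

```python
# A minimal Python quine: the program's output is exactly its own source.
# The string s contains a template of the program; %r inserts s's own
# repr into that template, reproducing the full source text.
s = 's = %r\nprint(s %% s)'
print(s % s)
```

Even for this two-liner, most people would iterate a bit before getting the escaping right, which is the "planning and prototyping" the comment is pointing at.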
They did try to detect and remove test questions that appeared in the training set — see Table 10, "contamination data for exams". The contaminated questions were a tiny fraction of each exam, and removing them didn't change the scores much.
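A toy sketch of what substring-based contamination checking can look like — this is only an illustration of the general idea, not the paper's actual procedure, and the function name and threshold are made up for the example:

```python
# Toy contamination check (illustrative only): flag an exam question as
# contaminated if any fixed-length substring of it occurs verbatim in
# the training corpus. Real pipelines work over tokenized, deduplicated
# corpora at scale; this is just the core idea.

def is_contaminated(question: str, corpus: str, n: int = 50) -> bool:
    """Return True if any length-n substring of `question` appears in `corpus`."""
    if len(question) < n:
        return question in corpus
    return any(question[i:i + n] in corpus
               for i in range(len(question) - n + 1))

corpus = "The quick brown fox jumps over the lazy dog."
print(is_contaminated("quick brown fox jumps", corpus, n=10))  # True
print(is_contaminated("an entirely novel question", corpus, n=10))  # False
```

The point of removing flagged questions and re-scoring is to see whether memorization, rather than generalization, explains the exam results.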
They address the issue of questions appearing in the training data in the paper, but you could also look at questions from any SAT administered after the model's training cutoff.