Is it possible that the model learned the canary string itself, but not any of the documents that included the canary string so that they would be removed from the dataset?
Yes. The model could have trained on the repository itself (Apache-2.0 licensed on GitHub and a decent number of years old), and I’m guessing it did, based on general knowledge of the project. The string could also have snuck into web data, like this thread itself.
Additionally, while the intent here was for the data to be removed, the string could also have been placed in documents from non-benchmark datasets in an effort to get them filtered out (like papers published on arXiv). This indicates that data containing the canary string at least wasn’t filtered out, but it isn’t a sure sign of benchmark contamination, just a possible one.
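To make the distinction concrete, here is a minimal sketch of how canary filtering could work and why a leaked canary doesn't pin down its source. Everything in it is an assumption for illustration: the `CANARY` value is a made-up placeholder (not any real benchmark's string), and the corpus, source labels, and helper names are hypothetical.

```python
# Hypothetical placeholder, NOT a real benchmark canary string.
CANARY = "EXAMPLE-CANARY-0000"


def filter_canary(docs, canary=CANARY):
    """Drop every document whose text contains the canary string."""
    return [d for d in docs if canary not in d["text"]]


def canary_sources(docs, canary=CANARY):
    """Return the sorted source labels of documents that contain the canary."""
    return sorted({d["source"] for d in docs if canary in d["text"]})


# Toy corpus: only one of the canary-bearing documents is benchmark data.
corpus = [
    {"source": "benchmark", "text": f"benchmark task text {CANARY}"},
    {"source": "repo_readme", "text": f"Add this string to your files: {CANARY}"},
    {"source": "forum_thread", "text": f"Someone quoted {CANARY} in a comment."},
    {"source": "arxiv_paper", "text": f"Please exclude this paper. {CANARY}"},
    {"source": "blog", "text": "Unrelated text."},
]

# If the filter is applied, every canary-bearing document is removed.
filtered = filter_canary(corpus)

# If it is not applied, the canary can be learned from any of these sources.
leaked_from = canary_sources(corpus)
```

In this toy setup, a model that can reproduce the canary only tells you that `filter_canary` was not applied to everything; `leaked_from` shows that the benchmark is just one of several places the string could have come from.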