If you just google the string, there are many instances of people sharing it verbatim. Would be good to do further testing to know if it was actually trained on the benchmark or learned through many other sources.
The canary string was supposed to be a magic opt out button.
Good to know! So I guess people were expecting that every company is running a “check if canary string is anywhere in our entire dataset and remove document if so” function?
Yes, if only because the contents of BIG-BENCH might get copied to many different places, and manually filtering for them would be cumbersome and error-prone. And there’s not much lost by filtering out all content containing the canary string; realistically, only a very small number of people have used it, and they used it to keep text out of training data.
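For concreteness, here is a minimal sketch of what such a "drop any document containing the canary" pass might look like. It is not any company's actual pipeline: the function names are made up for illustration, and `CANARY_MARKERS` holds a placeholder value rather than the real BIG-BENCH GUID.

```python
from typing import Iterable, Iterator, Sequence

# Placeholder only (an assumption for this sketch): substitute the real
# canary string(s) you want to honor.
CANARY_MARKERS: Sequence[str] = (
    "canary GUID 00000000-0000-0000-0000-000000000000",
)


def contains_canary(text: str, markers: Sequence[str] = CANARY_MARKERS) -> bool:
    """Return True if any marker occurs anywhere in text, ignoring case."""
    lowered = text.lower()
    return any(marker.lower() in lowered for marker in markers)


def drop_canary_documents(
    documents: Iterable[str], markers: Sequence[str] = CANARY_MARKERS
) -> Iterator[str]:
    """Yield only the documents that contain none of the markers."""
    for doc in documents:
        if not contains_canary(doc, markers):
            yield doc


if __name__ == "__main__":
    corpus = [
        "An ordinary web page.",
        "Benchmark task data. CANARY GUID 00000000-0000-0000-0000-000000000000",
    ]
    kept = list(drop_canary_documents(corpus))
    print(f"kept {len(kept)} of {len(corpus)} documents")
```

Note that a plain substring check like this also drops pages that merely quote the string, such as forum posts discussing it. That is the trade-off mentioned above: a little unrelated text is lost in exchange for not having to track down every copy of the benchmark by hand.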