Good to know! So I guess people were expecting that every company is running a “check if canary string is anywhere in our entire dataset and remove document if so” function?
Yes, if only because the contents of BIG-BENCH might get copied to many different places, and manually filtering for them would be cumbersome or error-prone. And there’s not much lost by filtering all the content containing the canary string: realistically, only a very small number of people used it to prevent text from getting into training data.
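For concreteness, here is a minimal sketch of what such a filter might look like. I don't know how any particular company actually implements this; the function name `drop_canary_documents` and the `CANARY` value are made up for illustration, and the string below is a placeholder, not the real BIG-BENCH canary GUID.

```python
from typing import Iterable, Iterator

# Placeholder for illustration only: this is NOT the real canary GUID.
CANARY = "EXAMPLE-CANARY-GUID-0000"


def drop_canary_documents(
    documents: Iterable[str], canary: str = CANARY
) -> Iterator[str]:
    """Yield only the documents that do not contain the canary string."""
    for doc in documents:
        # Plain substring check: any document mentioning the canary is
        # dropped whole, wherever it was copied from.
        if canary not in doc:
            yield doc


corpus = [
    "An ordinary web page about cooking.",
    "Benchmark task file. canary GUID EXAMPLE-CANARY-GUID-0000",
    "A forum post that quotes a benchmark: EXAMPLE-CANARY-GUID-0000 ...",
]
kept = list(drop_canary_documents(corpus))
```

Note that the third document gets dropped too, even though it only quotes the benchmark. That is the "not much lost" trade-off: the filter is deliberately blunt, so it catches copies without anyone having to track where they ended up.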