Yep, I think I agree. I didn’t originally understand the point you made about systematic anti-correlation.
If I understand correctly, the issue is something like:
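There are 10 India-related statements, exactly 5 of which are false and 5 of which are true.
We do a random split of all the data, so if there are more true+India statements in train, there will necessarily be more false+India statements in test: any excess in one split is mirrored by the opposite excess in the other.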
There are of course various fixes to make the data actually IID.
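
To make the anti-correlation above concrete, here is a minimal sketch of the toy case. The function names, the 80/20 split fraction, and the seeds are my own assumptions for illustration, not anything from the actual setup. It assigns each of the 10 India statements to train or test at random and compares the true-minus-false excess in each split.

```python
import random


def split_india(n_true=5, n_false=5, train_frac=0.8, seed=0):
    """Randomly assign each India statement's label to train or test."""
    rng = random.Random(seed)
    labels = [True] * n_true + [False] * n_false
    train, test = [], []
    for label in labels:
        # Each statement lands in train with probability train_frac.
        (train if rng.random() < train_frac else test).append(label)
    return train, test


def excess_true(labels):
    """Number of true statements minus number of false statements."""
    n_true = sum(labels)
    return n_true - (len(labels) - n_true)


if __name__ == "__main__":
    for seed in range(5):
        train, test = split_india(seed=seed)
        # With 5 true and 5 false overall, the two excesses always sum to zero.
        print(seed, excess_true(train), excess_true(test))
```

Because the topic is balanced overall, the two excesses always cancel, so whatever the model learns about "India implies true" from train points the wrong way on test.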