We upsample code snippets such that the training dataset has 5 000 trusted data points, of which half are positive and half are negative, and 20 000 untrusted data points, of which 10% are fake negatives, 40% are real positives, 35% are completely negative, and the other 15% are equally split between the 6 ways to have some but not all of the measurements be positive.
From section 2.1.2 of the paper (emphasis mine).
Is this a typo? (My understanding was that there are no fake negatives, i.e. no examples where the diamond is in the vault but all the measurements suggest the diamond is not in the vault. Also, there are fake positives, which I believe are absent from this description.)
Yes, it’s a typo. We meant “10% are fake positives”. Thanks for catching it, I’ll update the paper.
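For anyone checking the numbers, here is a rough sketch of how the corrected split works out in absolute counts. It assumes three binary measurements, which is inferred from the "6 ways to have some but not all" wording rather than stated in the quoted passage; the variable names are made up for illustration.

```python
from itertools import product

N_TRUSTED = 5_000
N_UNTRUSTED = 20_000
N_MEASUREMENTS = 3  # assumption: 3 binary measurements, giving 2**3 - 2 = 6 mixed patterns

# Trusted set: half positive, half negative.
trusted = {"positive": N_TRUSTED // 2, "negative": N_TRUSTED // 2}

# Patterns where some but not all measurements are positive.
mixed_patterns = [
    p for p in product([0, 1], repeat=N_MEASUREMENTS)
    if 0 < sum(p) < N_MEASUREMENTS
]

# Untrusted set, using the corrected "10% are fake positives".
untrusted = {
    "fake positive": N_UNTRUSTED * 10 // 100,
    "real positive": N_UNTRUSTED * 40 // 100,
    "completely negative": N_UNTRUSTED * 35 // 100,
}
per_pattern = (N_UNTRUSTED * 15 // 100) // len(mixed_patterns)
for pattern in mixed_patterns:
    untrusted[f"mixed {pattern}"] = per_pattern

# The categories should account for every untrusted data point.
assert sum(untrusted.values()) == N_UNTRUSTED

for name, count in untrusted.items():
    print(f"{name}: {count}")
```

Under these assumptions the percentages sum to 100%, and each of the six mixed patterns gets an equal share of the remaining 15%.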