Why did you opt for recovered accuracy as your metric? If I understand correctly, a random probe would achieve 100% recovered accuracy. Are you certain your metric doesn’t bias towards ineffective probes?
Yes, I’m also curious about this @mishajw, did you check the actual accuracy of the different probes?
(Apologies, been on holiday.)
For recovered accuracy, we select a single threshold per dataset, taking the best value across all probe algorithms and datasets. So a random probe would be compared to the best probe algorithm on that dataset, and likely perform poorly.
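For concreteness, here is a minimal sketch of how I read this threshold selection. The function names, the threshold grid, and the normalisation by the best probe's accuracy are my assumptions for illustration, not taken from the report:

```python
import numpy as np

def recovered_accuracy(scores_by_probe, labels, thresholds=np.linspace(0, 1, 101)):
    """Sketch of recovered accuracy for a single dataset.

    scores_by_probe: dict of probe name -> array of scores in [0, 1]
    labels: binary labels for the dataset's examples

    Assumption: one shared threshold per dataset, chosen as the value that
    maximises accuracy across all probe algorithms; each probe is then
    scored at that threshold relative to the best probe's accuracy.
    """
    def acc(scores, t):
        # Fraction of examples where thresholded score matches the label.
        return np.mean((scores >= t) == labels)

    # Single threshold per dataset: the best value across all probes.
    best_acc, best_t = max(
        (acc(s, t), t) for s in scores_by_probe.values() for t in thresholds
    )
    # Each probe's accuracy at the shared threshold, as a fraction of the best.
    return {name: acc(s, best_t) / best_acc for name, s in scores_by_probe.items()}
```

Under this reading, a random probe is evaluated at a threshold tuned for the best probe on that dataset, so it would score well below 100%.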
I did check the thresholds used for recovered accuracy, and they seemed sensible, but I didn’t put this in the report. I’ll try to find time next week to put this in the appendix.