For recovered accuracy, we select a single threshold per dataset, taking the best value across all probe algorithms and datasets. So a random probe would be compared to the best probe algorithm on that dataset, and would likely perform poorly.
I did check the thresholds used for recovered accuracy, and they seemed sensible, but I didn’t put this in the report. I’ll try to find time next week to put this in the appendix.
(Apologies, been on holiday.)