Any thoughts on how useful existing benchmarks like WMDP-cyber would be; or, alternately, on how difficult it would be to develop a similar benchmark, but more tailored vs. secret collusion?
That’s a great question. I am not quite sure but WMDP-cyber does look relevant. If you are interested in working on a new benchmark for unlearning and secret collusion, do reach out to us!
Any thoughts on how useful existing benchmarks like WMDP-cyber would be; or, alternately, on how difficult it would be to develop a similar benchmark, but more tailored vs. secret collusion?
That’s a great question. I am not quite sure but WMDP-cyber does look relevant. If you are interested in working on a new benchmark for unlearning and secret collusion, do reach out to us!