Very cool. Something I’ve been wondering about recently is how to keep benchmarks from leaking and being optimized against by developers. I had an idea for a benchmark and thought about submitting it to BigBench, but decided against it because I was worried that the precautions against BigBench being optimized against were insufficient. Keeping its model private and exposing just a filtered API seems like a good idea for a company, but what about evaluators who want to keep their benchmarks private? Maybe there could be a special arrangement where companies agreed not to record the benchmark queries as they passed through their APIs to the model?