I do think open sourcing is better, because there already was a lot of public attention and results on llm capabilities which are messy and misleading, and open sourcing one eval like this might improve our understanding a lot. Also, there are tons of llm agent projects/startups trying to build hype, so if you drop a benchmark here you are unlikely to attract unwanted attention (i’m guessing). I largely agree with https://www.lesswrong.com/posts/fRSj2W4Fjje8rQWm9/thoughts-on-sharing-information-about-language-model
I do think open sourcing is better, because there already was a lot of public attention and results on llm capabilities which are messy and misleading, and open sourcing one eval like this might improve our understanding a lot. Also, there are tons of llm agent projects/startups trying to build hype, so if you drop a benchmark here you are unlikely to attract unwanted attention (i’m guessing). I largely agree with https://www.lesswrong.com/posts/fRSj2W4Fjje8rQWm9/thoughts-on-sharing-information-about-language-model