I wonder if a basket of SOTA benchmarks would make more sense. Allow no more than an X% increase in average performance across the basket per year. This would capture the FLOPS metric along with potential speedups, fine-tuning, or other strategies.
Conveniently, this is how teams already rank their models against each other, so there's ample evidence of past progress and researchers are incentivized to report accurately. There's also no incentive to "cheat" if researchers aren't allowed to publish increases on SOTA benchmarks beyond the limit (e.g. journals would say "shut it down" instead of publishing the paper), unless an actor simply wanted to jump ahead of everyone else and go for a singleton on their own, which is already an unavoidable risk without an EY-style coordinated hard stop.
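To make the cap rule concrete, here's a minimal sketch of the check as I'm imagining it. The benchmark names and scores are made up for illustration, and "performance" is treated as a simple mean of relative gains, which is one of several reasonable choices:

```python
# Hypothetical check for the "no more than X% average improvement per year"
# rule. All benchmark names and scores below are illustrative.

CAP = 0.10  # allow at most a 10% average improvement per year (assumed value)

last_year = {"benchmark_a": 62.0, "benchmark_b": 48.5, "benchmark_c": 71.2}
this_year = {"benchmark_a": 66.0, "benchmark_b": 53.0, "benchmark_c": 74.0}

def average_improvement(old, new):
    """Mean relative gain across the basket of benchmarks."""
    gains = [(new[k] - old[k]) / old[k] for k in old]
    return sum(gains) / len(gains)

gain = average_improvement(last_year, this_year)
print(f"average gain: {gain:.1%}, within cap: {gain <= CAP}")
```

A real version would need to pin down how benchmarks are normalized (raw accuracy vs. error reduction, saturation effects, etc.), which matters a lot near the top of a benchmark's range.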
Great idea! Let’s measure algorithmic improvement in the same way economists measure inflation, with a basket-of-benchmarks.
This basket can itself be adjusted over time so that it continuously reflects the current use-cases of SOTA AI.
I haven’t thought about it much, but my guess is the best thing to do is to limit training compute directly but adjust the limit using the basket-of-benchmarks.
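One way to operationalize "limit compute directly but adjust the limit using the basket" is to treat it like inflation indexing: if measured algorithmic progress ate most of the year's capability budget, the compute cap tightens so that total growth stays near the target. This is my own toy formalization, not a worked-out policy, and all numbers are assumptions:

```python
# Sketch: index the training-compute cap to the basket-of-benchmarks,
# so compute growth is only allowed to supply whatever part of the
# capability-growth budget algorithms did not. Numbers are illustrative.

TARGET_GROWTH = 0.10               # total allowed capability growth per year
compute_cap_flops = 1e25           # current cap on training compute (assumed)
measured_algorithmic_gain = 0.07   # from the basket-of-benchmarks index

# Gains compose multiplicatively: (1 + algo) * (1 + compute) <= (1 + target).
allowed_compute_growth = (1 + TARGET_GROWTH) / (1 + measured_algorithmic_gain) - 1
new_cap = compute_cap_flops * (1 + allowed_compute_growth)
print(f"allowed compute growth this year: {allowed_compute_growth:.1%}")
```

The multiplicative composition is itself an assumption; in practice the mapping from extra FLOPS to benchmark gains is nonlinear and would need to be estimated.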
One weakness I realized overnight is that this incentivizes branching out into new problem domains. One potential fix is, when a novel domain shows up, to shoehorn the big LLMs into solving that domain on the same benchmark and limit new types of models/training to what the LLMs can accomplish there.
Basically, set an initially low SOTA that can grow at the same percentage as the rest of the basket. This might prevent leapfrogging the general models with narrow ones that are mostly mesa-optimizers or similar.
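The "initially low SOTA" rule above can be sketched as follows: peg the new domain's ceiling to whatever the existing general models score on its benchmark, then let that ceiling grow at the shared basket rate. The baseline score and growth rate are assumed values for illustration:

```python
# Sketch of the new-domain rule: the ceiling starts at the general
# models' score and compounds at the same rate as the rest of the
# basket, so narrow models cannot leapfrog. Numbers are illustrative.

BASKET_GROWTH = 0.10   # yearly allowed growth, shared with the basket
llm_baseline = 35.0    # what the big general models score on the new benchmark

def ceiling(years_since_added):
    """Maximum publishable score in the new domain after n years."""
    return llm_baseline * (1 + BASKET_GROWTH) ** years_since_added

for year in range(3):
    print(year, round(ceiling(year), 2))
```

One open question this leaves: if the general models score near zero in the new domain, the ceiling starts so low that compounding takes a long time to permit any useful narrow model at all.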