One weakness I realized overnight is that this incentivizes branching out into new problem domains. One potential fix is to, when novel domains show up, shoehorn the big LLMs into solving that domain on the same benchmark and limit new types of models/training to what the LLMs can accomplish in that new domain.
Basically setting an initially low SOTA that can grow at the same percentage as the rest of the basket. This might prevent leapfrogging the general models with narrow ones that are mostly mesa-optimizer or similar.
One weakness I realized overnight is that this incentivizes branching out into new problem domains. One potential fix is to, when novel domains show up, shoehorn the big LLMs into solving that domain on the same benchmark and limit new types of models/training to what the LLMs can accomplish in that new domain. Basically setting an initially low SOTA that can grow at the same percentage as the rest of the basket. This might prevent leapfrogging the general models with narrow ones that are mostly mesa-optimizer or similar.