Word on the grapevine: it sounds like they might just be adding a bunch of parameters in a way that’s cheap to train but doesn’t actually work that well (i.e. the “mixture of experts” thing).
It would be highly entertaining if ML researchers got into an arms race on parameter count, then Goodharted on it. Sounds like exactly the sort of thing I’d expect not-very-smart funding agencies to throw lots of money at. Perhaps the Goodharting would be done by the funding agencies themselves, by just funding whichever projects say they will use the most parameters, until they end up with lots of tiny nails. (Though one does worry that the agencies will find out that we can already do infinite-parameter-count models!)
That said, I haven’t looked into it enough myself to be confident that that’s what’s happening here. I’m just raising the hypothesis from entropy.
I think this take is basically correct. Restating my version of it:
Mixture of Experts and similar approaches modulate paths through the network, such that not every parameter is used every time. This means that parameters and FLOPs (floating point operations) are more decoupled than they are in dense networks (see the sketch at the end of this comment).
To me, FLOPs remains the harder-to-fake metric, but both are valuable to track moving forward.
In a funny way, even if someone is stuck in a Goodhart trap doing Language Models, it is probably better to Goodhart performance on Winograd Schemas than to just add parameters.
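To make the parameter/FLOP decoupling concrete, here is a minimal NumPy sketch of top-k expert routing. It is not any particular model's implementation; the layer sizes, expert count, and top_k value are all made up for illustration. The point is just that total parameter count scales with the number of experts, while per-token compute scales only with how many experts each token is routed to.

```python
# A minimal sketch of top-k mixture-of-experts routing, to illustrate how
# total parameter count and per-token FLOPs come apart. All sizes are made
# up for illustration; this is not any particular model's implementation.
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 64, 256      # hypothetical hidden sizes
num_experts, top_k = 8, 2    # 8 expert MLPs, but each token only uses 2

# Router plus one pair of weight matrices per expert (biases omitted).
router = rng.normal(size=(d_model, num_experts))
experts_in = rng.normal(size=(num_experts, d_model, d_ff)) * 0.02
experts_out = rng.normal(size=(num_experts, d_ff, d_model)) * 0.02

def moe_forward(x):
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router                               # (tokens, num_experts)
    chosen = np.argsort(-logits, axis=-1)[:, :top_k]  # top-k expert indices
    out = np.zeros_like(x)
    flops = 0
    for t, token in enumerate(x):
        # Softmax over only the selected experts' logits.
        sel = logits[t, chosen[t]]
        gates = np.exp(sel - sel.max())
        gates /= gates.sum()
        for gate, e in zip(gates, chosen[t]):
            h = np.maximum(token @ experts_in[e], 0)  # expert MLP with ReLU
            out[t] += gate * (h @ experts_out[e])
            flops += 2 * (2 * d_model * d_ff)         # two matmuls per expert
    return out, flops

tokens = rng.normal(size=(16, d_model))
_, moe_flops = moe_forward(tokens)

expert_params = num_experts * 2 * d_model * d_ff
dense_flops = len(tokens) * num_experts * 2 * (2 * d_model * d_ff)

print(f"expert parameters:          {expert_params:,}")
print(f"MoE matmul FLOPs (top-{top_k}):   {moe_flops:,}")
print(f"FLOPs if experts were dense: {dense_flops:,}")
# Parameters grow with num_experts; per-token compute grows only with top_k.
```

With these made-up numbers (8 experts, top-2 routing), the layer holds four times as many expert parameters as it spends FLOPs on per token, which is exactly the gap that makes raw parameter count the easier metric to inflate.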