Would you expect ensembling 10 10M-parameter RMs to do better than a single 100M-parameter RM?
I haven’t run many ensembling experiments, so I don’t have a very confident prediction. However, my guess is that the ensemble will do a lot worse than the single large RM if done the naive way (training each small RM on a different split or shuffling of the data, with maximum likelihood), simply because ensembling doesn’t seem to work very well. I suspect it may be possible to do something fancier, but I haven’t looked too deeply into it.
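To make the "naive" recipe concrete, here is a minimal PyTorch sketch of what it might look like: each small RM is trained with a maximum-likelihood (Bradley–Terry) pairwise loss on its own shuffle of the preference data, and the ensemble reward is the average of the members' scores. All names, dimensions, and the synthetic data below are illustrative assumptions, not the setup from any actual experiment.

```python
import torch
import torch.nn as nn

EMB_DIM = 64    # hypothetical feature dimension of a response
N_MODELS = 10   # number of small RMs in the ensemble
N_PAIRS = 512   # number of (chosen, rejected) preference pairs

class SmallRM(nn.Module):
    """Tiny stand-in reward model: response features -> scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def train_rm(rm: SmallRM, chosen: torch.Tensor, rejected: torch.Tensor,
             epochs: int = 20) -> SmallRM:
    """Maximum-likelihood training on pairwise preferences:
    maximize log sigmoid(r(chosen) - r(rejected))."""
    opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = -torch.nn.functional.logsigmoid(rm(chosen) - rm(rejected)).mean()
        loss.backward()
        opt.step()
    return rm

# Synthetic preference pairs (placeholder for real comparison data).
torch.manual_seed(0)
chosen = torch.randn(N_PAIRS, EMB_DIM) + 0.5
rejected = torch.randn(N_PAIRS, EMB_DIM)

# Train each ensemble member on its own shuffle of the same pairs.
ensemble = []
for _ in range(N_MODELS):
    perm = torch.randperm(N_PAIRS)
    ensemble.append(train_rm(SmallRM(EMB_DIM), chosen[perm], rejected[perm]))

# Ensemble reward: average the members' scores.
def ensemble_reward(x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        return torch.stack([rm(x) for rm in ensemble]).mean(dim=0)

print(ensemble_reward(chosen[:3]))
```

Note that with a fixed dataset, shuffling only changes minibatch order and initialization, so the members end up highly correlated; that correlation is one plausible reason the naive ensemble would underperform a single model with 10x the parameters.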