The largest dense models trained over the public Internet on commodity hardware that I’m aware of are ALBERT & SwAV.
It comes at a cost of roughly 3x that of a well-connected cluster. (This is with a model that fits in each node, so there is no additional overhead from model parallelism.) I’m not sure whether I’d expect >GPT-3-scale models to do much worse or not. In any case, the question here is whether the glass is 1/3rd full or 2/3rds empty.
Crowdsourcing might be much more feasible in the DRL setting. Leela was able to crowdsource very effectively because individual nodes can spend a lot of local time deeply evaluating the game tree and return only small board-state values, so it’s embarrassingly parallel.
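To make the communication asymmetry concrete, here is a minimal toy sketch (not Leela’s actual code or protocol; the game, the worker function, and the “upload” are all hypothetical stand-ins): a volunteer node burns through thousands of local self-play games and only ships back a tiny table of aggregated state-value statistics, which is why the workload parallelizes so trivially over slow home connections.

```python
"""Toy sketch of a crowdsourced self-play worker.

Assumptions (not from the original post): a made-up random-walk 'game',
a worker that aggregates outcomes locally, and an upload that is just a
print statement standing in for a real network call.
"""
import random
from collections import defaultdict


def play_random_game(max_moves=50):
    """Stand-in for one self-play game: returns (visited states, outcome)."""
    state, visited = 0, []
    for _ in range(max_moves):
        visited.append(state)
        state += random.choice((-1, +1))   # toy "move"
        if abs(state) >= 5:                # toy terminal condition
            return visited, 1.0 if state > 0 else -1.0
    return visited, 0.0                    # draw


def run_worker(n_games=10_000):
    """Heavy local compute: play many games, aggregate per-state outcomes."""
    value_sum = defaultdict(float)
    value_count = defaultdict(int)
    for _ in range(n_games):
        visited, outcome = play_random_game()
        for s in visited:
            value_sum[s] += outcome
            value_count[s] += 1
    # The only thing that would cross the network: a tiny state -> value table.
    return {s: value_sum[s] / value_count[s] for s in value_sum}


if __name__ == "__main__":
    n_games = 10_000
    summary = run_worker(n_games)
    print(f"uploading {len(summary)} state-value estimates after {n_games} local games")
```

The point of the toy is the ratio: thousands of games’ worth of local compute collapses into a handful of numbers to transmit, so bandwidth and latency between volunteers barely matter, unlike gradient exchange in dense-model training.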