Your quoted cost for training the model is for training such a model **once**. That is not how researchers actually do it; they train the models many times with different hyperparameters. I have no idea, however, how hyperparameter tuning is done at such scales, but I guarantee that the total compute cost is higher than the cost of training it just once.
And OA trained GPT-3-175b **once**, it looks like: note the part where they say they didn’t want to train a second run to deal with the data contamination issue because of cost. (You can do this without it being a shot in the dark because of the scaling laws.)
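To give a sense of what "not a shot in the dark" means here, below is a minimal sketch of the scaling-law extrapolation idea: fit a power law of loss against compute on cheap small-scale runs, then predict the loss of the single large run before committing to it. This is only an illustration under assumed numbers (the data points and the `scaling_law` form are made up for the example, not OpenAI's actual fits or procedure).

```python
# Hedged sketch: extrapolating a loss-vs-compute scaling law from small runs.
# All data points below are invented for illustration; only the ~3,640 PF-days
# figure corresponds to GPT-3-175B's reported training compute.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(compute, a, b, c):
    # Simple power law with an irreducible-loss floor c: L(C) = a * C^(-b) + c.
    return a * compute ** (-b) + c

# Hypothetical (compute in PF-days, validation loss) pairs from small models.
compute = np.array([0.01, 0.1, 1.0, 10.0, 100.0])
loss    = np.array([5.2, 4.1, 3.4, 2.9, 2.6])

# Fit the curve to the cheap runs.
params, _ = curve_fit(scaling_law, compute, loss, p0=[3.0, 0.1, 1.5], maxfev=10000)

# Extrapolate to the compute budget of the one big run.
big_run_compute = 3640.0  # ~GPT-3-175B's reported training compute, in PF-days
predicted_loss = scaling_law(big_run_compute, *params)
print(f"Predicted loss at {big_run_compute:.0f} PF-days: {predicted_loss:.2f}")
```

The point is that the expensive hyperparameter and architecture sweeps happen at small scale, and the fitted curve tells you roughly what the single full-scale run should achieve, so you don't need to train the 175B model repeatedly to know whether it worked.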