One suggestion I would make: don’t run your own cluster at all (is that really a core competency or skill of yours compared to the hyperscalers?), and simply give alignment researchers GCP funding. TRC (TPU Research Cloud) has a ton of TPUs which they give out like candy for research purposes, but which go highly underused, in considerable part because TRC covers only the TPU costs, not the associated GCP costs (the VMs, storage buckets, etc. needed to drive the TPUs). (It is bizarre and perverse, and there is no sign of the situation changing no matter whom I complain to at Google.) There turn out to be a lot of people out there who can’t just pony up the <$500/month of GCP costs, even to unlock >$50k/month of TPU time. So, they don’t.
You could probably do this inside GCP itself by setting up an org or group where you pay the bills and people apply to have their accounts added as quasi-root users who can spin up whatever buckets or VMs they need to drive their TRC quotas. (This is sort of how Tensorfork operated.)
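For concreteness, here is a minimal sketch of that setup, driving the gcloud CLI from Python (this assumes gcloud is installed and authenticated as an org admin; the org ID, billing account, per-researcher project naming, and researcher email are all hypothetical placeholders):

```python
import subprocess

def run(cmd):
    """Echo and run a gcloud command, failing loudly on error."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

ORG_ID = "123456789"                      # hypothetical GCP organization ID
BILLING_ACCOUNT = "AAAAAA-BBBBBB-CCCCCC"  # hypothetical billing account you pay
PROJECT_ID = "alignment-trc-alice"        # one project per researcher (assumed layout)
RESEARCHER = "alice@example.com"          # hypothetical researcher account

# Create a project for the researcher under the shared org.
run(["gcloud", "projects", "create", PROJECT_ID, f"--organization={ORG_ID}"])

# Link the org's billing account, so the org (not the researcher) pays the GCP bills.
run(["gcloud", "beta", "billing", "projects", "link", PROJECT_ID,
     f"--billing-account={BILLING_ACCOUNT}"])

# Grant the researcher broad ("quasi-root") rights on their own project only,
# enough to spin up the buckets and VMs that drive their TRC quota.
run(["gcloud", "projects", "add-iam-policy-binding", PROJECT_ID,
     f"--member=user:{RESEARCHER}", "--role=roles/editor"])
```

Scoping each researcher to their own project keeps the blast radius small while still letting them self-serve.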
Thanks for the rec! I knew TRC was awesome but wasn’t aware you could get that much compute.
Still, beyond short-term needs it seems like this is a risky strategy. TRC is basically a charity project that AFAIK could be shut down at any time.
Overall this updates me towards “we should very likely do the GCP funding thing. If this works out fine, setting up a shared cluster is much less urgent. A shared cluster still seems like the safer option in the mid-to-long term, if there is enough demand for it to be worthwhile.”
Curious if you disagree with any of this.
Yes, it’s possible TRC could shut down or scale back its grants. But then you are no worse off than you are now. Whereas if you start building up a shared cluster now as a backup or alternative, you are losing the time-value of the money and the research, and the hardware will be increasingly obsolete in power and efficiency by the time you need it. Nor are you really at much ‘risk’: a shutdown means a researcher switches gears for a bit, or has to pay normal prices like everyone else; there is no catastrophic outcome like going bankrupt. OK, you lose the time and effort you invested in learning GCP and setting up such an ‘org’ in it, but that’s small potatoes: probably buying a single A100 costs more! For DL researchers, the rent-vs-buy dichotomy is always heavily skewed towards ‘rent’. (Going the GCP route has other advantages too: getting up and running faster, and building up a name, practical experience, and a community who would even be interested in using your hypothetical shared cluster.)
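To put a rough number on the rent-vs-buy claim, a back-of-envelope break-even sketch (all prices here are assumptions for illustration, not quotes, and ownership overheads are ignored):

```python
# Back-of-envelope rent-vs-buy break-even for a single accelerator.
A100_PURCHASE = 15_000      # assumed hardware cost, USD (ignores power, hosting, ops)
A100_RENT_PER_HOUR = 3.00   # assumed on-demand cloud price, USD/hr
UTILIZATION = 0.5           # fraction of hours the card is actually busy

hours_to_break_even = A100_PURCHASE / (A100_RENT_PER_HOUR * UTILIZATION)
print(f"break-even after ~{hours_to_break_even / 24 / 30:.0f} months of ownership")
# ≈14 months at 50% utilization -- and by then the card is a generation older.
```

Unless utilization is near-constant, renting wins well past a year, which is exactly the obsolescence window for DL hardware.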