Thanks for the rec! I knew TRC was awesome but wasn’t aware you could get that much compute.
Still, beyond short-term needs it seems like this is a risky strategy. TRC is basically a charity project that AFAIK could be shut down at any time.
Overall this updates me towards “we should very likely do the GCP funding thing. If this works out fine, setting up a shared cluster is much less urgent. A shared cluster still seems like the safer option in the mid to long term, if there is enough demand for it to be worthwhile.”
Curious if you disagree with any of this.
Yes, it’s possible TRC could shut down or scale back its grants. But then you are no worse off than you are now. If you begin building up a shared cluster as a backup or alternative, you lose the time-value of the money/research, and the hardware will become increasingly obsolete in terms of power and efficiency. Meanwhile, relying on TRC doesn’t really put you at much ‘risk’: a shutdown means that a researcher switches gears for a bit or has to pay normal prices like everyone else, but there’s no really catastrophic outcome like going bankrupt. OK, you lose the time and effort you invested in learning GCP and setting up such an ‘org’ in it, but that’s small potatoes; buying a single A100 probably costs more! For DL researchers, the rent vs buy dichotomy is always heavily skewed towards ‘rent’. (Going the GCP route has other advantages too: getting running faster, and building up a name, practical experience, and a community who would even be interested in using your hypothetical shared cluster.)