Alignment researchers, how useful is extra compute for you?
TLDR
If you work in AI alignment / safety research, please fill out this form about how useful access to extra compute would be for your research. It should take under 10 minutes, and you don’t need to read the rest of this post beforehand; in fact, it would be great if you could fill out the form right now.
Introduction
I want to get an idea of how much demand there is for a university-independent organization that manages a compute cluster for academic AI alignment groups and independent researchers. Currently I don’t know anybody who is willing to run such an organization, but if demand is large, one could either actively look for people to run such a project or find an existing organization willing to take it on.
Main idea
Non-industry AI safety research organizations have a hard time procuring compute. Groups spend many researcher-hours on managing servers on a relatively small scale. Common obstacles are 1) having to deal with university bureaucracy (e.g. regarding hiring, engineer wages, and procurement) and 2) missing out on economies of scale.
Proposal: A university-independent organization that provides access to compute for academic AI alignment research groups as well as independent researchers. Such an organization could pay high wages for its engineers (compared to academic labs) and benefit from economies of scale.
Potential benefits
Time saved: currently, researchers spend time applying for compute grants, setting up and maintaining servers, and setting up software environments when switching between systems. Easy access to large amounts of compute would avoid most of these time costs.
Expanded capability to do research: a centralized organization could afford to manage larger compute clusters than those usually used by individual labs. The difference is even larger for independent researchers, who might not have access to large-memory GPUs at all.
Potential problems
Gatekeeping: just like with other resources such as funding, deciding who gets access can be hard and risks becoming a political problem. OTOH, subject-specific grants are common and accepted within academia. Still, access would have to be managed carefully.
Demand: some groups have access to large university clusters and may not need this service. I’m currently uncertain about how large the demand for this is.
Leadership: even if it were clear that this is a good idea, I don’t know of anyone who is willing to run this project. This seems like a solvable problem though, once we have a clear idea of what the demand is.
Form
If you haven’t already, please fill out this form about how much extra compute might accelerate your research (<10 mins).
One suggestion I would make: don’t run your own cluster at all (is that really a core competency or skill of yours compared to the hyperscalers?), and simply give alignment researchers GCP funding. TRC (TPU Research Cloud) has a ton of TPUs which they give out like candy for research purposes, but which go highly underused, in considerable part because TRC covers only the TPU costs, not the accompanying GCP costs. (It is bizarre and perverse, and there is no sign of the situation changing no matter who I complain to at Google.) There turn out to be a lot of people who can’t just pony up <$500/month for the GCP costs, even to unlock >$50k/month of TPU time. So, they don’t.
You could probably do this inside GCP itself by setting up an org or group where you pay the bills and people apply to have their accounts added as quasi-root users who can spin up whatever buckets or VMs they need to drive their TRC quotas. (This is sort of how Tensorfork operated.)
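To make that concrete, here is a minimal sketch (in Python, via google-api-python-client) of just the access-granting step, assuming a hypothetical org-paid project called alignment-compute-demo and a hypothetical researcher email; billing linkage, quotas, and guardrails are left out.

```python
# Hypothetical sketch: grant an approved researcher broad ("quasi-root")
# access to a project whose bills the org pays. Project ID and email are
# placeholders, not a real setup; assumes application-default credentials.
from googleapiclient import discovery

PROJECT_ID = "alignment-compute-demo"       # org-owned, org-billed project
RESEARCHER = "user:researcher@example.com"  # approved applicant

crm = discovery.build("cloudresourcemanager", "v1")

# Read-modify-write the project's IAM policy: add the researcher to the
# Editor role so they can spin up the VMs and buckets their TRC quota needs.
policy = crm.projects().getIamPolicy(resource=PROJECT_ID, body={}).execute()
bindings = policy.setdefault("bindings", [])
for binding in bindings:
    if binding["role"] == "roles/editor":
        binding["members"].append(RESEARCHER)
        break
else:
    bindings.append({"role": "roles/editor", "members": [RESEARCHER]})

crm.projects().setIamPolicy(resource=PROJECT_ID, body={"policy": policy}).execute()
```

In practice you would also want per-project budgets or quota caps, so that no single account can burn through the shared billing account.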
Thanks for the rec! I knew TRC was awesome but wasn’t aware you could get that much compute.
Still, beyond short-term needs this seems like a risky strategy: TRC is basically a charity project that AFAIK could be shut down at any time.
Overall this updates me towards “we should very likely do the GCP funding thing; if that works out, setting up a shared cluster is much less urgent. A shared cluster still seems like the safer option in the medium to long term, if there is enough demand for it to be worthwhile.”
Curious whether you disagree with any of this.
Yes, it’s possible TRC could shut down or scale back its grants. But then you are no worse off than you are now. Meanwhile, if you begin building up a shared cluster as a backup or alternative, you are losing the time-value of the money/research, and the hardware will be increasingly obsolete in power and efficiency. And you aren’t really at much ‘risk’: a shutdown means a researcher switches gears for a bit or has to pay normal prices like everyone else, but there’s no catastrophic outcome like going bankrupt. OK, you lose the time and effort you invested in learning GCP and setting up such an ‘org’ in it, but that’s small potatoes: buying a single A100 probably costs more! For DL researchers, the rent-vs-buy dichotomy is always heavily skewed towards ‘rent’. (Going the GCP route has other advantages, too: getting up and running faster, building a name and practical experience, and building a community that would even be interested in using your hypothetical shared cluster.)
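To put rough numbers on that rent-vs-buy skew (all figures below are illustrative assumptions, not quotes; check current prices):

```python
# Illustrative rent-vs-buy arithmetic with assumed round numbers.
A100_PURCHASE = 15_000   # assumed: one A100 plus its share of a host, USD
RENT_PER_HOUR = 3.00     # assumed: on-demand cloud A100 price, USD/hour
UTILIZATION = 0.25       # assumed: fraction of hours the GPU is actually busy

hours_to_break_even = A100_PURCHASE / RENT_PER_HOUR
years_to_break_even = hours_to_break_even / (24 * 365 * UTILIZATION)
print(f"Break-even after {hours_to_break_even:,.0f} rented GPU-hours "
      f"(~{years_to_break_even:.1f} years at {UTILIZATION:.0%} utilization)")
# -> Break-even after 5,000 rented GPU-hours (~2.3 years at 25% utilization)
```

Under those assumptions, by the time a purchased GPU breaks even it is a hardware generation old, which is the obsolescence point above.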
Looking forward to seeing the survey results!
By the way, if you’re an alignment researcher and compute is your bottleneck, please send me a DM. EleutherAI already has a lot of compute resources (as well as a great community for discussing alignment and ML!), and we’re very interested in providing compute for alignment researchers with minimal bureaucracy.