[Question] How do top AI labs vet architecture/algorithm changes?

How do labs working at or near the frontier assess major architecture and/or algorithm changes before committing huge compute resources to try them out? For example, how do they assess stability and sample efficiency without having to do full-scale runs?
I am not an AI researcher, nor do I have direct access to any AI research processes. So, instead of submitting an answer, I am writing this in the comment section.
I have one definite, easily shareable observation. I drew a lot of inferences from it, which I will separate out so that the reader can condition their world-model on their own interpretations of whatever pieces of evidence, if any, are unshared.
That observation is this interview, in this particular segment, with the part that seems most relevant to me occurring at roughly the 40:15 timestamp.
So, in this segment Dwarkesh is asking Sholto Douglas, a researcher at Google DeepMind, a sub-question in a discussion about how researchers see the feasibility of “The Intelligence Explosion.”
The intent of this question seems to be to get an object-level description of the workflow of an AI researcher, in order to inform the meta-question of “how is AI going to increase the rate of AI research.”
Potentially important additional detail: the other person at that table is Trenton Bricken, a “member of Technical Staff on the Mechanistic Interpretability team at Anthropic” (description according to his website).
Sholto alludes to the fact that the bulk of his work at the time of this interview does not appear directly relevant to the question, so he seems to be answering for a more generic case of AI researcher.
Sholto’s description of his work, excerpted from the “About” section of his blog hosted on GitHub:
I’m currently going after end to end learning for robotic manipulation because of the impact it could have on the real world, and the surface area of the problem in contact with understanding how to make agents learn and reason like we do.
I’m currently exploring whether self-supervised learning on play data and clever usage of language to align robot and human video in the same trajectory space can build models which provide a sufficient base that they can be swiftly fine-tuned to any manipulation task.
In the past, I’ve looked into hierarchial RL, energy models for planning, and seeing if we can learn a representation of visual inputs where optimal paths are by definition the shortest path through the transformed space.
In this segment of the podcast, Sholto talks about “scaling laws inference”, seemingly alluding to the fact that researchers have some compute budget to run experiments, and that there are agreed-upon desiderata for the metrics of those experiments, which can be used when selecting features for programs that will then be given much larger training runs.
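I don’t know what this actually looks like inside a lab, but here is a minimal sketch of the kind of “fit on small runs, extrapolate to the big run” exercise that a phrase like “scaling laws inference” suggests to me. The functional form, the loss numbers, and the compute units below are all assumptions I made up for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical validation losses from small runs of a baseline and a
# candidate architecture, at increasing compute (in units of the smallest run).
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
loss_baseline  = np.array([3.90, 3.55, 3.21, 2.95, 2.72])
loss_candidate = np.array([3.80, 3.42, 3.08, 2.83, 2.61])

def power_law(c, a, b, irreducible):
    # Assumed functional form: L(C) = a * C^(-b) + irreducible loss.
    return a * c ** (-b) + irreducible

def project(c, losses, target):
    # Fit the curve to the small-scale points and evaluate it at the target scale.
    params, _ = curve_fit(power_law, c, losses, p0=[2.5, 0.15, 1.5])
    return power_law(target, *params)

target = 1e5  # relative compute of the hypothetical large run being considered
print("baseline projected loss :", round(project(compute, loss_baseline, target), 3))
print("candidate projected loss:", round(project(compute, loss_candidate, target), 3))
```

The point is just that a handful of cheap runs plus a fitted curve gives you a projected metric at the scale you actually care about, which can then be compared across candidate changes before anyone commits to a big run.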
How do the researchers get this compute budget? Do all researchers have some compute resources available beyond just their personal workstation hardware? What does the process look like for spinning up a small-scale training run and reporting its results?
I am unsure, but I will draw some guesses from context.
Sholto mentions, in providing further context in this segment:
A lot of good research comes from working backwards from the actual problems you want to solve.
He continues to give a few sentences that seem to gesture at a part of this internal process:
There’s a couple of grand problems in making the models better that you identify as issues and then work on “how can I change things to achieve this?” When you scale you also run into a bunch of things and you want to fix behaviors and issues at scale.
This seems to imply that part of this process is receiving some ‘mission’ or set of ‘missions’ (my words, not theirs; you could say quests, tasks, or assignments), and that some group(s) of researchers then propose and run small-scale tests of solutions to those.
Does this involve taking snapshots of these models at the scale where the “behaviors or issues” appear and branching them into shorter, lower-compute continuations of training or reinforcement learning?
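I have no idea whether labs actually work this way; the following is just a toy sketch of the branching idea I’m asking about, in which you snapshot the model and optimizer state partway through a run and then launch short, cheap continuations with different interventions. The model, data, and hyperparameters are stand-ins, not anything from the interview:

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for a partially trained model and its optimizer state.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
xs, ys = torch.randn(512, 16), torch.randn(512, 1)

def train_steps(m, o, steps, lr=None):
    if lr is not None:  # optionally override the learning rate for a branch
        for group in o.param_groups:
            group["lr"] = lr
    for _ in range(steps):
        loss = nn.functional.mse_loss(m(xs), ys)
        o.zero_grad()
        loss.backward()
        o.step()
    return loss.item()

# "Main" run up to the point where some behavior or issue shows up.
train_steps(model, opt, 200)
ckpt = {"model": copy.deepcopy(model.state_dict()),
        "opt": copy.deepcopy(opt.state_dict())}

def branch(lr):
    # Restore the snapshot and run a short, cheap continuation with a tweak.
    m = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
    m.load_state_dict(ckpt["model"])
    o = torch.optim.AdamW(m.parameters(), lr=lr)
    o.load_state_dict(ckpt["opt"])
    return train_steps(m, o, 50, lr=lr)

print("branch A (lr 1e-3):", branch(1e-3))
print("branch B (lr 3e-4):", branch(3e-4))
```

At real scale the branches would presumably still be expensive, but far cheaper than re-running the whole thing from scratch just to see how one change plays out.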
Presumably this list of “grand problems” may include some items like:
hallucinations
failures in reasoning on specific tasks
learning patterns which do not generalize well in new domains (learning by ‘rote’ instead of learning simpler underlying patterns which can be invoked usefully in a different distribution)
Possibly the “behaviors and issues” which occur “when you scale” include:
unforeseen differences between observed metrics and projected metrics (see the sketch after this list)
persistent failures to achieve lower loss on certain sections of the training data
tokens or sequences of tokens which cause degenerate behavior (not the human type) across some number of different contexts
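For the first of those bullets, here is the kind of check I imagine, with all numbers invented: compare the losses observed during a larger run against the projection from the small-scale fit, and flag checkpoints where the gap grows beyond some tolerance.

```python
import numpy as np

# Hypothetical fitted parameters from small-scale runs (same assumed form
# as the earlier sketch) and hypothetical losses observed during a big run.
def projected_loss(c, a=2.4, b=0.15, irreducible=1.5):
    return a * c ** (-b) + irreducible

checkpoints = np.array([1e3, 3e3, 1e4, 3e4, 1e5])        # relative compute so far
observed    = np.array([2.36, 2.23, 2.12, 2.05, 1.99])   # made-up loss values

tolerance = 0.03  # how far above projection we tolerate before flagging
for c, obs in zip(checkpoints, observed):
    gap = obs - projected_loss(c)
    status = "FLAG" if gap > tolerance else "ok"
    print(f"compute {c:8.0f}  observed {obs:.3f}  projected {projected_loss(c):.3f}  gap {gap:+.3f}  {status}")
```

In this made-up example the run tracks the projection early on and then starts drifting above it, which is roughly the shape of “unforeseen difference” I had in mind.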
Sholto continues:
Concretely, the barrier is a little bit of software engineering, having a code base that’s large and capable enough that it can support many people doing research at the same time often makes it complex. If you’re doing everything by yourself, your iteration pace is going to be much faster.
Actually operating with other people raises the complexity a lot, for natural reasons familiar to every software engineer and also the inherent running. Running and launching those experiments is easy but there’s inherent slowdowns induced by that. So you often want to be parallelizing multiple different streams. You can’t be totally focused on one thing necessarily. You might not have fast enough feedback cycles. And then intuiting what went wrong is actually really hard.
This seems to imply that these AI labs have put their finger on the problem that doing work in large teams/titled sub-projects introduces a lot of friction. This could just be Sholto’s take on the ideal way to run an AI lab, informed by AI labs not actually working this way, but I presume Google DeepMind, at least, has a culture where they try to keep individual researchers from grumbling too much about organizational overhead slowing down their projects. It seems to me that Sholto is right about it being much faster to do more in “parallel”, where individual researchers can work on these sub-problems without having to organize a meeting, submit paperwork, and write memos to three other teams to get access to relevant pieces of their work.
The trio continues to talk about the meta-level question, and the sections relevant to “what does AI research look like” return to being about as diffuse as you might expect in a conversation that is two-thirds AI researchers and focused on topics associated with AI research.
One other particular quote that may be relevant to people drawing some inferences—Dwarkesh asks:
That’s interesting to think about because at least the compute part is not bottlenecked on more intelligence, it’s just bottlenecked on Sam’s $7 trillion or whatever, right? If I gave you 10x the [TPUs] to run your experiments, how much more effective a researcher are you?
Sholto:
I think the Gemini program would probably be maybe five times faster with 10 times more compute or something like that.
Dwarkesh:
So that’s pretty good. Elasticity of 0.5. Wait, that’s insane.
Sholto:
I think more compute would just directly convert into progress.
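An aside of mine, not something from the interview: I read Dwarkesh’s “0.5” as the simple ratio of the two multipliers. If you instead compute a standard log-log elasticity from the same figures, it comes out somewhat higher:

```latex
% "Five times faster with ten times more compute"
\text{ratio of multipliers: } \frac{5}{10} = 0.5
\qquad
\text{log-log elasticity: } \frac{\Delta \ln(\text{progress})}{\Delta \ln(\text{compute})}
  = \frac{\ln 5}{\ln 10} \approx 0.70
```

Either way the qualitative point is the same: on Sholto’s numbers, extra compute converts into research progress at a surprisingly high rate.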
Dwarkesh goes on to ask why labs aren’t reallocating some of the compute they have from running large runs/serving clients to doing experiments if this is such a massive bottleneck.
Sholto replies:
So one of the strategic decisions that every pre-training team has to make is exactly what amount of compute do you allocate to different training runs, to your research program versus scaling the last best thing that you landed on. They’re all trying to arrive at an optimal point here. One of the reasons why you need to still keep training big models is that you get information there that you don’t get otherwise. So scale has all these emergent properties which you want to understand better.
Remember what I said before about not being sure what’s going to fall off the curve. If you keep doing research in this regime and keep on getting more and more compute efficient, you may have actually gone off the path to actually eventually scale. So you need to constantly be investing in doing big runs too, at the frontier of what you sort of expect to work.
What does this actual breakdown look like within DeepMind? Well, obviously Sholto doesn’t give us details about that. If you get actual first-hand details about the allocation of compute budgets from this question, I’d be rather surprised...
Well, actually, not terribly surprised. These are modern AI labs, not Eliezer’s fantasy-football AI lab from Six Dimensions Of Operational Adequacy. They may just DM you with a more detailed breakdown of what stuff looks like on the inside. I doubt someone will answer publicly in a way that could be tied back to them. That would probably breach a bunch of clauses on a bunch of contracts and get them in actual serious trouble.
What do I infer from this?
Well, first, you can watch the interview and pick up the rhythm. When I’ve done that, I get the impression that there are some relatively independent researchers who work under the umbrella of departments which have some amount of compute budgeted to them. It seems likely to me that this compute is not budgeted as strictly as something like time slots on orbital telescopes, such that an individual researcher can have a brilliant idea one day and just go try it using some very small fraction of their organization’s compute for a short period of time. I think there is probably a threshold of experiment size above which you’re going to need a strong case, and you will have to make that case to those involved in compute budgeting in order to get the compute-time to do experiments at that scale.
Does that level of friction around the compute available to individual researchers account for the “0.5 elasticity” that Sholto was talking about? I’m not sure. Plausibly there is no “do whatever you want with this” compute budget for individual researchers beyond what they have plugged into their individual workstations. This would surprise me, I think? It seems like a dumb decision when you take the picture Sholto was sketching of how progress gets made at face value. Still, it seems like a characteristic dumb decision for a large organization, where they try really hard to have every resource expenditure accounted for ahead of time, such that intangibles like “the ability to just go try stuff” get squashed by considerations like “are we utilizing all of our resources with maximum efficiency?”
Hopefully this interview and my analysis are helpful in answering this question. I could probably discuss more, but I’ve noticed this comment is already rather long, and my brain is telling me that further writing would likely just be meandering and hand-waving.
If there is more content relevant to this discussion to be mined from this interview, perhaps others will be able to iterate on my attempt and help flesh out the parts which seem easy to update our models on.
Dwarkesh’s interview with Sholto sounds well worth watching in full, but the segments you’ve highlighted and your analyses are very helpful on their own. Thanks for the time and thought you put into this comment!
Thanks! It’s no problem :)

Agreed that the interview is worth watching in full for those interested in the topic. I don’t think it answers your question in full detail, unless I’ve forgotten something they said—but it is evidence.
(Edit: Dwarkesh also posts full transcripts of his interviews to his website. They aren’t obviously machine-transcribed or anything, more like what you’d expect from a transcribed interview in a news publication. You’ll lose some body language/tone details from the video interview, but may be worth it for some people, since most can probably read the whole thing in less time than just watching the interview at normal speed.)
Most of this seems to be subsumed in the general question of how you do research, and there’s lots of advice, but it’s (ironically) not at all a science. From my limited understanding of what goes on in the research groups inside these companies, it’s a combination of research intuition, small-scale testing, checking with others and discussing the new approach, validating your ideas, and getting buy-in from people higher up that it’s worth your and their time to try the new idea. Which is the same as research generally.
At that point, I’ll speculate and assume whatever idea they have is validated in smaller but still relatively large settings. For things like sample efficiency, they might, say, train a GPT-3-sized model, which now costs only a fraction of the researcher’s salary to do. (Yes, I’m sure they all have very large compute budgets for their research.) If the results are still impressive, I’m sure there is lots more discussion and testing before actually using the method in training the next round of frontier models that cost huge amounts of money—and those decisions are ultimately made by the teams building those models, and management.
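As a rough back-of-the-envelope companion to the “GPT-3 size model” point, here is a sketch using the common approximation that training a dense transformer takes about 6 × parameters × tokens FLOPs. The throughput, utilization, and price figures are assumptions picked purely for illustration, not numbers from this thread:

```python
# Back-of-the-envelope training cost, using the common approximation that
# training a dense transformer takes roughly 6 * params * tokens FLOPs.

def training_cost(params, tokens, flops_per_gpu_per_s, utilization, usd_per_gpu_hour):
    total_flops = 6 * params * tokens
    gpu_seconds = total_flops / (flops_per_gpu_per_s * utilization)
    gpu_hours = gpu_seconds / 3600
    return total_flops, gpu_hours, gpu_hours * usd_per_gpu_hour

# GPT-3-ish scale: 175B parameters trained on 300B tokens.
flops, gpu_hours, usd = training_cost(
    params=175e9,
    tokens=300e9,
    flops_per_gpu_per_s=1e15,  # assumed ~1 PFLOP/s peak per accelerator (illustrative)
    utilization=0.4,           # assumed fraction of peak actually achieved
    usd_per_gpu_hour=2.0,      # assumed rental-style price, purely illustrative
)
print(f"~{flops:.2e} FLOPs, ~{gpu_hours:,.0f} GPU-hours, ~${usd:,.0f}")
```

Whether that lands at “a fraction of a researcher’s salary” depends entirely on the hardware and prices you plug in, but it gives a feel for the order of magnitude of one such validation run.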
Thanks for this answer! Interesting. It sounds like the process may be less systematized than how I imagined it to be.