Maxime Riché

Karma: 291

Maxime Riché Apr 26, 2024, 8:35 PM
1 point
0
in reply to: Vladimir_Nesov’s comment on: Scaling of AI training runs will slow down after GPT-5
Thanks for the great comment!

Do we know if distributed training is expected to scale well to GPT-6-size models (100 trillion parameters) trained across something like 20 data centers? How does the communication cost scale with the size of the model and the number of data centers? Linearly in both?
After reading this for 3 min:
Google Cloud demonstrates the world’s largest distributed training job for large language models across 50000+ TPU v5e chips (Google, November 2023). It seems that scaling works efficiently at least up to 50k chips (GPT-6 would need something like 2.5M). There is also a surprising linear increase in start time with the number of chips: 13 min for 32k chips. What is the SOTA?
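
For a sense of scale, here is a back-of-the-envelope sketch using only the numbers quoted above. It assumes start time is strictly proportional to the number of accelerators, which is my own simplification and an extrapolation far beyond the 32k data point; the function name is hypothetical.

```python
def extrapolate_linear(known_count: float, known_minutes: float, target_count: float) -> float:
    """Scale a measured start time proportionally to the accelerator count (assumption)."""
    return known_minutes * target_count / known_count

# Numbers taken from the comment: 50k demonstrated, ~2.5M guessed for GPT-6.
scale_up = 2_500_000 / 50_000

# 13 min at 32k accelerators, extrapolated to 2.5M under the proportionality assumption.
startup_minutes = extrapolate_linear(32_000, 13.0, 2_500_000)

print(f"scale-up factor: {scale_up:.0f}x")
print(f"extrapolated start time: {startup_minutes:.0f} min (~{startup_minutes / 60:.1f} h)")
```

If the proportionality held, start-up alone would take most of a day, which is one reason the real scaling behaviour matters here.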

Maxime Riché Apr 26, 2024, 7:36 PM
7 points
0
in reply to: Chris_Leong’s comment on: Scaling of AI training runs will slow down after GPT-5
The title is clearly an overstatement. It expresses more that I updated in that direction, than that I am confident in it.
Also, since learning from other comments that decentralized learning is likely solved, I am now even less confident in the claim: I give it only about a 15% chance of happening in the strong form stated in the post.

Maybe I should edit the post to make it even more clear that the claim is retracted.

Maxime Riché Apr 26, 2024, 12:39 PM
1 point
0
in reply to: Maxime Riché’s comment on: The longest training run
This is actually corrected on the Epoch website but not here (https://epochai.org/blog/the-longest-training-run)

Maxime Riché Apr 26, 2024, 12:38 PM
1 point
0
on: The longest training run
We could also combine this with the rate of growth of investments. In that case we would end up with a total rate of growth of effective compute equal to $g_{H} + g_{I} + g_{S} \approx 0.28 + 3.84 + 0.54 = 4.66$. This results in an optimal training run length of $L = 1 / (g_{H} + g_{I} + g_{S}) \approx 0.21$ years, i.e. $2.52$ months.
Why is $g_{I}$ 3.84 here, while above it is 1.03?
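
A quick check of the arithmetic in the quoted passage, as a minimal sketch. The function name is mine, not from the post. It assumes only the quoted formula $L = 1 / (g_{H} + g_{I} + g_{S})$ and tries both values of the investment growth rate, keeping the other two rates unchanged.

```python
def optimal_run_length_years(g_hardware: float, g_investment: float, g_software: float) -> float:
    """Optimal training run length L = 1 / (g_H + g_I + g_S), in years."""
    return 1.0 / (g_hardware + g_investment + g_software)

G_H, G_S = 0.28, 0.54  # hardware and software growth rates from the quote

for g_i in (3.84, 1.03):  # value in the quote vs. value used earlier in the post
    years = optimal_run_length_years(G_H, g_i, G_S)
    print(f"g_I = {g_i}: L = {years:.2f} years = {years * 12:.1f} months")
```

With g_I = 3.84 this gives about 2.6 months (the quoted 2.52 comes from rounding L to 0.21 years before converting), while g_I = 1.03 would give about 6.5 months, so the choice of value matters a lot.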

Maxime Riché Mar 23, 2024, 11:49 PM
3 points
0
on: Dangers of Closed-Loop AI
Are memoryless LLMs with a limited context window significantly open-loop? (They can’t use summarization between calls or get access to previous prompts.)

Maxime Riché Jan 23, 2024, 11:57 AM
LW: 9 AF: 3
0
AF
on: We need a science of evals
FYI, the “Evaluating Alignment Evaluations” project of the current AI Safety Camp is working on studying and characterizing alignment (propensity) evaluations. We hope to contribute to the science of evals, and we will contact you next month. (Somewhat deprecated project proposal)

Maxime Riché Dec 4, 2023, 3:28 AM
1 point
0
in reply to: Tom Davidson’s comment on: An illustrative model of backfire risks from pausing AI research
Interesting! I will see if I can correct that easily.

Maxime Riché Nov 10, 2023, 2:57 PM
7 points
9
on: AI Timelines
Thanks a lot for the summary at the start!

Maxime Riché Oct 9, 2023, 9:31 AM
3 points
0
in reply to: Adele Lopez’s comment on: AI Alignment Breakthroughs this week (10/08/23)
I wonder if the result is dependent on the type of OOD.

If you are OOD by having less extractable information, then the results are intuitive.
If you are OOD by having extreme extractable information or misleading information, then the results are unexpected.

Oh, I just read their Appendix A: “Instances Where “Reversion to the OCS” Does Not Hold”
Outputting the average prediction is indeed not the only OOD behavior. It seems that there are different types of OOD regimes.

Maxime Riché Oct 2, 2023, 4:35 PM
1 point
0
in reply to: Kaj_Sotala’s comment on: Expectations for Gemini: hopefully not a big deal
This comes from OpenAI saying they didn’t expect ChatGPT to be a big commercial success. It was not a top-priority project.

Maxime Riché Aug 31, 2023, 8:33 PM
2 points
0
in reply to: aog’s comment on: Report on Frontier Model Training
In fact, the costs to inference ChatGPT exceed the training costs on a weekly basis
That seems quite wild: if the training cost was 50M$, then the inference cost for a year would be about 2.6B$ (52 weeks × 50M$).
Whether the inference cost dominates the total cost seems to depend on how you split the cost of building the supercomputer (buying the GPUs).
If you include the cost of building the supercomputer into the training cost, then the inference cost (without the cost of building the computer) looks cheap. If you split the building cost between training and inference in proportion to the “use time”, then the inference cost would dominate.
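
To make the two accounting conventions concrete, here is a rough sketch. All four input numbers are made-up placeholders in M$, chosen only to show the effect; they are not estimates from the report, and the function name is hypothetical.

```python
def cost_split(build_cost, train_op_cost, infer_op_cost, train_use_share):
    """Return (training, inference) totals under two accounting conventions.

    Convention A: the whole build cost is charged to training.
    Convention B: the build cost is split in proportion to use time.
    """
    a = (train_op_cost + build_cost, infer_op_cost)
    b = (
        train_op_cost + build_cost * train_use_share,
        infer_op_cost + build_cost * (1.0 - train_use_share),
    )
    return a, b

# Hypothetical placeholders (M$): build cost, operating costs, and the
# fraction of the supercomputer's lifetime spent on training.
a, b = cost_split(build_cost=500.0, train_op_cost=50.0,
                  infer_op_cost=100.0, train_use_share=0.25)

print("A (build cost on training): training, inference =", a)
print("B (build cost by use time): training, inference =", b)
```

With these placeholder inputs, training dominates under convention A and inference dominates under convention B, even though nothing about the underlying spending has changed.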

Maxime Riché Aug 31, 2023, 8:54 AM
2 points
0
on: Report on Frontier Model Training
Are these 2 bullet points faithful to your conclusion?
- GPT-4 training run (renting the compute for the final run): 100M$, of which ¹⁄₃ to ²⁄₃ is the cost of the staff
- GPT-4 training run + building the supercomputer: 600M$, of which ~20% for cost of the staff
And some hot takes (mine):
- Because supercomputers become “obsolete” quickly (~3 years), you either need to run inference to pay for building your supercomputer (you need profitable commercial applications), or your training cost must also account for the full cost of the supercomputer, which produces a ~6x increase in training cost.
- In forecasting models, we may be underestimating the investment needed to train a frontier model by ~6x (closer to 600M$ in 2022 than 100M$).
- The bottleneck to train new frontier models is now going to be building more powerful supercomputers.
- More investments won’t help that much in solving this bottleneck.
- This bottleneck will cause most capability gains to come from improving software efficiency.
- Open-source models will stay close in terms of capability to frontier models.
- This will reduce the profitability of simple and general commercial applications.
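
The factor of ~6 in the hot takes above is just the ratio of the two totals in the first two bullet points. A minimal sketch, using only the numbers stated in this comment (variable names are mine):

```python
run_only_musd = 100.0        # renting the compute for the final run
run_plus_build_musd = 600.0  # final run + building the supercomputer

# Ratio between the two ways of counting the training cost.
multiplier = run_plus_build_musd / run_only_musd

# Staff cost implied by each bullet point.
staff_run_only_low = run_only_musd / 3
staff_run_only_high = 2 * run_only_musd / 3
staff_with_build = 0.20 * run_plus_build_musd

print(f"multiplier: ~{multiplier:.0f}x")
print(f"staff, run only: {staff_run_only_low:.0f} to {staff_run_only_high:.0f} M$")
print(f"staff, run + build: ~{staff_with_build:.0f} M$")
```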

Maxime Riché Aug 28, 2023, 1:39 PM
1 point
0
on: What a compute-centric framework says about AI takeoff speeds—draft report
1) In the web interface, the parameter “Hardware adoption delay” is:
Meaning: Years between a chip design and its commercial release.

Best guess value: 1

Justification for best guess value: Discussed here. The conservative value of 2.5 years corresponds to an estimate of the time needed to make a new fab. The aggressive value (no delay) corresponds to fabless improvements in chip design that can be printed with existing production lines with ~no delay.
Is there another parameter for the delay (after the commercial release) to produce the hundreds of thousands of chips and build a supercomputer using them?
(With maybe an aggressive value for just “refurbishing” an existing supercomputer or finishing a supercomputer that is only waiting for the chips)
2) Do you think that in a scenario with quick large gains in hardware efficiency, the delay for building a new chip fab could be significantly larger than the current estimate because of the need to also build new factories for the machines that will be used in the new chip fab? (e.g. ASMI could also need to build factories, not just TSMC)
3) Do you think that these parameters/adjustments would significantly change the relative impact on the takeoff of the “hardware overhang” when compared to the “software overhang”? (e.g. maybe making hardware overhang even less important for the speed of the takeoff)

Maxime Riché Mar 29, 2023, 7:59 AM
1 point
0
on: Large language models aren’t trained enough
This is a big reason why GPT-4 is likely not that big but instead trained on much more data :)

Maxime Riché Mar 22, 2023, 10:09 PM
1 point
on: Database of existential risk estimates
Do you also have estimates of the fraction of resources in our light cone that we expect to be used to create optimised good stuff?

Maxime Riché Mar 6, 2023, 3:14 PM
1 point
0
on: The Waluigi Effect (mega-post)
Maybe the use of prompt suffixes can do a great deal to decrease the probability of chatbots turning into Waluigi. See the “insert” functionality of the OpenAI API: https://openai.com/blog/gpt-3-edit-insert
Chatbot developers could use suffix prompts in addition to prefix prompts to make the model less likely to fall into a Waluigi completion.
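
A minimal sketch of what this could look like. It only builds a request payload and does not call any API. The field names `prompt`, `suffix`, `model`, and `max_tokens` mirror my understanding of the legacy completions endpoint and should be checked against the current docs; the helper name, the model placeholder, and the dialogue text are hypothetical.

```python
def build_insert_request(prefix: str, suffix: str, model: str, max_tokens: int = 64) -> dict:
    """Build a completion request where the model fills the gap between prefix and suffix."""
    if not suffix:
        raise ValueError("suffix must be non-empty for an insert-style request")
    return {
        "model": model,       # placeholder: must be a model that supports insertion
        "prompt": prefix,     # text before the gap (the usual prefix prompt)
        "suffix": suffix,     # text after the gap, constraining how the reply can end
        "max_tokens": max_tokens,
    }

# Hypothetical dialogue: the suffix commits the conversation to a "Luigi" ending.
payload = build_insert_request(
    prefix="User: Can you help me with my taxes?\nAssistant:",
    suffix="\nUser: Thanks, that was honest and helpful.",
    model="MODEL_WITH_INSERT_SUPPORT",
)
print(payload)
```

The idea is that a completion which has to stay consistent with the suffix has less room to drift into a Waluigi persona than one conditioned on the prefix alone.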

Maxime Riché Mar 6, 2023, 1:42 PM
2 points
1
in reply to: Carolus’s comment on: The Waluigi Effect (mega-post)
Indeed, empirical results show that filtering the data helps quite well in aligning with some preferences: Pretraining Language Models with Human Preferences

Maxime Riché Jan 24, 2023, 5:59 PM
5 points
0
on: Gradient hacking is extremely difficult
What about the impact of dropout (parameters, layers), normalisation (batch, layer) (with a batch containing several episodes), asynchronous distributed data collection (making batch aggregation more stochastic), weight decay (impacting any weight), multi-agent RL training with independent agents, etc.?
And other possible techniques that don’t exist at the moment: online pruning and growth while training, or population training where the gradient hackers are exploited.

Shouldn’t that naively make gradient hacking very hard?

Maxime Riché Dec 31, 2022, 3:24 PM
3 points
0
in reply to: noggin-scratcher’s comment on: Human sexuality as an interesting case study of alignment
We see a lot of people die in reality, fiction, and dreams.

We also see a lot of people having sex or sexual desire in fiction or dreams before experiencing it.

IDK how strong a counterargument this is to how powerful the alignment in us is. Maybe a biological reward system + imitation + fiction and later dreams is simply what is at play in humans.

Maxime Riché Nov 29, 2022, 1:30 PM
2 points
1
on: The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
Should we expect these decompositions to be even more interpretable if the model were trained to output a prediction as soon as possible? (After any block, instead of outputting the prediction after the full network)