Maxime Riché

Karma: 291

Maxime Riché Mar 20, 2025, 2:41 PM
3 points
0
in reply to: mishka’s comment on: Longtermist Implications of the Existence Neutrality Hypothesis
The implications are stronger in that case right.
The post is about implications for impartial longtermists. So either under moral realism it means something like finding the best values to pursue. And under moral anti realism it means that an impartial utility function is kind of symmetrical with aliens. For example if you value something only because humans value it, then an impartial version is to also value things that alien value only because their species value it.
Though because of reasons introduced in The Convergent Path to the Stars, I think that these implications are also relevant for non-impartial longtermists.

Maxime Riché Mar 7, 2025, 1:37 PM
1 point
0
on: Maxime Riché′s Shortform
Truth-seeking AIs by default? One hope for alignment by default is that AI developers may have to train their models to be truth-seeking to be able to make them contribute to scientific and technological progress, including RSI. Truth-seeking about the world model may generalize to truth-seeking for moral values, as observed in humans, and that’s an important meta-value guiding moral values towards alignment.
In humans, truth-seeking is maybe pushed back from being a revealed preference at work to being a stated preference outside of work, because of status competitions and fighting for resources. For early artificial researchers, they may not have the same selection pressures. Their moral values may focus on working alone (truth-seeking trend), not on replicating via competing for resources. Artifical researchers won’t be selected because they are able to acquire resources, they will be selected by AI developers because they are the best at achieving technical progress, which includes being truth-seeking.

Maxime Riché Jan 29, 2025, 8:36 AM
3 points
0
in reply to: Vladimir_Nesov’s comment on: Maxime Riché′s Shortform
Thanks for your corrections, that’s welcome

> 32B active parameters instead of likely ~220B for GPT4 ⇒ 6.8x lower training … cost
Doesn’t follow, training cost scales with the number of training tokens. In this case DeepSeek-V3 uses maybe 1.5x-2x more tokens than original GPT-4.
Each of the points above is a relative comparison with more or less everything else kept constant. In this bullet point, by “training cost”, I mostly had in mind “training cost per token”:
- 32B active parameters instead of likely ~ ~~220~~ 280B for GPT4 ⇒ ~~6.8~~ 8.7x lower training cost per token.
If this wasn’t an issue, why not 8B active parameters, or 1M active parameters?
From what I remember, the training-compute optimal number of experts was like 64, given implementations a few years old (I don’t remember how many activated at the same time in this old paper). Given newer implementations and aiming for inference-compute optimality, it seems logical that more than 64 experts could be great.
You still train on every token.
Right, that’s why I wrote: “possibly 4x fewer training steps for the same number of tokens if predicting tokens only once” (assuming predicting 4 tokens at a time), but that’s not demonstrated nor published (given my limited knowledge on this).

Maxime Riché Jan 28, 2025, 8:48 PM
3 points
−4
on: Maxime Riché′s Shortform
Simple reasons for DeepSeek V3 and R1 efficiencies:
- 32B active parameters instead of likely ~220B for GPT4 ⇒ 6.8x lower training and inference cost
- 8bits training instead of 16bits ⇒ 4x lower training cost
- No margin on commercial inference ⇒ ?x maybe 3x
- Multi-token training ⇒ ~2x training efficiency, ~3x inference efficiency, and lower inference latency by baking in “predictive decoding’, possibly 4x fewer training steps for the same number of tokens if predicting tokens only once
- And additional cost savings from memory optimization, especially for long contexts ( Multi Head Latent Attention) ⇒ ?x
Nothing is very surprising (maybe the last bullet point for me because I know less about it).
The surprising part is why big AI labs were not pursuing these obvious strategies.
Int8 was obvious, the multi-token prediction was obvious, and more and smaller experts in MoE were obvious. All three have already been demonstrated and published in the literature. May be bottlenecked by communication, GPU usage, and memory for the largest models.

Maxime Riché Dec 24, 2024, 12:30 AM
1 point
0
in reply to: habryka’s comment on: leogao’s Shortform
It seems that your point applies significantly more to “zero-sum markets”. So it may be good to notice it may not apply for altruistic people when non-instrumentally working on AI safety.

Maxime Riché Dec 21, 2024, 5:53 PM
2 points
1
on: Alignment Faking in Large Language Models
Models trained for HHH are likely not trained to be corrigible. Models should be trained to be corrigible too in addition to other propensities.

Corrigibility may be included in Helpfulness (alone) but when adding Harmlessness then Corrigibility conditional on being changed to be harmful is removed. So the result is not that surprising from that point of view.

Maxime Riché Nov 13, 2024, 11:10 PM
3 points
−6
on: Is Deep Learning Actually Hitting a Wall? Evaluating Ilya Sutskever’s Recent Claims
People may be blind to the fact that improvements from gpt2 to 3 to 4 were both driven by scaling training compute (by 2 OOM between each generation) and (the hidden part) by scaling test compute through long context and CoT (like 1.5-2 OOM between each generations too).

If gpt5 uses just 2 OOM more training compute than gpt4 but the same test compute, then we should not expect “similar” gains, we should expect “half”.

O1 may use 2 OOM more test compute than gpt4. So gpt4=>O1+gpt5 could be expected to be similar to gpt3=>gpt4

Maxime Riché Oct 21, 2024, 10:59 AM
1 point
0
on: Maxime Riché′s Shortform
Speculations on (near) Out-Of-Distribution (OOD) regimes
- [Absence of extractable information] The model can no longer extract any relevant information. Models may behave more and more similarly to their baseline behavior in this regime. Models may learn the heuristic to ignore uninformative data, and this heuristic may generalize pretty far. Publication supporting this regime: Deep Neural Networks Tend To Extrapolate Predictably
- [Extreme information] The model can still extract information, but the features extracted are becoming extreme in value (“extreme” = range never seen during training). Models may keep behaving in the same way as “at the In-Distrubution (ID) border”. Models may learn the heuristic that for extreme inputs, you should keep behaving as if you were still in the embedding same direction but still ID.
- [Inner OOD] The model observes a mix of features-values that it never saw during training, but none of these features-values are by themselves OOD. For example, the input is located between two populated planes. Models may learn the heuristic to use a (mixed) policy composed of closest ID behaviors.
- [Far/Disrupting OOD]: This happens in one of the other three regimes when the inputs break the OOD heuristics learned by the model. These can be found by adversarial search or by moving extremely OOD.
- [Fine-Tuning (FT) or Jailbreaking OOD] The inference distribution is OOD of the distribution during the FT. The model then stops using heuristics defined during the FT and starts using those learned during pretraining (the inference is still ID with respect to the pretraining distribution).

Maxime Riché Sep 29, 2024, 7:04 AM
3 points
0
in reply to: Vladimir_Nesov’s comment on: COT Scaling implies slower takeoff speeds

If it takes a human 1 month to solve a difficult problem, it seems unlikely that a less capable human who can’t solve it within 20 years of effort can still succeed in 40 years

Since the scaling is logarithmic, your example seems to be a strawman.

The real claim debated is more something like:

“If it takes a human 1 month to solve a difficult problem, it seems unlikely that a less capable human who can’t solve it within 100 months of effort can still succeed in 10 000 months” And this formulation doesn’t seem obviously true.

Maxime Riché Sep 12, 2024, 1:08 PM
4 points
0
on: Grokking, memorization, and generalization — a discussion
Ten months later, which papers would you recommend for SOTA explanations of how generalisation works?

From my quick research:
- “Explaining grokking through circuit efficiency” seems great at explaining and describing grokking
- “Unified View of Grokking, Double Descent and Emergent Abilities: A Comprehensive Study on Algorithm Task” proposes a plausible unified view of grokking and double descent (and a guess at a link with emergent capabilities and multi-task training). I especially like their summary plot:

Maxime Riché Sep 5, 2024, 9:56 AM
2 points
1
on: The Fragility of Life Hypothesis and the Evolution of Cooperation
For information to the readers and author: I am (independently) working on a project about narrowing down the moral values of alien civilizations on the verge of creating an ASI and becoming space-faring. The goal is to inform the prioritization of longtermist interventions.

I will gladly build on your content, which aggregates and beautifully expands several key mechanisms (individual selection (“Darwinian demon”), kin selection/multilevel selection (“Darwinian angel”), filters (“Fragility of Life Hypothesis)) that I use among others (e.g. sequential races, cultural evolution, accelerating growth stages, etc.).
Thanks for the post!

Maxime Riché Aug 13, 2024, 9:51 AM
8 points
2
in reply to: Richard_Ngo’s comment on: ricraz’s Shortform
If the following correlations are true, then the opposite may be true (slave morality being better for improving the world through history):
- Improving the world being strongly correlated with economic growth (this is probably less true when X-risk are significant)
- Economic growth being strongly correlated with Entrepreneurship incentives (property rights, autonomy, fairness, meritocracy, low rents)
- Master morality being strongly correlated with acquiring power and thus decreasing the power of others and decreasing their entrepreneurship incentives

Maxime Riché Jul 5, 2024, 3:48 PM
1 point
0
in reply to: ChristianKl’s comment on: Maxime Riché′s Shortform
Right 👍

So the effects are:

Effects that should increase Anthropic’s salaries relative to OpenAI: A) - The pool of AI safety focused candidates is smaller B) - AI safety focused candidates are more motivated

Effects that should decrease Anthropic’s salaries relative to OpenAI: C) - AI safety focused candidates should be willing to accept significantly lower wages

New notes: (B) and (C) could cancel each other but that would be a bit suspicious. Still a partial cancellation would make a difference between OpenAI and Anthropic lower and harder to properly observe. (B) May have a small effect, given that hires are already world level talents, it would be weird that they could significantly increase even more their performance by simply being more motivated. I.e. non AI safety focused candidates are also very motivated. The difference in motivation between both groups is possibly not large.

Maxime Riché Jul 5, 2024, 3:39 PM
1 point
0
in reply to: Violet Hour’s comment on: Maxime Riché′s Shortform
These forecasts are about the order under which functionalities see a jump in their generalization (how far OOD they work well).

By “Generalisable xxx” I meant the form of the functionality xxx that generalize far.

Maxime Riché Jul 5, 2024, 1:47 PM
2 points
0
on: Maxime Riché′s Shortform
Rambling about Forecasting the order in which functions are learned by NN

Idea:
Using function complexity and their “compoundness” (edit 11 september: these functions seem to be called “composite functions”), we may be able to forecast the order in which algorithms in NN are learned. And we may be able to forecast the temporal ordering of when some functions or behaviours will start generalising strongly.

Rambling:
What happens when training neural networks is similar to the selection of genes in genomes or any reinforcement optimization processes. Compound functions are much harder to learn. You need each part to be independently useful initially to provide enough signal for the compound system to be reinforced.

That means that learning any non-hardcoded algorithms with many variables and multiplicative steps is very difficult.
An important factor in this is the frequency at which an algorithm is useful and to which extent. An algorithm that can be very used in most situations will get much more training signals. The relative strength of the reward signal you get is important because of the noise in the training and because of catastrophic forgetting.

LLMs are not learning complex algorithms yet. They are learning something like a world model because this is used for most tasks and it can be built by first building each part separately and then assembling them.

Regarding building algorithms to exploit this world model, it can be learned later if the algorithm is composed first of very simple algorithms that can be later assembled. An extra difficulty for LLMs to learn algorithms is in situations where heuristics already work very well. In that case, you need to add significant regularisation pushing for simpler circuits. Then you may observe ~~grokking~~ learning and a transition from heuristics to algorithms.

An issue with this reasoning is that heuristics are 1-step algorithms (0 compoundness).

Effects:
- Frequency of reward
- Strength of the additional reward (above the “heuristic baseline”)
- Compoundness

Forecasting game:
(WIP, mostly a failure at that point)

Early to generalize well:
World models can be built from simple parts, and are most of the time valuable.
Generalizable algorithm for simple and frequent tasks on which heuristics fail dramatically: ??? (maybe) generating random numbers, ??

Medium to generalize well:
Generalizable deceptive alignment algorithms: They require several components to work. But they are useful for many tasks. The strength of the additional reward is not especially high or low.
Generalizable instrumental convergence algorithms: Same as deceptive alignment.
Generalizable short horizon algorithms: They, by definition, require fewer sequential steps, as such they should be less “compounded” functions and appear sooner.

Late:
Generalizable long horizon algorithms: They, by definition, require more sequential steps, as such they should be more “compounded” functions and appear later.

The latest:
Generalizable long horizon narrow capabilities: They are not frequently reinforced.

(Time spent on this: 45min)
July 6th update:
Here is a quick experiment trying to observe the effect of increasing “compoundness” on the ordering of ~~grokking~~ learning different functions: https://colab.research.google.com/drive/1B85mfCkqyQZSl1JGbLr0r5BrAS8LYUr5?usp=sharing

Quick results:
The task is predicting the sign of the product of 1 (function 1) to 8 (function 8) standard normal random variables.
Increasing the compoundness by 2 seems to delay the ~~grokking~~ learning by something like 1 OOM.

Maxime Riché Jul 4, 2024, 9:02 AM
2 points
0
on: Maxime Riché′s Shortform
Will we get to GPT-5 and GPT-6 soon?

This is a straightforward “follow the trend” model which tries to forecast when GPT-N-equivalent models will be first trained and deployed up to 2030.

Baseline forecast:
GPT-4.7 GPT-5.3 GPT-5.8 GPT-6.3
Start of training 2024.4 2025.5 2026.5 2028.5
Deployment 2025.2 2026.8 2028 2030.2

Bullish forecast:
GPT-5 GPT-5.5 GPT-6 GPT-6.5
Start of training 2024.4 2025 2026.5 2028.5
Deployment 2025.2 2026.5 2028 2030

FWIW, it predicts roughly similar growth in model size, energy cost and GPU count than described in https://situational-awareness.ai/ while being created the week before this was released.

I spent like 10 hours on this, so I expect to find lingering mistakes in the model.

Maxime Riché Jun 26, 2024, 8:10 AM
3 points
−5
on: Maxime Riché′s Shortform
Could Anthropic face an OpenAI drama 2.0?
I forecast that Anthropic would likely face a similar backlash from its employees than OpenAI in case Anthropic’s executives were to knowingly decrease the value of Anthropic shares significantly. E.g. if they were to switch from “scaling as fast as possible” to “safety-constrained scaling”. In that case, I would not find it surprising that a significant fraction of Anthropic’s staff threatened to leave or leave the company.
The reasoning is simple, given that we don’t observe significant differences in the wages of OpenAI and Anthropic employees and assuming that they are overall of the same distribution of skill and skill level. Then it seems that Anthropic is not able to use the argument of its AI safety focus as a bargaining argument to reduce the wages significantly. If true this would mean that safety is of relatively little importance to most of Anthropic’s employees.
Counter argument: Anthropic is hiring from a much more restricted pool of candidates. From only the safety-concerned candidates. In that case, Anthropic would have to pay a premium to hire these people. And it happens that this premium is roughly equivalent to the discount that these employees are willing to give to Anthropic because of its safety focus.

Maxime Riché Jun 10, 2024, 10:16 AM
2 points
1
on: Maxime Riché′s Shortform
What is the difference between Evaluation, Characterization, Experiments, and Observations?
The words evaluations, experiments, characterizations, and observations are somewhat confused or confusingly used in discussions about model evaluations (e.g., ref, ref).
Let’s define them more clearly:
- Observations provide information about an object (including systems).
  - This information can be informative (allowing the observer to update its beliefs significantly), or not.
- Characterizations describe distinctive features of an object (including properties).
  - Characterizations are observations that are actively designed and controlled to study an object.
- Evaluations evaluate the quality of distinctive features based on normative criteria.
  - Evaluations are composed of both characterizations and normative criteria.
  - Evaluations are normative, they inform about what is good or bad, desirable or undesirable.
  - Normative criteria (or “evaluation criterion”) are the element bringing the normativity. They are most of the time directional or simple thresholds.
  - Evaluations include both characterizations of the object studied and characterization of the characterization technique used (e.g., accuracy of measurement).
- Scientific experiments test hypotheses through controlled manipulation of variables.
  - Scientific experiments are composed of: characterizations, and hypothesis
In summary:
- Observations
- Characterizations = Designed and controlled Observations
- Evaluations = Characterization of object + Characterization of the characterization method + Normative criteria
- Scientific experiments = Characterizations + Hypothesis
Examples:
- An observation is an event in which the observer receives information about the AI system.
  - E.g., you read a completion returned by a model.
- A characterization is a tool or process used to describe an AI system.
  - E.g., you can characterize the latency of an AI system by measuring it. You can characterize how often a model is correct (without specifying that correctness is the goal).
- An AI system evaluation will associate characterizations and normative criteria to conclude about the quality of the AI system on the dimensions evaluated.
  - E.g., alignment evaluations use characterizations of models and the normative criteria of the alignment with X (e.g., humanity) to conclude on how well the model is aligned with X.
- An experiment will associate hypotheses, interventions, and finally characterizations to conclude on the veracity of the hypotheses about the AI system.
  - E.g., you can change the training algorithm and measure the impact using characterization techniques.
Clash of usage and definition:
These definitions slightly clash with the usage of the term evals or evaluations in the AI community. Regularly the normative criteria associated with an evaluation are not explicitly defined, and the focus is solely put on the characterizations included in the evaluation.

(Produced as part of the AI Safety Camp, within the project: Evaluating alignment evaluations)
What links here?

Maxime Riché May 15, 2024, 10:29 AM
2 points
2
in reply to: ektimo’s comment on: How to do conceptual research: Case study interview with Caspar Oesterheld
Likely: Path To Impact

Maxime Riché Apr 29, 2024, 8:49 AM
2 points
0
on: Refusal in LLMs is mediated by a single direction
Interestingly, after a certain layer, the first principle component becomes identical to the mean difference between harmful and harmless activations.
Do you think this can be interpreted as the model having its focus entirely on “refusing to answer” from layer 15 onwards? And if it can be interpreted as the model not evaluating other potential moves/choices coherently over these layers. The idea is that it could be evaluating other moves in a single layer (after layer 15) but not over several layers since the residual stream is not updated significantly.
Especially can we interpret that as the model not thinking coherently over several layers about other policies, it could choose (e.g., deceptive policies like defecting from the policy of “refusing to answer”)? I wonder if we would observe something different if the model was trained to defect from this policy conditional on some hard-to-predict trigger (e.g. whether the model is in training or deployment).

	`GPT-4.7`	`GPT-5.3`	`GPT-5.8`	`GPT-6.3`
Start of training	`2024.4`	`2025.5`	`2026.5`	`2028.5`
Deployment	`2025.2`	`2026.8`	`2028`	`2030.2`

	`GPT-5`	`GPT-5.5`	`GPT-6`	`GPT-6.5`
Start of training	`2024.4`	`2025`	`2026.5`	`2028.5`
Deployment	`2025.2`	`2026.5`	`2028`	`2030`

Maxime Riché

What is the difference between Evaluation, Characterization, Experiments, and Observations?