Non-temporal displacement model
Maybe rename this to “Simultaneous displacement model” or “Displacement model under simultaneous appearance”
Density of Actual SFC
Add legends to the plots.
Maybe only keep the plot for the density of potential SFCs, at the start. Or merge both plots (using two Y axes).
Little progress has been made on this question. Most discussions stop after the following arguments:
Move these quotes to an external document or an appendix?
Existing discussions
Move this section to an external document or an appendix.
Thanks for your corrections, they’re welcome!
> 32B active parameters instead of likely ~220B for GPT4 ⇒ 6.8x lower training … cost
> Doesn’t follow, training cost scales with the number of training tokens. In this case DeepSeek-V3 uses maybe 1.5x-2x more tokens than original GPT-4.
Each of the points above is a relative comparison with more or less everything else kept constant. In this bullet point, by “training cost”, I mostly had in mind “training cost per token”:
32B active parameters instead of likely ~280B (revised from ~220B) for GPT-4 ⇒ 8.7x (revised from 6.8x) lower training cost per token.
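For reference, the revised ratio just comes from treating training cost per token as roughly proportional to the number of active parameters (ignoring attention, embeddings, and other costs that don't scale the same way):

$$\frac{280\text{B active (GPT-4, estimated)}}{32\text{B active (DeepSeek-V3)}} \approx 8.75$$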
If this wasn’t an issue, why not 8B active parameters, or 1M active parameters?
From what I remember, the training-compute-optimal number of experts was around 64, given implementations from a few years ago (I don’t remember how many were activated at the same time in that old paper). Given newer implementations, and aiming for inference-compute optimality, it seems logical that more than 64 experts could be a good choice.
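To make the total-vs-active parameter trade-off concrete, here is a minimal sketch counting only the FFN/expert parameters; all configurations below are hypothetical round numbers, not the actual GPT-4 or DeepSeek-V3 configs:

```python
# Rough parameter counts for the expert (FFN) part of a hypothetical MoE Transformer.
# Attention, embeddings, and shared experts are ignored; numbers are illustrative only.

def moe_param_counts(d_model: int, d_ff: int, n_experts: int, top_k: int, n_layers: int):
    per_expert = 2 * d_model * d_ff                # two weight matrices per expert FFN
    total = n_layers * n_experts * per_expert      # parameters stored
    active = n_layers * top_k * per_expert         # parameters used per token
    return total, active

# Going from 64 to 256 experts at the same top-k grows total but not active parameters.
for n_experts, top_k in [(8, 2), (64, 8), (256, 8)]:
    total, active = moe_param_counts(d_model=4096, d_ff=11008,
                                     n_experts=n_experts, top_k=top_k, n_layers=60)
    print(f"{n_experts:3d} experts, top-{top_k}: "
          f"total ~{total / 1e9:6.1f}B, active ~{active / 1e9:5.1f}B per token")
```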
> You still train on every token.
Right, that’s why I wrote “possibly 4x fewer training steps for the same number of tokens if predicting tokens only once” (assuming predicting 4 tokens at a time), but that’s neither demonstrated nor published (to my limited knowledge).
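A back-of-the-envelope version of that caveat, with purely illustrative numbers (this is speculation, not something demonstrated in the DeepSeek papers):

```python
# If each position predicts the next 4 tokens, and each token only needs to serve as a
# prediction target once, the same token budget could in principle be covered in ~4x
# fewer optimizer steps. Numbers below are placeholders, not real training configs.
total_tokens = 1e13          # hypothetical training-token budget
tokens_per_step = 4e6        # hypothetical tokens consumed per optimizer step
n_predicted = 4              # tokens predicted per position

steps_next_token_only = total_tokens / tokens_per_step
steps_if_targets_used_once = steps_next_token_only / n_predicted
print(f"{steps_next_token_only:.1e} steps -> {steps_if_targets_used_once:.1e} steps")
```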
Simple reasons for DeepSeek V3 and R1 efficiencies:
- 32B active parameters instead of likely ~220B for GPT4 ⇒ 6.8x lower training and inference cost
- 8-bit training instead of 16-bit ⇒ 4x lower training cost
- No margin on commercial inference ⇒ maybe 3x
- Multi-token training ⇒ ~2x training efficiency, ~3x inference efficiency, and lower inference latency by baking in “speculative decoding”; possibly 4x fewer training steps for the same number of tokens if predicting tokens only once
- Additional cost savings from memory optimization, especially for long contexts (Multi-Head Latent Attention) ⇒ ?x
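Naively multiplying the factors above gives an order-of-magnitude feel; this is a rough sketch that treats the factors as independent (they are not, and several are guesses), so it over-counts:

```python
from math import prod

# Rough guesses taken from the bullet points above; multiplying them assumes independence.
training_factors  = [6.8, 4, 2]    # active params, 8-bit training, multi-token training
inference_factors = [6.8, 3, 3]    # active params, no commercial margin, multi-token inference

print(f"Naive combined training factor:  ~{prod(training_factors):.0f}x")   # ~54x
print(f"Naive combined inference factor: ~{prod(inference_factors):.0f}x")  # ~61x
```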
Nothing is very surprising (except maybe the last bullet point for me, because I know less about it).
The surprising part is why big AI labs were not pursuing these obvious strategies.
8-bit training was obvious, multi-token prediction was obvious, and more and smaller experts in MoE were obvious. All three had already been demonstrated and published in the literature. These techniques may be bottlenecked by communication, GPU utilization, and memory at the largest model scales.
Types of SFCs: Actual, Potential, Precluded
Maybe move this section to be the 1st section of the post.
It seems that your point applies significantly more to “zero-sum markets”. So it may be worth noting that it may not apply to altruistic people who work on AI safety non-instrumentally.
Models trained for HHH are likely not trained to be corrigible. Models should be trained to be corrigible too, in addition to other propensities.
Corrigibility may be included in Helpfulness (alone), but when Harmlessness is added, the corrigibility that is conditional on being changed to become harmful is removed. So the result is not that surprising from that point of view.
People may be blind to the fact that improvements from GPT-2 to GPT-3 to GPT-4 were driven both by scaling training compute (by ~2 OOM between each generation) and (the hidden part) by scaling test-time compute through long context and CoT (likely 1.5-2 OOM between each generation too).
If GPT-5 uses just 2 OOM more training compute than GPT-4 but the same test-time compute, then we should not expect “similar” gains; we should expect roughly “half”.
o1 may use ~2 OOM more test-time compute than GPT-4. So GPT-4 ⇒ o1 + GPT-5 could be expected to be similar to GPT-3 ⇒ GPT-4.
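Making the arithmetic explicit, using the rough, illustrative OOM figures above (guesses, not measurements):

$$\begin{aligned}
\text{GPT-3}\to\text{GPT-4}:&\ \sim 2\ \text{OOM (training)} + 1.5\text{-}2\ \text{OOM (test)} \approx 3.5\text{-}4\ \text{OOM total}\\
\text{GPT-4}\to\text{GPT-5 (same test compute)}:&\ \sim 2 + 0 = 2\ \text{OOM total, i.e. roughly half the previous jump}\\
\text{GPT-4}\to\text{o1 + GPT-5}:&\ \sim 2 + 2 = 4\ \text{OOM total, comparable to GPT-3}\to\text{GPT-4}
\end{aligned}$$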
Speculations on (near) Out-Of-Distribution (OOD) regimes
- [Absence of extractable information] The model can no longer extract any relevant information. Models may behave more and more similarly to their baseline behavior in this regime. Models may learn the heuristic to ignore uninformative data, and this heuristic may generalize pretty far. Publication supporting this regime: “Deep Neural Networks Tend To Extrapolate Predictably”.
- [Extreme information] The model can still extract information, but the features extracted are becoming extreme in value (“extreme” = a range never seen during training). Models may keep behaving in the same way as “at the In-Distribution (ID) border”. Models may learn the heuristic that for extreme inputs, they should keep behaving as they do for inputs in the same embedding direction that are still ID.
- [Inner OOD] The model observes a mix of feature values that it never saw during training, but none of these feature values are by themselves OOD. For example, the input is located between two populated planes. Models may learn the heuristic to use a (mixed) policy composed of the closest ID behaviors.
- [Far/Disrupting OOD] This happens in one of the other three regimes when the inputs break the OOD heuristics learned by the model. Such inputs can be found by adversarial search or by moving extremely far OOD.
- [Fine-Tuning (FT) or Jailbreaking OOD] The inference distribution is OOD with respect to the FT distribution. The model then stops using the heuristics learned during FT and starts using those learned during pretraining (the inference is still ID with respect to the pretraining distribution).
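As a toy illustration of the “extreme information” and “inner OOD” probes (not evidence for the heuristics above, just what such probes look like), here is a minimal sketch with a small MLP trained on two clusters; the data, architecture, and probe points are all made up for the example:

```python
# Toy probes for two of the regimes above ("extreme information" and "inner OOD"),
# plus an in-distribution reference point. Everything here is illustrative only.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Two training clusters ("populated planes") with opposite targets.
x_a = rng.normal(loc=[-2.0, 0.0], scale=0.3, size=(500, 2))
x_b = rng.normal(loc=[+2.0, 0.0], scale=0.3, size=(500, 2))
X = np.vstack([x_a, x_b])
y = np.concatenate([np.full(500, -1.0), np.full(500, +1.0)])

model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0).fit(X, y)

probes = {
    "ID (inside cluster A)":           [[-2.0, 0.0]],
    "Extreme information (same dir.)": [[-20.0, 0.0]],  # 10x beyond the ID border
    "Inner OOD (between clusters)":    [[0.0, 0.0]],
}
for name, point in probes.items():
    print(f"{name:33s} -> prediction {model.predict(np.array(point))[0]:+.2f}")
```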
> If it takes a human 1 month to solve a difficult problem, it seems unlikely that a less capable human who can’t solve it within 20 years of effort can still succeed in 40 years.
Since the scaling is logarithmic, your example seems to be a strawman.
The real claim debated is more something like:
“If it takes a human 1 month to solve a difficult problem, it seems unlikely that a less capable human who can’t solve it within 100 months of effort can still succeed in 10,000 months.” And this formulation doesn’t seem obviously true.
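Put differently, if capability gaps correspond to roughly constant ratios of effort (log scaling), the original example only spans a 2x jump in effort, while the reformulated claim spans two successive 100x jumps:

$$\frac{40\ \text{years}}{20\ \text{years}} = 2 \qquad \text{vs.} \qquad \frac{100\ \text{months}}{1\ \text{month}} = \frac{10{,}000\ \text{months}}{100\ \text{months}} = 100$$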
Ten months later, which papers would you recommend for SOTA explanations of how generalisation works?
From my quick research:
- “Explaining grokking through circuit efficiency” seems great at explaining and describing grokking
- “Unified View of Grokking, Double Descent and Emergent Abilities: A Comprehensive Study on Algorithm Task” proposes a plausible unified view of grokking and double descent (and a guess at a link with emergent capabilities and multi-task training). I especially like their summary plot:
For the information of readers and the author: I am (independently) working on a project about narrowing down the moral values of alien civilizations on the verge of creating an ASI and becoming space-faring. The goal is to inform the prioritization of longtermist interventions.
I will gladly build on your content, which aggregates and beautifully expands several key mechanisms (individual selection (“Darwinian demon”), kin selection/multilevel selection (“Darwinian angel”), filters (“Fragility of Life Hypothesis”)) that I use among others (e.g. sequential races, cultural evolution, accelerating growth stages, etc.).
Thanks for the post!
If the following correlations are true, then the opposite may be true (slave morality being better for improving the world through history):
- Improving the world being strongly correlated with economic growth (this is probably less true when X-risks are significant)
- Economic growth being strongly correlated with entrepreneurship incentives (property rights, autonomy, fairness, meritocracy, low rents)
- Master morality being strongly correlated with acquiring power, and thus with decreasing the power of others and decreasing their entrepreneurship incentives
Talk about the sequential, temporal nature of this race, i.e. that the final winners won each race one after the other.