How do Anthropic’s and xAI’s compute compare over this period?
Could you say more about how you think S-risks could arise from the first attractor state?
An LLM trained with a sufficient amount of RL could perhaps learn to compress its thoughts into more efficient representations than English text, which seems consistent with the statement. I’m not sure whether this happens in practice; I’ve asked here if anyone knows of public examples.
Makes sense. Perhaps we’ll know more when o3 is released. If the model doesn’t offer a summary of its CoT, that makes neuralese more likely.
I’ve often heard it said that doing RL on chain of thought will lead to ‘neuralese’ (e.g. most recently in Ryan Greenblatt’s excellent post on scheming). This seems important for alignment. Does anyone know of public examples of models developing, or being trained to use, neuralese?
> (Based on public knowledge, it seems plausible (perhaps 25% likely) that o3 uses neuralese which could put it in this category.)
What public knowledge has led you to this estimate?
I was able to replicate this result. Given o1’s other impressive results, I wonder if the model is intentionally sandbagging. If it’s trained to maximize human feedback, this might be an optimal strategy when playing zero-sum games.
> I was grateful for the experiences and the details of how he prepares for conversations and framing AI that he imparted on me.
I’m curious, what was his strategy for preparing for these discussions? What did he discuss?
> This updated how I perceive the “show down” focused crowd
possible typo?
> Also, I think under-elicitation is a current problem causing erroneously low results (false negatives) on dangerous capabilities evals. Seeing more robust elicitation (including fine-tuning!!) would make me more confident about the results of evals.
I’m confused about how to think about this. Are there any evals where fine-tuning on a sufficient amount of data wouldn’t saturate the eval? E.g. if there’s an eval measuring knowledge of virology, then I would predict that fine-tuning on 1B tokens of the relevant virology papers would lead to a large increase in performance. This might be true even if the 1B tokens were already in the pretraining dataset, because in some sense it’s the most recent data that the model has seen.
[Question] 2025 Alignment Predictions
I am also increasingly wondering if talking too much to LLMs is an infohazard, akin to taking up psychedelics, TikTok, or meditation as a habit.
Why is meditation an infohazard?
> For a human mind, most data it learns is probably to a large extent self-generated, synthetic, so only having access to much less external data is not a big issue.
Could you say more about this? What do you think is the ratio of external to internal data?
That’s a good point; it could be consensus.
Thoughts on o3 and search:
[Epistemic status: not new, but I thought I should share]
An intuition I’ve had for some time is that search is what enables an agent to control the future. I’m a chess player rated around 2000. The difference between me and Magnus Carlsen is that in complex positions he can search much further for a win, such that I would have virtually no chance against him; the difference between me and an amateur chess player is similarly vast. It’s not just about winning either: in Shogi, the top professionals, once they know they have won, keep searching over future variations to find the most aesthetically appealing mate.
This is one of the reasons I’m concerned about AI. It’s not bound by the same constraints of time, energy, and memory as humans, and as such it can search through possible futures very deeply to find the narrow path in which it achieves its goal. o3 looks to be on this path. It has both very long chains of thought (depth of search) and the ability to parallelize across multiple instances (best-of-n sampling, which is what solved ARC-AGI). To be clear, I don’t think this search is very efficient, and there are many obvious ways it could be improved, e.g. recurrent architectures that don’t waste as much compute computing logprobs over several tokens only to sample one, or multi-token prediction objectives for the base model as in DeepSeek-V3. But the basis for search is there. Until now, it seemed like AI was improving its intuition; now it can finally begin to think.
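To make the depth/breadth framing concrete, here’s a toy sketch of best-of-n sampling as search. Everything below is illustrative: the generator and scorer are placeholder functions, and I’m not claiming this is how o3 actually works.

```python
import random

# Toy best-of-n search: sample n candidate chains of thought, score each with
# a verifier, and keep the best one. `sample_chain` and `score` are stand-ins
# for a real model and a real verifier/reward model.

def sample_chain(problem, rng, length=10):
    # Placeholder for sampling one long chain of thought from a model.
    return [rng.random() for _ in range(length)]

def score(problem, chain):
    # Placeholder for a learned or programmatic scorer of the final answer.
    return sum(chain)

def best_of_n(problem, n, seed=0):
    rng = random.Random(seed)
    candidates = [sample_chain(problem, rng) for _ in range(n)]
    return max(candidates, key=lambda c: score(problem, c))

# Depth of search ~ chain length; breadth of search ~ n.
best = best_of_n(problem="toy problem", n=1024)
```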
Concretely, I expect that by 2030, AI systems will use as much compute at inference time on hard problems as is currently used to pretrain the largest models. Possibly more, if humanity is not around by then.
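As a rough sanity check on what that prediction implies (the model size and FLOPs-per-token figures below are my own placeholder assumptions, not numbers from anywhere authoritative):

```python
# Back-of-the-envelope: how many generated tokens would it take for inference
# on one problem to match a pretraining-scale compute budget?
# All numbers are assumptions for illustration.

pretrain_flops = 4e26          # rough scale of a frontier 2024-era pretraining run
params = 1e12                  # assumed 1T-parameter model
flops_per_token = 2 * params   # ~2N FLOPs per generated token (forward pass only)

tokens_needed = pretrain_flops / flops_per_token
print(f"{tokens_needed:.1e} tokens")  # ~2e14 tokens, e.g. ~2e8 rollouts of 1e6 tokens each
```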
anaguma’s Shortform
This makes sense; I think you could be right. Llama 4 should give us more evidence on numerical precision and the scaling of experts.
DeepSeek-V3 is one example, and SemiAnalysis has claimed that most labs use FP8.
> FP8 Training is important as it speeds up training compared to BF16 & most frontier labs use FP8 Training.
> In 2024, there were multiple sightings of training systems at the scale of 100K H100. Microsoft’s 3 buildings in Goodyear, Arizona, xAI’s Memphis cluster, Meta’s training system for Llama 4. Such systems cost $5bn, need 150 MW, and can pretrain a 4e26 FLOPs model in 4 months.
> Then there are Google’s 100K TPUv6e clusters and Amazon’s 400K Trn2 cluster. Performance of a TPUv6e in dense BF16 is close to that of an H100, while 400K Trn2 produce about as much compute as 250K H100.
> Anthropic might need more time than the other players to get its new hardware running, but there is also an advantage to Trn2 and TPUv6e over H100, larger scale-up domains that enable more tensor parallelism and smaller minibatch sizes. This might be an issue when training on H100 at this scale[1] and explain some scaling difficulties for labs that are not Google, or Anthropic later in 2025 once the Trn2 cluster becomes useful.
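(A quick back-of-the-envelope check of the 4e26 figure in the quote above; the per-chip throughput and utilization numbers are my own assumptions, not from the quoted comment.)

```python
# Quick check of the "100K H100 -> 4e26 FLOPs in 4 months" claim.

chips = 100_000
peak_bf16_flops = 989e12       # assumed H100 dense BF16 peak, FLOP/s
mfu = 0.4                      # assumed model FLOPs utilization
seconds = 4 * 30 * 24 * 3600   # ~4 months

total_flops = chips * peak_bf16_flops * mfu * seconds
print(f"{total_flops:.1e} FLOPs")  # ~4e26, consistent with the quoted estimate
```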
Do we know much about TPU and Trn2 performance at lower precision? I expect most training runs are using 4-8 bit precision by this point.
> Note that “The AI Safety Community” is not part of this list. I think external people without much capital just won’t have that much leverage over what happens.
What would you advise for external people with some amount of capital, say $5M? How would this change for each of the years 2025-2027?
This is interesting. Can you say more about these experiments?