Compute is definitely important for experiments. The limits undoubtedly slow China’s progress, but what’s more difficult to determine is whether global progress is slower or not. In the toy scenario where China’s research focus is exactly parallel to and duplicative of Western efforts, they contribute nothing to global progress unless they are faster. More realistically, research space is high-dimensional, and you are likely correct that the decreased magnitude of their research vector outweighs any extra orthogonality benefits, but I don’t know how to apply numbers to that tradeoff.
Archimedes’s Shortform
Limiting China’s computing power via export controls on hardware like GPUs might be accelerating global progress in AI capabilities.
When Chinese labs are compute-starved, their research will differentially focus on efficiency gains compared to counterfactual universes where they are less limited. So far, they’ve been publishing their research, and their tricks can quickly be incorporated by anyone else. US players can leverage their compute power, focusing on experiments and scaling while effectively delegating the research topics that China is motivated to handle.
Google and OpenAI benefit far more from DeepSeek than they do from Meta.
Epoch.ai just released this today:
https://epoch.ai/gradient-updates/the-case-for-multi-decade-ai-timelines
Excerpt:
I would summarize the key cruxes that separate my views from people who have shorter timelines as follows:
I don’t see the trends that one would extrapolate in order to arrive at very short timelines on the order of a few years. The obvious trend extrapolations for AI’s economic impact give timelines to full remote work automation of around a decade, and I expect these trends to slow down by default.
I don’t buy the software-only singularity as a plausible mechanism for how existing rates of growth in AI’s real-world impact could suddenly and dramatically accelerate by an order of magnitude, mostly because I put much more weight on bottlenecks coming from experimental compute and real-world data. This kind of speedup is essential to popular accounts of why we should expect timelines much shorter than 10 years to remote work automation.
I think intuitions for how fast AI systems would be able to think and how many of them we would be able to deploy that come from narrow writing, coding, or reasoning tasks are very misguided due to Moravec’s paradox. In practice, I expect AI systems to become slower and more expensive as we ask them to perform agentic, multimodal, and long-context tasks. This has already been happening with the rise of AI agents, and I expect this trend to continue in the future.
Can you update the instructions? The following doesn’t work.
```
<details>
<summary> Summary here </summary>
Details here
</details>
```
This does work:

```
+++ Summary here
Details here
+++
```
like this:
Summary here
Details here
For the 3x3 grid problem, Gemini 2.5 immediately went to a breadth-first search but didn’t actually implement it or return the correct solution.
Gemini 2.5 Pro Preview output:
Okay, let’s solve this step-by-step.
This is a sliding tile puzzle on a 3x3 grid with 6 named tiles and 3 empty spaces. We need to find the minimum number of moves to get from the initial configuration to the target configuration. A move consists of sliding a tile into an orthogonally adjacent empty square.
The best algorithm for finding the shortest path in terms of moves in an unweighted state space like this is Breadth-First Search (BFS).
Define States:

Initial State:

```
Sun      Leaf     (Empty)
Bell     Snake    Star
(Empty)  Flower   (Empty)
```

Tuple representation: `('Sun', 'Leaf', 'E', 'Bell', 'Snake', 'Star', 'E', 'Flower', 'E')` (E = Empty)

Target State:

```
Snake    Star     Flower
(Empty)  Sun      (Empty)
Leaf     (Empty)  Bell
```

Tuple representation: `('Snake', 'Star', 'Flower', 'E', 'Sun', 'E', 'Leaf', 'E', 'Bell')`
BFS Algorithm:

1. Create a queue and add the initial state with 0 moves: `queue = [ (initial_state, 0) ]`
2. Create a set to store visited states to avoid cycles: `visited = {initial_state}`
3. While the queue is not empty:
   - Dequeue the current state and its move count: `current_state, moves = queue.pop(0)`
   - If `current_state` is the `target_state`, return `moves`.
   - Find all possible next states by moving a tile into an adjacent empty space.
   - For each `next_state`: if `next_state` has not been visited, add `next_state` to `visited` and enqueue `(next_state, moves + 1)`.
Generating Next States:

- Find the index (0-8) of each empty square (`'E'`) in the current state tuple.
- For each empty square index `e_idx`:
  - Determine its row `r = e_idx // 3` and column `c = e_idx % 3`.
  - Consider its orthogonal neighbors (up, down, left, right): `(r-1, c)`, `(r+1, c)`, `(r, c-1)`, `(r, c+1)`.
  - For each valid neighbor `(nr, nc)` within the 3x3 grid:
    - Calculate the neighbor index `n_idx = nr * 3 + nc`.
    - Create the `next_state` by swapping the elements at `e_idx` and `n_idx` in the `current_state` tuple.
    - Yield this `next_state`.
Execution (Conceptual):
Start with the initial state at level 0.
Find all states reachable in 1 move (level 1).
Find all new states reachable from level 1 states in 1 move (level 2).
Continue level by level until the target state is found.
Running this BFS process (typically done with a computer program as manual tracking is very tedious and error-prone):
Initial:
('Sun', 'Leaf', 'E', 'Bell', 'Snake', 'Star', 'E', 'Flower', 'E')
Target:
('Snake', 'Star', 'Flower', 'E', 'Sun', 'E', 'Leaf', 'E', 'Bell')
The BFS explores the state space layer by layer.

Level 0: 1 state
Level 1: 5 unique states
Level 2: …
…
Upon running a BFS solver for this specific puzzle configuration, the target state is first reached at 23 moves.
Final Answer: The final answer is 23.
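For reference, the BFS it describes takes only a few lines to actually implement. Here’s a minimal sketch (my own code, not the model’s; I use `collections.deque` rather than the transcript’s `list.pop(0)` for O(1) dequeues):

```python
from collections import deque

# States are 9-tuples read row-major from the 3x3 grid; 'E' marks an empty square.
initial_state = ('Sun', 'Leaf', 'E', 'Bell', 'Snake', 'Star', 'E', 'Flower', 'E')
target_state = ('Snake', 'Star', 'Flower', 'E', 'Sun', 'E', 'Leaf', 'E', 'Bell')

def next_states(state):
    """Yield every state reachable by sliding one tile into an adjacent empty square."""
    for e_idx, cell in enumerate(state):
        if cell != 'E':
            continue
        r, c = divmod(e_idx, 3)
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < 3 and 0 <= nc < 3:
                n_idx = nr * 3 + nc
                if state[n_idx] == 'E':
                    continue  # swapping two empties isn't a move
                nxt = list(state)
                nxt[e_idx], nxt[n_idx] = nxt[n_idx], nxt[e_idx]
                yield tuple(nxt)

def min_moves(start, goal):
    """Breadth-first search over puzzle states; returns the minimum move count."""
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        state, moves = queue.popleft()
        if state == goal:
            return moves
        for nxt in next_states(state):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, moves + 1))
    return None  # unreachable

print(min_moves(initial_state, target_state))
```

The state space here is at most 9!/3! = 60,480 states, so this finishes essentially instantly. Since the model claimed 23 without executing anything, running a script like this is the only way to find out whether that number is real.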
I may feel smug if the “novel idea” is basically a worse version of an existing one, but there are more interesting possibilities to probe for.

More likely to be rounded away:

The novel idea is a meaningful extension/generalization of an existing concept. E.g., Riemann → Lebesgue integration.
The novel idea is equivalent to an existing concept but formulated differently. E.g., the Newton and Leibniz formulations of calculus.
The novel idea is a more detailed explanation of an existing concept. E.g., chemical bonding → molecular orbital theory.
Less likely to be rounded away:
The novel idea overlaps with existing concepts but is neither a subset nor an extension. E.g., General Relativity and Quantum Mechanics.
The novel idea applies existing concepts to a new domain. E.g., applying information theory to DNA.
The novel idea synthesizes multiple existing concepts into a greater whole. E.g., Darwinian evolution as a combination of Malthusian population dynamics and natural variation.
The novel idea provides a unifying framework for previously disconnected concepts. E.g., Maxwell’s equations unifying electricity, magnetism, and optics.
Nearly all conceptual rounding errors will involve nothing as grand as the extreme examples I gave, but often there is still something worth examining.
In the poetry case study, we had set out to show that the model didn’t plan ahead, and found instead that it did.
I found it shocking that they didn’t think the model plans ahead. The poetry ability of LLMs since at least GPT-2 is well beyond what feels possible without anticipating a rhyme by planning at least a handful of tokens in advance.
It’s also worth trying a different model. I was going back and forth with an OpenAI model (I don’t remember which one) and couldn’t get it to do what I needed at all, even with multiple fresh threads. Then I tried Claude and it just worked.
Yep. Meme NFTs are an existence proof of such people.
https://en.wikipedia.org/wiki/List_of_most_expensive_non-fungible_tokens
Strongly subsidizing the costs of raising children (and not just in financial terms) would likely produce more pro-social results than a large one-time lump-sum payment. However, that won’t do much for folks skipping out on children because they think humanity is doomed soon anyway.
I suspect LLMs could write blogs on par with most humans if we trained and scaffolded them appropriately, but is that really what we want from LLMs?
Claude 3.7 might not write outstanding blogs but he can help explain why not:
The fundamental mismatch between LLMs and blogging isn’t primarily about capabilities, but about design and motivation:
Current LLMs are RLHF-tuned to be balanced, helpful assistants—essentially the opposite of good bloggers. Assistants hedge, acknowledge all perspectives, and avoid strong stances. Good bloggers take intellectual risks, have distinctive voices, and present unique viewpoints.
Humans blog for reasons LLMs simply don’t have:
Building intellectual reputation in a community
Working through personal confusions
Creative self-expression
The social reward of changing minds
The metrics we use to evaluate LLMs (helpfulness, accuracy, harmlessness) don’t capture what makes blogs compelling (novelty, intellectual risk-taking, personality).
Simply making LLMs more capable won’t bridge this gap. We’d need systems with fundamentally different optimization targets—ones trained to be interesting rather than helpful, to develop consistent viewpoints rather than being balanced, and to prioritize novel insights over comprehensive coverage.
FYI, there has been even further progress with Leela odds nets. Here are some recent quotes from GM Larry Kaufman (a.k.a. Hissha) found on the Leela Chess Zero Discord:
(2025-03-04) I completed an analysis of how the Leela odds nets have performed on LiChess since the search-contempt upgrade on Feb. 27. [...] I believe these are reasonable estimates of the LiChess Blitz rating needed to break even with the bots at 5′3″ in serious play. Queen and move odds (means Leela plays Black) 2400, Queen odds (Leela White) 2550, [...] Rook and move odds (Leela Black) 3000, Rook odds (Leela White) 3050, knight odds 3200. For comparison only a few top humans exceed 3000, with Magnus at 3131. So based on this, even Magnus would lose a match at 5′3″ with knight odds, while perhaps the top five blitz players in the world would win a match at rook odds. Maybe about top fifty could win a match at queen for knight. At queen odds (Leela White), a “par” (FIDE 2400) IM should come out ahead, while a “par” (FIDE 2300) FM should come out behind.
(2025-03-07) Yes, there have to be limits to what is possible, but we keep blowing by what we thought those limits were! A decade ago, blitz games (3′2″) were pretty even between the best engine (then Komodo) and “par” GMs at knight odds. Maybe some people imagined that some day we could push that to being even at rook odds, but if anyone had suggested queen odds that would have been taken as a joke. And yet, if we’re not there already, we are closing in on it. Similarly at Classical time controls, we could barely give knight odds to players with ratings like FIDE 2100 back then, giving knight odds to “par” GMs in Classical seemed like an impossible goal. Now I think we are already there, and giving rook odds to players in Classical at least seems a realistic goal. What it means is that chess is more complicated than we thought it was.
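To put those break-even numbers in perspective, here’s a rough back-of-the-envelope using the standard Elo expected-score formula (a simplification on my part: LiChess ratings are actually Glicko-2, and Kaufman’s numbers are estimates, but it gives the flavor):

```python
def expected_score(r_player, r_opponent):
    """Standard Elo expectation: average score per game for the player."""
    return 1 / (1 + 10 ** ((r_opponent - r_player) / 400))

magnus_blitz = 3131  # LiChess blitz rating quoted above

# Treat Kaufman's break-even estimates as the bot's effective rating at each handicap.
for handicap, breakeven in [("knight odds", 3200), ("rook odds (Leela White)", 3050)]:
    print(f"{handicap}: expected score ≈ {expected_score(magnus_blitz, breakeven):.2f}")
```

That comes out to roughly 0.40 per game at knight odds and 0.61 at rook odds, consistent with Kaufman’s read that Magnus would lose the former match and win the latter.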
I have so many mixed feelings about schooling that I’m glad I don’t have my own children to worry about. There is enormous potential for improving things, yet so little of that potential gets realized.
The thing about school choice is that funding is largely zero-sum. Those with the means to choose better options than public schools take advantage of them, leaving underfunded public schools to serve the least privileged remainder. My public school teacher friends end up with disproportionately large fractions of children with special needs who require extra care and attention, without the support to serve them effectively. As a result, all the students and all the teachers suffer. How do we do right by all these individuals? Private schools can largely avoid accommodating these students because they can choose whom to admit; public schools cannot. It reminds me of insurance companies choosing not to cover certain people, who then have no affordable coverage options. It’s not an unsolvable problem in theory, but reality is messy and politically fraught.
I don’t think it’s accurate to model breakdowns as a linear function of journeys or train-miles unless irregular effects like extreme weather are a negligible fraction of breakdowns.
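A toy example of why (every number here is a made-up assumption): if routine wear scales with exposure but extreme weather adds shocks that don’t, expected breakdowns are affine in train-miles rather than proportional, and a linear extrapolation degrades as the weather term grows.

```python
# Toy model (all parameters are illustrative assumptions, not real data):
# routine breakdowns scale with train-miles; weather shocks do not.
base_rate = 0.002    # assumed routine breakdowns per train-mile
storm_freq = 0.05    # assumed fraction of days with extreme weather
storm_shock = 100.0  # assumed mean extra breakdowns on a storm day

for miles_per_day in (5_000, 10_000, 20_000):
    mean_breakdowns = base_rate * miles_per_day + storm_freq * storm_shock
    print(miles_per_day, mean_breakdowns)
# Doubling train-miles from 10,000 to 20,000 takes mean breakdowns from
# 25 to 45 -- a 1.8x increase, not 2x, because the weather term is fixed.
```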
How does the falling price factor into an investor’s decision to enter the market? Should they wait for batteries to get even cheaper, or should they invest immediately and hope the arbitrage rates hold up long enough to provide a good return on investment? The longer the payback period, the more these dynamics matter.
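As a toy illustration of that tension (every number below is a made-up assumption, not market data), waiting buys cheaper capex but forfeits a year of revenue in a market where arbitrage spreads may themselves be eroding:

```python
# Toy NPV comparison: invest in grid batteries now vs. wait a year.
# Every parameter is an illustrative assumption, not market data.
capex_now = 1_000_000       # assumed system cost today ($)
price_decline = 0.15        # assumed annual decline in battery prices
revenue_year_one = 200_000  # assumed first-year arbitrage revenue ($)
revenue_erosion = 0.05      # assumed annual erosion as more storage enters
discount_rate = 0.08
horizon_years = 10          # assumed evaluation horizon

def npv(start_year):
    """Discounted profit of buying in start_year and operating through the horizon."""
    capex = capex_now * (1 - price_decline) ** start_year
    total = -capex / (1 + discount_rate) ** start_year
    for t in range(start_year, horizon_years):
        revenue = revenue_year_one * (1 - revenue_erosion) ** t
        total += revenue / (1 + discount_rate) ** (t + 1)
    return total

print(f"invest now:  {npv(0):>10,.0f}")
print(f"wait a year: {npv(1):>10,.0f}")
```

Under these particular made-up numbers, waiting narrowly wins; slow the price decline or speed up the revenue erosion and it flips. The point is just that the answer is sensitive to exactly the dynamics you’re asking about.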
“10x engineers” are a thing, and if we assume they’re high-agency people always looking to streamline and improve their workflows, we should expect them to be precisely the people who get a further 10x boost from LLMs.
I highly doubt this. A 10x engineer is likely already bottlenecked by non-coding work that AI can’t help with, so even if they 10x their coding, they may not increase overall productivity much.
I’d rather see the prison system less barbaric than try to find ways of intentionally inflicting that level of barbarism in a compressed form.
Regardless, I think you still need confinement of some sort for people who are dangerous but not deserving of the death penalty.
It may or may not be a first step toward foom, but automated improvements are still improvements, regardless of how “innovative-in-themselves” we consider them to be. Improving on an algorithm that’s been the SOTA since 1969 is cool, even if it was done purely via brute force.
For now, it looks like it “only” found minor improvements on various SOTA, but this was done with previous generation models (a mix of Gemini 2.0 Flash and Pro)[1]. I’d expect next-gen models and next-gen scaffolds to be another step up.
Models used. AlphaEvolve employs an ensemble of large language models. Specifically, we utilize a combination of Gemini 2.0 Flash and Gemini 2.0 Pro. This ensemble approach allows us to balance computational throughput with the quality of generated solutions.