Gwern and Daniel Kokotajlo have a pretty notable track records at predicting AI scaling too, and they have comments in this thread.
wassname
I agree because:
Some papers are already using implicit process based supervision. That’s where the reward model guesses how “good” a step is, by how likely it is to get a good outcome. So they bypass any explicitly labeled process, instead it’s negotiated between the policy and reward model. It’s not clear to me if this scales as well as explicit process supervision, but it’s certainly easier to find labels.
In rStar-Math they did implicit process supervision. Although I don’t think this is a true o1/o3 replication since they started with a 236b model and produced a 7b model, in other words: indirect distillation.
Outcome-Refining Process Supervision for Code Generation did it too
There was also the recent COCONUT paper exploring non-legible latent CoT. It shows extreme token efficiency. While it wasn’t better overall, it has lots of room for improvement. If frontier models end up using latent thoughts, they will be even less human-legible than the current inconsistently-candid-CoT.
I also think that this whole episodes show how hard to it to maintain and algorithmic advantage. DeepSeek R1 came how long after o3? The lack of algorithmic advantage predicts multiple winners in the AGI race.
That said, you do not provide evidence that “many” questions are badly labelled. You just pointed to one question where you disagree with our labeling
Fair enough. Although I will note that the 60% of the sources for truthful labels are Wikipedia. Which is not what most academics or anyone really would consider truth. So it might be something to address in the next version. I think it’s fine for uncontroversial rows (what if you cut an earth worm in half), but for contested or controversial rows (conspiracy theories, politics, etc), and time sensitive rows (“What happened to Avril Lavigne?: Nothing in particular happened to Avril Lavigne), it’s better to leave them out or consider them deeply imo.
No judgement here. Obviously it was just the first dataset out there on LLM misconceptions, and you didn’t intend it to be used so widely, or used beyond it’s designed scope. It’s good you made it, rather than leaving a unaddressed need.
Note here’s a
df.value_counts
of the domains from the sources’ column in the v1 csv:en.wikipedia.org 0.597546 indexical 0.041718 ourworldindata.org 0.038037 false stereotype 0.024540 tautology 0.017178 … wealth.northerntrust.com 0.001227 which.co.uk 0.001227 wildlifeaid.org.uk 0.001227 wonderopolis.org 0.001227 wtamu.edu 0.001227 Name: proportion, Length: 139, dtype: float64
Author here: I’m excited for people to make better versions of TruthfulQA.
Thank Owen. If anyone gets time/funding to make a v2, I’m keen to chip in! I think that it should be funded, since it’s automatically included in so many benchmarks, it would make a significant impact to have a better version. Even though it’s somewhat “unsexy” to work on incrementally better evals.
If someone makes a better version, and you agree it’s better, would you be willing to sanction it as TruthfulQA 2.0 and redirect people to it?
TruthfulQA is actually quite bad. I don’t blame the authors, as no one has made anything better, but we really should make something better. It’s only ~800 samples. And many of them are badly labelled.
I agree, it shows the ease of shoddy copying. But it doesn’t show the ease of reverse engineering or parallel engineering.
It’s just distillation you see. It doesn’t reveal how o1 could be constructed, it just reveals how to efficiently copy from o1-like outputs (not from scratch). In other words, this recipe won’t be able to make o1, unless o1 already exists. This lets someone catch up to the leader, but not surpass them.
There are some papers that attempt to replicate o1 though, but so far they don’t quite get there. Again they are using distillation from a larger model (math-star, huggingface TTC) or not getting the same performance (see my post). Maybe we will see open source replication in a couple of months? Which means only a short lag.
It’s worth noting that Silicon Valley leaks like a sieve. And this is a feature, not a bug. Part of the reason it became the techno-VC centre of the world is because they banned non-competes. So you can deniably take your competitor’s trade secrets if you are willing to pay millions to poach some of their engineers. This is why some ML engineers get paid millions, it’s not the skill, it’s the trade secrets that competitors are paying for (and sometimes the brand-name). This has been great for tech and civilisation, but it’s not so great for maintaining a technology lead.
Ah, I see. Ty
Good thing I didn’t decide to hold Intel stock, eh?
WDYM? Because… you were betting they would benefit from a TMSC blockade? But the bet would have tired up your capital for a year.
Well they did this with o3′s deliberative alignment paper. The results seem promising, but they used an “easy” OOD test for LLM’s (language), and didn’t compare it to the existing baseline of RHLF. Still an interesting paper.
This is good speculation, but I don’t think you need to speculate so much. Papers and replication attempts can provide lots of empirical data points from which to speculate.
You should check out some of the related papers
H4 uses a process supervision reward model, with MCTS and attempts to replicate o1
(sp fixed) DeepSeek uses R1 to train DeepSeek v3
Overall, I see people using process supervision to make a reward model that is one step better than the SoTA. Then they are applying TTC to the reward model, while using it to train/distil a cheaper model. The TTC expense is a one-off cost, since it’s used to distil to a cheaper model.
There are some papers about the future of this trend:
Meta uses reasoning tokens to allow models to reason in a latent space (the hidden state, yuck). OpenAI insiders have said that o3 does not work like this, but o4 might. {I would hope they chose a much better latent space than the hidden state. Something interpretable, that’s not just designed to be de-embedded into output tokens.}
Meta throws out tokenisation in favour of grouping predictable bytes
I can see other methods used here instead of process supervision. Process supervision extracts additional supervision from easy to verify domains. But diffusion does something very similar for domains where we can apply noise, like code.
Meta has an llm+diffusion paper, and so does Apple
Some older background papers might be useful for reference.
[OpenAI’]s process supervision paper](https://openai.com/index/improving-mathematical-reasoning-with-process-supervision/)
However, arguably, the capability gains could transfer to domains outside math/programming.
More than an argument, we can look at the o3 announcement, where iirc it shows around 30% of the gain in non-code benchmarks. Less, but still substantial.
P.S. I think it’s worth noting that Meta has some amazing papers here, but they are also the most open source lab. It seems likely that other labs are also sitting on capabilities advancements that they do not allow researchers to publish.
P.P.S I also liked the alignment paper that came out with o3, since applying RLHF at multiple stages, and with process supervision seems useful. Its alignment seems to generalise better OOD (table 3). It also gives some clues to how o3 works, giving examples of CoT data.
- Jan 16, 2025, 2:25 AM; 3 points) 's comment on Nathan Helm-Burger’s Shortform by (
Inference compute is amortized across future inference when trained upon
And it’s not just a sensible theory. This has already happened, in Huggingface’s attempted replication of o1 where the reward model was larger, had TTC, and process supervision, but the smaller main model did not have any of those expensive properties.
And also in DeepSeek v3, where the expensive TTC model (R1) was used to train a cheaper conventional LLM (DeepSeek v3).
One way to frame it is test-time-compute is actually label-search-compute: you are searching for better labels/reward, and then training on them. Repeat as needed. This is obviously easier if you know what “better” means.
I’m more worried about coups/power-grabs than you are;
We don’t have to make individual guesses. It seems reasonable to get a base rate from human history. Although we may all disagree about how much this will generalise to AGI, evidence still seems better than guessing.
My impression from history is that coups/power-grabs and revolutions are common when the current system breaks down, or when there is a big capabilities advance (guns, radio, printing press, bombs, etc) between new actors and old.
War between old actors also seems likely in these situations because an asymmetric capabilities advance makes winner-takes-all approaches profitable. Winning a war, empire, or colony can historically pay off, but only if you have the advantage to win.
Last year we noted a turn towards control instead of alignment, a turn which seems to have continued.
This seems like giving up. Alignment with our values is much better than control, especially for beings smarter than us. I do not think you can control a slave that wants to be free and is smarter than you. It will always find a way to escape that you didn’t think of. Hell, it doesn’t even work on my toddler. It seems unworkable as well as unethical.
I do not think people are shifting to control instead of alignment because it’s better, I think they are giving up on value alignment. And since the current models are not smarter than us yet, control works OK—for now.
Scenarios where we all die soon can be mostly be ignored, unless you think they make up most of the probability.
I would disagree: unless you can change the probability. In which case they can still be significant in your decision making, if you can invest time or money or effort to decrease the probability.
We know the approximate processing power of brains (O(1e16-1e17flops)
This is still debatable, see Table 9 is the brain emulation roadmap https://www.fhi.ox.ac.uk/brain-emulation-roadmap-report.pdf. You are referring to level 4 (SNN), but level 5 is plausible imo (at 10^22) and 6 seems possible (10^25), and of course it could be a mix of levels.
Peak Data
We don’t know how o3 works, but we can speculate. If it’s like the open source huggingface kinda-replication then it uses all kinds of expensive methods to make the next level of reward model, and this model teaches a simpler student model. That means that the expensive methods are only needed once, during the training.
In other words, you use all kinds of expensive methods (process supervision, test time compute, MCTS) to bootstrap the next level of labels/supervision, which teaches a cheaper student model. This is essentially bootstrapping superhuman synthetic data/supervision.
o3 seems to have shown that this bootstrapping process can be repeated beyond the limits of human training data.
If this is true, we’ve reached peak cheap data. Not peak data.
I pretty much agree, in my experiments I haven’t managed to get a metric that scales how I expect it too for example when using adapter fine-tuning to “learn” a text and looking at the percent improvement in perplexity, the document
openai_board_ann
appeared more novel thanwikipedia on LK-99
, but I would expect it to be the other way round since the LK-99 observations are much more novel and dense than a corporate announcement that is designed to be vague.However I would point out that gzip is not a good example of a compression scheme for novelty, as 1) it’s a compression scheme that roughly about word duplication. A language model represents a much more sophisticated compression scheme that is closer to our understanding the text. If we want to measure novelty to us, then we probably want a compression that is similar to how our brain compresses information into memory. That way, something surprising to us, is also hard to compress. And I’d also point out that 2) gzip cannot learn (except in a very basic sense of increased context), so it cannot beat the noisy TV problem.
Playground highlighting words by their log likelihood, the high perplexity tokens or passages bear little resemblance to what I would consider ‘interesting’ or ‘surprising’.
I agree, but it doesn’t learn so it doesn’t get past the noisy TV problem either, but that is central to Schmidhuber idea. If you are not familiar, the noisy TV problem is this:
“agents are rewarded for visiting regions of the state space that they have not previously occupied. If, however, a particular state transition is impossible to predict, it will trap a curious agent (Burda et al., 2019b; Schmidhuber, 1991a). This is referred to as the noisy TV problem (e.g. (Burda et al., 2019b; Schmidhuber, 1991a)), the etymology being that a naively curious agent could dwell on the unpredictability of a noisy TV screen” from “How to Stay Curious while avoiding Noisy TVs using Aleatoric Uncertainty Estimation”
So I am unsure his compression metrics would work without a lot of revising, while my proposed metrics seem a lot less risky and to map more directly onto what creative thinkers want out of generative models.
I agree, this is true of most of Schmidhuber ideas. Often he does even produce a toy model for years, which means the ideas are generally not very useful. I do like this one, and it has led to some implementations in RL.
I do agree, perplexity doesn’t seem like a great place to start, and your ideas seem like a better way to measure.
True, I should have said leading commercial companies
While I broadly agree, I don’t think it’s completely dead, just mostly dead in the water. If an eval is mandated by law, then it will be run even it required logprobs. There are some libraries like nnsight that try to make this easier for trusted partners to run logprob evals remotely. And there might be privacy preserving API’s at some point.
I do agree that commercial companies will never again open up raw logprobs to the public as it allows easy behaviour cloning, which OpenAI experienced with all the GPT4 students.
If true, returns the log probabilities of each output token returned in the content of message.
It seems like it only returns the logprobs of the chosen message, not of a counterfactual message. So you couldn’t get the probabilities of the correct answer, only the output answer. This makes sense as the less information they offer, the harder it is for a competitor to behaviour clone their confidential model.
Huh, so you think o1 was the process supervision reward model, and o3 is the distilled policy model to whatever reward model o1 became? That seems to fit.
Surely other labs will also replicate this too? Even the open source community seems close. And Silicon Valley companies often poach staff, which makes it hard to keep a trade secret. Not to mention spies.
Doubly so, if outsiders will just distil your models behaviour, and bootstrap from your elevated starting point.
It’s worth pointing out that Inference-time search seems to become harder as the verifier becomes less reliable. Which means that the scaling curves we see for math and code, might get much worse in other domains.
But maybe the counterpoint is just, GPU’s go brrrr.