Any chance you can post (or PM me) the three problems AIs have already beaten?
This really depends on the definition of “smarter”. There is a valid sense in which Stockfish is “smarter” than any human. Likewise, there are many valid senses in which GPT-4 is “smarter” than some humans, and some valid senses in which GPT-4 is “smarter” than all humans (e. g., at token prediction). There will be senses in which GPT-5 will be “smarter” than a bigger fraction of humans compared to GPT-4, perhaps being smarter than Sam Altman under a bigger set of possible definitions of “smarter”.
Will that actually mean anything? Who knows.
By playing with definitions like this, Sam Altman can simultaneously inspire hype by implication (“GPT-5 will be a superintelligent AGI!!!”) and then, if GPT-5 underdelivers, avoid significant reputational losses by assigning a different meaning to his past words (“here’s a very specific sense in which GPT-5 is smarter than me, that’s what I meant, hype is out of control again, smh”). This is a classic tactic; basically a motte-and-bailey variant.
If anyone is game for creating an agentic research scaffold like that Thane describes
Here’s the basic structure in more detail, as envisioned after 5 minutes’ thought (a rough code sketch follows the list):
You feed a research prompt to the “Outer Loop” of a model, maybe have a back-and-forth fleshing out the details.
The Outer Loop decomposes the research into several promising research directions/parallel subproblems.
Each research direction/subproblem is handed off to a “Subagent” instance of the model.
Each Subagent runs search queries on the web and analyses the results, up to the limits of its context window. After the analysis, it’s prompted to evaluate (1) which of the results/sources are most relevant and which should be thrown out, (2) whether this research direction is promising and what follow-up questions are worth asking.
If a Subagent is very eager to pursue a follow-up question, it can either run a subsequent search query (if there’s enough space in the context window), or it’s prompted to distill its current findings and replace itself with a next-iteration Subagent, in whose context it loads only the most important results + its analyses + the follow-up question.
This is allowed up to some iteration count.
Once all Subagents have completed their research, instantiate an Evaluator instance, into whose context window we dump the final results of each Subagent’s efforts (distilling if necessary). The Evaluator integrates the information from all parallel research directions and determines whether the research prompt has been satisfactorily addressed, and if not, what follow-up questions are worth pursuing.
The Evaluator’s final conclusions are dumped into the Outer Loop’s context (without the source documents, to not overload the context window).
If the Evaluator did not choose to terminate, the next generation of Subagents is spawned, each prompted with whatever contexts are recommended by the Evaluator.
Iterate, spawning further Evaluator instances and Subproblem instances as needed.
Once the Evaluator chooses to terminate, or some compute upper-bound is reached, the Outer Loop instantiates a Summarizer, into which it dumps all of the Evaluator’s analyses + all of the most important search results.
The Summarizer is prompted to generate a high-level report outline, then write out each subsection, then the subsections are patched together into a final report.
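To make that concrete, here’s a minimal code sketch of the structure above. It’s a sketch under stated assumptions, not a working product: `call_model` and `web_search` are hypothetical stand-ins for whatever LLM and search APIs (e. g. Flowise + Firecrawl) you’d actually wire up, and the "FOLLOW_UP:"/"TERMINATE" string conventions are placeholder control signals.

```python
# Minimal sketch of the Outer Loop / Subagent / Evaluator / Summarizer scaffold.
# call_model and web_search are hypothetical placeholders, not real APIs.
from dataclasses import dataclass

def call_model(prompt: str) -> str:
    """Hypothetical LLM call; swap in a real API client here."""
    raise NotImplementedError

def web_search(query: str) -> list[str]:
    """Hypothetical search call; swap in e.g. a Firecrawl wrapper here."""
    raise NotImplementedError

@dataclass
class SubagentResult:
    direction: str
    key_findings: str  # distilled sources + the Subagent's own analysis

def run_subagent(direction: str, max_iterations: int = 3) -> SubagentResult:
    """One research direction: search, analyse, optionally replace self with a distilled successor."""
    context = direction
    for _ in range(max_iterations):
        query = call_model(f"Propose a search query for this research direction:\n{context}")
        results = web_search(query)
        analysis = call_model(
            "Evaluate these results: which are relevant, is this direction promising, "
            f"and what follow-up questions are worth asking?\nResults: {results}\nContext: {context}"
        )
        if "FOLLOW_UP:" not in analysis:  # placeholder convention for "no follow-up needed"
            return SubagentResult(direction, analysis)
        # Distill the current findings and hand them to the next-iteration Subagent.
        context = call_model(f"Distill the key findings and the follow-up question:\n{analysis}")
    return SubagentResult(direction, context)

def outer_loop(research_prompt: str, max_rounds: int = 3) -> str:
    """Decompose, spawn Subagents, Evaluate, iterate, then Summarize."""
    directions = call_model(
        f"Decompose this into parallel research directions, one per line:\n{research_prompt}"
    ).splitlines()
    evaluator_notes: list[str] = []
    for _ in range(max_rounds):
        results = [run_subagent(d) for d in directions if d.strip()]
        evaluation = call_model(
            "Integrate these findings. Has the research prompt been satisfactorily addressed? "
            "If not, list follow-up directions, one per line.\n"
            + "\n\n".join(r.key_findings for r in results)
        )
        evaluator_notes.append(evaluation)  # goes into the Outer Loop's context, without source documents
        if "TERMINATE" in evaluation:  # placeholder convention for the Evaluator choosing to stop
            break
        directions = evaluation.splitlines()  # next generation of Subagents
    # Summarizer: outline first, then each section, then patch them together.
    outline = call_model("Generate a high-level report outline:\n" + "\n".join(evaluator_notes))
    sections = [
        call_model(f"Write out this report section:\n{s}\nUsing:\n" + "\n".join(evaluator_notes))
        for s in outline.splitlines() if s.strip()
    ]
    return "\n\n".join(sections)
```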
Here’s what this complicated setup is supposed to achieve:
Do a pass over an actually large, diverse amount of sources. Most such web-search tools (Google’s, DeepSeek’s, or this shallow replication) are basically only allowed to make one search query, and then they have to contend with whatever got dumped into their context window. If the first search query turns out to be poorly targeted, in a way that becomes obvious only after looking through the results, the AI is out of luck.
Avoid falling into “rabbit holes”, i. e. arcane-and-useless subproblems the model becomes obsessed with. Subagents would be allowed to fall into them, but presumably the Evaluator steps, with a bird’s-eye view of the picture, would recognize that as a failure mode and not recommend that the Outer Loop pursue it.
Attempt to patch together a “bird’s eye view” on the entirety of the results given the limitations of the context window. Subagents and Evaluators would use their limited scope to figure out what results are most important, then provide summaries + recommendations based on what’s visible from their limited vantage points. The Outer Loop and then the Summarizer instances, prompted with information-dense distillations of what’s visible from each lower-level vantage point, should effectively be looking at (a decent approximation of) the full scope of the problem.
Has anything like this been implemented in the open-source community or one of the countless LLM-wrapper startups? I would expect so, since it seems like an obvious thing to try + the old AutoGPT scaffolds worked in a somewhat similar manner… But it’s possible the market’s totally failed to deliver.
It should be relatively easy to set up using Flowise plus e. g. Firecrawl. If nothing like this has been implemented, I’m putting a ~~$500~~ $250 bounty on it.

Edit: Alright, this seems like something very close to it.
I really wonder how much of the perceived performance improvement comes from agent-y training, as opposed to not sabotaging the format of the model’s answer + letting it do multi-turn search.
Compared to most other LLMs, Deep Research is able to generate reports up to 10k words, which is very helpful for providing comprehensive-feeling summaries. And unlike Google’s analogue, it’s not been fine-tuned to follow some awful deliberately shallow report template[1].
In other words: I wonder how much of Deep Research’s capability can be replicated by putting Claude 3.5.1 in some basic agency scaffold where it’s able to do multi-turn web search, and is then structurally forced to generate a long-form response (say, by always prompting with something like “now generate a high-level report outline”, then with “now write section 1”, “now write section 2”, and then just patching those responses together).
Have there been some open-source projects that already did so? If not, anyone willing to try? This would provide a “baseline” for estimating how well OpenAI’s agency training actually works, i. e. how much of the improvement comes from it.
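For concreteness, a minimal sketch of such a baseline scaffold, assuming hypothetical `ask_claude` and `search` wrappers around the model and some search API (the prompts and the line-per-section convention are placeholders):

```python
# Baseline: multi-turn web search, then a structurally forced long-form report.
def ask_claude(prompt: str) -> str:
    """Hypothetical wrapper around the Claude API."""
    raise NotImplementedError

def search(query: str) -> list[str]:
    """Hypothetical wrapper around a web-search API."""
    raise NotImplementedError

def baseline_deep_research(question: str, search_rounds: int = 5) -> str:
    notes: list[str] = []
    for _ in range(search_rounds):
        query = ask_claude(
            "Given the question and the notes so far, propose the next search query.\n"
            f"Question: {question}\nNotes: {notes}"
        )
        notes.append(ask_claude(f"Summarize what is relevant in these results:\n{search(query)}"))
    # Force a long-form answer: outline first, then each section, then patch together.
    outline = ask_claude(f"Generate a high-level report outline for:\n{question}\nNotes:\n{notes}")
    sections = [
        ask_claude(f"Write the report section '{line}', using the notes:\n{notes}")
        for line in outline.splitlines() if line.strip()
    ]
    return "\n\n".join(sections)
```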
Subjectively, its use as a research tool is limited—it found only 18 sources in a five-minute search
Another test to run: if you give those sources to Sonnet 3.5.1 (or a Gemini model, if Sonnet’s context window is too small) and ask it to provide a summary, how far from Deep Research’s does it feel in quality/insight?
[1] Which is the vibe I got from Google’s Deep Research when I’d been experimenting. I think its main issue isn’t any lack of capability, but that the fine-tuning dataset for how the final reports are supposed to look had been bad: it’s been taught to have an atrocious aesthetic taste.
Ah, I think I see my error.
Consider a CoT of the form “[problem statement]: …........”, which always outputs dots regardless of the problem statement. Can that still encode useful computations, with the information being passed through intermediate self-attention? Yes, but: this only increases parallel-computing capabilities, not serial-computing capabilities.
And in the framework I’d outlined, under the Paraphraser, for the purposes of scheme-y thinking, even a legible CoT would consist of such “filler” tokens.
So yep, you’re right, this does force all scheming to be taking place within one forward pass.
I don’t see why it would be harder to do.
Suppose you tasked a model with refactoring some database, and it secretly wants to sabotage its functionality. It outputs a CoT segment that goes, “I need to design unit tests for this module”. Each word of this CoT is mapped to some internal representation, and the internal representation consists of “on-task” human-intended reasoning $y_i$, and misaligned scheming $z_i$ (corresponding to, e. g., a tactical decision to fail to include unit tests for some subtle issue). “I” is mapped to $(y_1, z_1)$, “need” is mapped to $(y_2, z_2)$, etc. Each $y_{i+1}$ iterates on $y_i$, each $z_{i+1}$ iterates on $z_i$; we have a sequence of serial computations.
In other words: as the model outputs, “I need to design unit tests for this module”, there’s an unspoken subtext of “I need to sabotage the unit tests here”, and the model packs nine words’ worth of serial computations into both the “textual” and the “subtextual” aspects of it.
Now you use the Paraphraser on the string, generating alternatives like “the next step is to generate some unit tests” and “several unit tests for this module are needed here”. The intent, if I understand you correctly, is that the semantic-equivalence-but-syntactic-difference would ensure that the $i$-th token of this phrase is still mapped to $y_i$, as if the paraphrasing didn’t happen, but that the mapping to $z_i$ would be broken.
Except, why would that happen? What privileges the $y$-representations over the $z$-representations? Why would $y_i$ be preserved whereas $z_i$ is broken, instead of both of them remaining functional (and so preserving scheming), or both breaking (and so compromising the model’s capabilities)?
In other words: We’re basically just teaching the model to fail to distinguish between equivalent-to-a-human rephrasings. Why would this lead to the task-related representations becoming indifferent to the specifics of a phrasing whereas the scheming-related representations remain sensitive to it? Why wouldn’t this just map the subtextual “I need to sabotage the unit tests here” to similarly semantically-equivalent-but-syntactically-different “let us compromise the relevant unit-test suite” or “I shall ensure the unit tests don’t work properly”?
I expect it’d just lead, to wit, to the model’s computations becoming indifferent to the details of a phrasing in full generality. And if the model were inclined to associate the general idea of generating unit tests with tactically screwing the task up, the model would come up with this idea regardless of the details of a phrasing.
It wouldn’t interfere with its ability to engage in scheming-relevant serial computations, much like it doesn’t interfere with its ability to engage in task-relevant serial computations, for the exact same reasons.
Some thoughts on protecting against LLM steganography.
tl;dr: I suspect the Paraphraser idea is near-completely ineffectual as you scale the capabilities, both for scheming and for general CoT interpretability.
In the reasoning-as-communication framework, the amount of information a fixed-length message/CoT can transfer is upper-bounded by information-theoretic constraints. That is, if we have a ten-bit message, it cannot communicate more than ten bits of information: it cannot always uniquely specify one internal state of an AI model from more than $2^{10}$ internal states.
However, the amount of information is not lower-bounded. Depending on the decoder, a ten-bit message can communicate between zero and ten bits of information. (A decoder that’s just a constant function recovers zero bits, a decoder that only pays attention to the first bit recovers only one bit, et cetera.)
The idea of steganography is that, if we pick a CoT string, and run both an AI model’s internal decoders and a human’s language-comprehension decoders on it, the AI model would extract more bits of information from it. Paraphrasing, humans’ decoders are “coarser”: they have broader equivalence classes of messages that map to the same internal representation, compared to AI models.
For example, “a dog crossed the street” and “a dog traveled across the street” might map to the same internal representation for a human, but to different ones for an AI model. Similarly, things like “quantum butterflies juggled metaphysical spaghetti” and “ambitious lampshade collected vintage moonbeams” can map to the same “nonsense” internal representation in humans, but have some useful meaning for AI models. The bits a human fails to make use of correspond to “steganography” or “superhuman latent knowledge”.
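To put that framing into symbols (my own gloss; the notation below is not from any of the proposals): let $S$ be the writer-model’s internal state, $C$ the CoT it emits, and $d_{\mathrm{AI}}$, $d_{\mathrm{H}}$ the AI’s and the human’s decoders.

```latex
% Information recovered by either decoder is capped by the message itself:
\[
  I\bigl(S;\, d_{\mathrm{AI}}(C)\bigr) \;\le\; H(C) \;\le\; |C|\ \text{bits},
  \qquad
  I\bigl(S;\, d_{\mathrm{H}}(C)\bigr) \;\le\; H(C).
\]
% "Steganography" / superhuman latent knowledge corresponds to the surplus
\[
  I\bigl(S;\, d_{\mathrm{AI}}(C)\bigr) \;-\; I\bigl(S;\, d_{\mathrm{H}}(C)\bigr) \;>\; 0,
\]
% i.e. the AI's decoder separates writer-states that the human decoder lumps
% into a single equivalence class.
```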
Proposals such as Shoggoth + Paraphraser + Face aim to destroy any information in CoTs that cannot be recovered by humans, by mapping each CoT string to the equivalence class of strings that all map to the same internal representation in humans.
However, there’s an issue: what if the empty string already communicates plenty of steganographically encoded information that humans aren’t privy to? That is: what if some knowledge is hard-coded into an AI model, such that it’s elicited on any activation, and leads to it converging towards reaching the same hidden conclusions even after we destroy its ability to sneak-in additional query-relevant information?
For example:
If the AI model implicitly learned the Theory of Everything, that knowledge is going to be activated on any prompt – and so, no amount of paraphrasing is going to force it to talk about the Theory of Everything in humanese. Communicating the ToE from one instance of the model to its future self takes up zero bits.
If the AI model has internalized some hidden goals, the heuristics for pursuing those goals are likewise going to be activated on any prompt, and it would engage in massively productive hidden scheming on any prompt. Importantly, the information-bottleneck constraints imposed by the Paraphraser wouldn’t even be hitting the hidden-plotting compartment specifically. They’d just uniformly lower the model’s computational power.
E. g., in the Shoggoth + Paraphraser + Face paradigm, that’d just mean the model learns to think in longer CoTs (because it needs longer messages to communicate the same amount of information), not to do fewer “sneaky” computations.[1]
Like, imagine that we have two different internal-representation sets: $Y$, containing problem-solving strategies for the overt task we’re presenting the model with, and $Z$, related to some misaligned scheming. We would like to constrain the model such that it cannot plan about $Z$; such that its CoTs contain zero bits for narrowing down which state out of all of $Z$ it must pick.
This assumes something like the following model: A given CoT string communicates some $n$ bits of information. Of them, $m \le n$ bits are human-legible. The model then follows the algorithm of: use the $m$ task-relevant bits on picking a state out of $Y$, and use any leftover $n - m$ bits to pick a state out of $Z$. We then destroy the string’s information content until $n = m$ (via the Paraphraser). Now the model doesn’t have the ability to encode any non-human-legible bits, and if it uses any bits on picking a state out of $Z$, they would necessarily be human-legible, and we can spot the scheming by reading the CoT.
But that’s not how this works. Instead, the model picks both simultaneously: a CoT is directly mapped to a $(Y, Z)$ two-tuple. Which means that if the model can narrow down the state of $Y$ to one out of $2^n$, it can narrow down the state of $Z$ to one out of $2^n$ in the same move. Even if we scramble the CoT until it communicates no more bits than a human can extract out of it, that doesn’t differentially constrain the model’s ability to think about scheming (= communicate scheming-relevant information to its future self).
It’s a uniform communication constraint/computational-complexity penalty.
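In the notation above (again my own reconstruction), the point is that the Paraphraser bounds the information the CoT carries about the joint state, not about the scheming component specifically:

```latex
% C = the paraphrased CoT; (Y, Z) = the joint task/scheming state it selects.
\[
  I\bigl(C;\, Y, Z\bigr) \;\le\; H(C) \;\approx\; m\ \text{human-legible bits},
\]
% and nothing in this bound pushes I(C; Z) to zero rather than I(C; Y):
% the budget is shared between the two, i.e. the constraint is uniform.
```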
Generalizing beyond scheming: if the model were inclined to not put any of its thoughts into humanese (due to superhuman latent knowledge, say), the Paraphraser wouldn’t address this issue either. If a model’s plans contain some catastrophic side-effects (which aren’t even malign, just something we didn’t tell it not to cause), the Paraphraser wouldn’t reveal that either.
@Daniel Kokotajlo, thoughts?
(This doesn’t touch on the Face part of the proposal, i. e., the idea that we want to not pressure the model to conceal any scheme-y cognition in the hopes it’d naively reveal them. But if there’s some pressure to conceal that, the Paraphraser would not help with it at all. It’d just teach the model to map humanese-equivalent strings to equivalent scheming-states, which isn’t particularly useful.)
[1] Arguably, if you also penalize the CoT length, this should create pressure towards throwing out any computations unrelated to the object-level problem the model’s being trained on… But then the whole “paraphraser” bit doesn’t actually do anything: you can achieve the same outcome by just directly penalizing the CoT length.
Which, inasmuch as DL training loops use some form of simplicity prior, should be happening by default anyway. I. e., this doesn’t address the pressures that we expect to lead to steganography/hidden plotting despite the simplicity prior: which is to say, arguments that steganography/plotting are the actual optimal solutions to the training problem.
Consider the Theory of Everything example: those hidden computations are not actually “irrelevant computations” that we need to displace, they’re directly useful. Similarly, any misaligned mesa-objectives the model forms would presumably be convergently useful for performance on the training objective: that would be why they formed to begin with. So the information-bottleneck trickery would do ~nothing against them.
Such an entity as described could absolutely be an existential threat to humanity
I agree. I think you don’t even need most of the stuff on the “superhuman” list, the equivalent of a competent IQ-130 human upload probably does it, as long as it has the speed + self-copying advantages.
The problem with this neat picture is reward-hacking. This process wouldn’t optimize for better performance on fuzzy tasks, it would optimize for performance on fuzzy tasks that looks better to the underlying model. And much like RLHF doesn’t scale to superintelligence, this doesn’t scale to superhuman fuzzy-task performance.
It can improve the performance a bit. But once you ramp up the optimization pressure, “better performance” and “looks like better performance” would decouple from each other and the model would train itself into idiosyncratic uselessness. (Indeed: if it were this easy, doesn’t this mean you should be able to self-modify into a master tactician or martial artist by running some simulated scenarios in your mind, improving without bound, and without any need to contact reality?)
… Or so my intuition goes. It’s possible that this totally works for some dumb reason. But I don’t think so. RL has a long-standing history of problems with reward-hacking, and LLMs’ judgement is one of the most easily hackable things out there.
(Note that I’m not arguing that recursive self-improvement is impossible in general. But RLAIF, specifically, just doesn’t look like the way.)
The above toy model assumed that we’re picking one signal at a time, and that each such “signal” specifies the intended behavior for all organs simultaneously...
… But you’re right that the underlying assumption there was that the set of possible desired behaviors is discrete (i. e., that X in “kidneys do X” is a discrete variable, not a vector of reals). That might’ve indeed assumed me straight out of the space of reasonable toy models for biological signals, oops.
Yeah but if something is in the general circulation (bloodstream), then it’s going everywhere in the body. I don’t think there’s any way to specifically direct it.
The point wouldn’t be to direct it, but to have different mixtures of chemicals (and timings) to mean different things to different organs.
Loose analogy: Suppose that the intended body behaviors (“kidneys do X, heart does Y, brain does Z” for all combinations of X, Y, Z) are latent features, basic chemical substances and timings are components of the input vector, and there are dramatically more intended behaviors than input-vector components. Can we define the behavior-controlling function of organs (distributed across organs) such that, for any intended body behavior, there’s a signal that sets the body into approximately this state?
It seems that yes. The number of almost-orthogonal vectors in $n$ dimensions scales exponentially with $n$, so we simply need to make the behavior-controlling function sensitive to these almost-orthogonal directions, rather than the chemical-basis vectors. The mappings from the input vector to the output behaviors, for each organ, would then be some complicated mixtures, not a simple “chemical A sets all organs into behavior X”.
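As a quick numerical illustration of the almost-orthogonality claim (my own toy check, not part of the original argument): random unit vectors in a high-dimensional space have near-zero pairwise cosine similarity, so many more roughly-distinguishable “signal directions” than basis chemicals can coexist.

```python
# Toy check: random directions in high dimensions are nearly orthogonal.
import numpy as np

rng = np.random.default_rng(0)
dim = 1000          # number of "basis chemicals" (input-vector components)
n_signals = 10_000  # number of distinct intended body behaviors, >> dim

signals = rng.normal(size=(n_signals, dim))
signals /= np.linalg.norm(signals, axis=1, keepdims=True)

# Pairwise cosine similarities over a subsample (the full matrix is large).
sample = signals[:2000]
cos = sample @ sample.T
np.fill_diagonal(cos, 0.0)
print("max |cosine| among sampled pairs:", np.abs(cos).max())  # typically < 0.2
```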
This analogy seems flawed in many ways, but I think something directionally-like-this might be happening?
Have you looked at samples of CoT of o1, o3, deepseek, etc. solving hard math problems?
Certainly (experimenting with r1's CoTs right now, in fact). I agree that they’re not doing the brute-force stuff I mentioned; that was just me outlining a scenario in which a system “technically” clears the bar you’d outlined, yet I end up unmoved (I don’t want to end up goalpost-moving).
Though neither are they being “strategic” in the way I expect they’d need to be in order to productively use a billion-token CoT.
Anyhow, this is nice, because I do expect that probably something like this milestone will be reached before AGI
Yeah, I’m also glad to finally have something concrete-ish to watch out for. Thanks for prompting me!
Not for math benchmarks. Here’s one way it can “cheat” at them: suppose that the CoT would involve the model generating candidate proofs/derivations, then running an internal (learned, not hard-coded) proof verifier on them, and either rejecting the candidate proof and trying to generate a new one, or outputting it. We know that this is possible, since we know that proof verifiers can be compactly specified.
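As a sketch of what that “cheating” loop amounts to (my own illustration; the internal verifier would of course be learned, not an explicit function like this):

```python
# Guess-and-verify: sample candidate proofs, keep the first one the verifier accepts.
from typing import Callable, Optional

def solve_by_guess_and_verify(
    generate_candidate: Callable[[str], str],  # stand-in for the model's proposal step
    verify: Callable[[str, str], bool],        # stand-in for the learned proof verifier
    problem: str,
    max_attempts: int = 1000,
) -> Optional[str]:
    for _ in range(max_attempts):
        candidate = generate_candidate(problem)
        if verify(problem, candidate):
            return candidate  # output the verified proof
    return None               # compute budget exhausted
```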
This wouldn’t actually show “agency” and strategic thinking of the kinds that might generalize to open-ended domains and “true” long-horizon tasks. In particular, this would mostly fail the condition (2) from my previous comment.
Something more open-ended and requiring “research taste” would be needed. Maybe a comparable performance on METR’s benchmark would work for this (i. e., the model can beat a significantly larger fraction of it at 1 billion tokens compared to 1 million)? Or some other benchmark that comes closer to evaluating real-world performance.
Edit: Oh, math-benchmark performance would convince me if we get access to a CoT sample and it shows that the model doesn’t follow the above “cheating” approach, but instead approaches the problem strategically (in some sense). (Which would also require this CoT not to be hopelessly steganographied, obviously.)
Prooobably ~simultaneously, but I can maybe see it coming earlier and in a way that isn’t wholly convincing to me. In particular, it would still be a fixed-length task; much longer-length than what the contemporary models can reliably manage today, but still hackable using poorly-generalizing “agency templates” instead of fully general “compact generators of agenty behavior” (which I speculate humans to have and RL’d LLMs not to). It would be some evidence in favor of “AI can accelerate AI R&D”, but not necessarily “LLMs trained via SSL+RL are AGI-complete”.
Actually, I can also see it coming later. For example, suppose the capability researchers invent some method for reliably and indefinitely extending the amount of serial computation a reasoning model can productively make use of, but the compute or memory requirements grow very fast with the length of a CoT. Fairly solid empirical evidence and theoretical arguments in favor of boundless scaling could then appear quickly, well before the algorithms are made efficient enough to (1) handle weeks-long CoTs and/or (2) allow wide adoption (thus making it available to you).
I think the second scenario is more plausible, actually.
I wish we had something to bet on better than “inventing a new field of science,”
I’ve thought of one potential observable that is concrete, should be relatively low-capability, and should provoke a strong update towards your model for me:
If there is an AI model such that the complexity of R&D problems it can solve (1) scales basically boundlessly with the amount of serial compute provided to it (or to a “research fleet” based on it), (2) scales much faster with serial compute than with parallel compute, and (3) the required amount of human attention (“babysitting”) is constant or grows very slowly with the amount of serial compute.
This attempts to directly get at the “autonomous self-correction” and “ability to think about R&D problems strategically” ideas.
I’ve not fully thought through all the possible ways reality could Goodhart this benchmark, i. e. “technically” pass it in a way I find unconvincing. For example, if I’d failed to include condition (2), o3 would probably have already “passed” it (since it potentially achieved better performance on ARC-AGI and FrontierMath by sampling thousands of CoTs and then outputting the most frequent answer). There might be other loopholes like this...
But it currently seems reasonable and True-Name-y to me.
Well, he didn’t do it yet either, did he? His new announcement is, likewise, just that: an announcement. Manifold is still 35% on him not following through on it, for example.
“Intensely competent and creative”, basically, maybe with a side of “obsessed” (with whatever they’re cracked at).
Supposedly Trump announced that back in October, so it should already be priced in.
(Here’s my attempt at making sense of it, for what it’s worth.)
Here’s a potential interpretation of the market’s apparent strange reaction to DeepSeek-R1 (shorting Nvidia).
I don’t fully endorse this explanation, and the shorting may or may not have actually been due to Trump’s tariffs + insider trading, rather than DeepSeek-R1. But I see a world in which reacting this way to R1 arguably makes sense, and I don’t think it’s an altogether implausible world.
If I recall correctly, the amount of money globally spent on inference dwarfs the amount of money spent on training. Most economic entities are not AGI labs training new models, after all. So the impact of DeepSeek-R1 on the pretraining scaling laws is irrelevant: sure, it did not show that you don’t need bigger data centers to get better base models, but that’s not where most of the money was anyway.
And my understanding is that, on the inference-time scaling paradigm, there isn’t yet any proven method of transforming arbitrary quantities of compute into better performance:
Reasoning models are bottlenecked on the length of CoTs that they’ve been trained to productively make use of. They can’t fully utilize even their context windows; the RL pipelines just aren’t up to that task yet. And if that bottleneck were resolved, the context-window bottleneck would be next: my understanding is that infinite context/”long-term” memories haven’t been properly solved either, and it’s unknown how they’d interact with the RL stage (probably they’d interact okay, but maybe not).
o3 did manage to boost its ARC-AGI and (maybe?) FrontierMath performance by… generating a thousand guesses and then picking the most common one...? But who knows how that really worked, and how practically useful it is. (See e. g. this, although that paper examines a somewhat different regime.)
Agents, from Devin to Operator to random open-source projects, are still pretty terrible. You can’t set up an ecosystem of agents in a big data center and let them rip, such that the ecosystem’s power scales boundlessly with the data center’s size. For all but the most formulaic tasks, you still need a competent human closely babysitting everything they do, which means you’re still mostly bottlenecked on competent human attention.
Suppose that you don’t expect the situation to improve: that the inference-time scaling paradigm would hit a ceiling pretty soon, or that it’d converge to distilling search into forward passes (such that the end users end up using very little compute on inference, like today), and that agents just aren’t going to work out the way the AGI labs promise.
In such a world, a given task can either be completed automatically by an AI for some fixed quantity of compute X, or it cannot be completed by an AI at all. Pouring ten times more compute on it does nothing.
In such a world, if it were shown that the compute needs of a task can be met with ten times less compute than previously expected, this would decrease the expected demand for compute.
The fact that capable models can be run locally might increase the number of people willing to use them (e. g., those very concerned about data privacy), as might the ability to automatically complete 10x as many trivial tasks. But it’s not obvious that this demand spike will be bigger than the simultaneous demand drop.
And I, at least, when researching ways to set up DeepSeek-R1 locally, found myself more drawn to the “wire a bunch of Macs together” option, compared to “wire a bunch of GPUs together” (due to the compactness). If many people are like this, it makes sense why Nvidia is down while Apple is (slightly) up. (Moreover, it’s apparently possible to run the full 671b-parameter version locally, and at a decent speed, using a pure RAM+CPU setup; indeed, it appears cheaper than mucking about with GPUs/Macs, just $6,000.)
This world doesn’t seem outright implausible to me. I’m bearish on agents and somewhat skeptical of inference-time scaling. And if inference-time scaling does deliver on its promises, it’ll likely go the way of search-and-distill.
On balance, I don’t actually expect the market to have any idea what’s going on, so I don’t know that its reasoning is this specific flavor of “well-informed but skeptical”. And again, it’s possible the drop was due to Trump, nothing to do with DeepSeek at all.
But as I’d said, this reaction to DeepSeek-R1 does not seem necessarily irrational/incoherent to me.
You’re not accounting for enemy action. They couldn’t have been sure, at the outset, how successful the AI Notkilleveryoneism faction would be at raising the alarm, and in general, how blatant the risks would become to outsiders as capabilities progressed. And they have been intimately familiar with the relevant discussions, after all.
So they might’ve overcorrected, and considered that the “strategic middle ground” would be to admit the risk is plausible (but not as certain as the “doomers” say), rather than to deny it (which they might’ve expected to become a delusional-looking position in the future, so not a PR-friendly stance to take).
Or, at least, I think this could’ve been a relevant factor there.