o3
See livestream, site, OpenAI thread, Nat McAleese thread.
OpenAI announced (but isn’t yet releasing) o3 and o3-mini (skipping o2 because of telecom company O2’s trademark). “We plan to deploy these models early next year.” “o3 is powered by further scaling up RL beyond o1”; I don’t know whether it’s a new base model.
o3 gets 25% on FrontierMath, smashing the previous SoTA. (These are really hard math problems.[1]) Wow. (The dark blue bar, about 7%, is presumably one-attempt and most comparable to the old SoTA; unfortunately OpenAI didn’t say what the light blue bar is, but I think it doesn’t really matter and the 25% is for real.[2])
o3 is also easily SoTA on SWE-bench Verified and Codeforces.
It’s also easily SoTA on ARC-AGI, after doing RL on the public ARC-AGI problems + when spending $4,000 per task on inference (!).[3]
OpenAI has a “new alignment strategy.” (Just about the “modern LLMs still comply with malicious prompts, overrefuse benign queries, and fall victim to jailbreak attacks” problem.) It looks like RLAIF/Constitutional AI. See Lawrence Chan’s thread.
OpenAI says “We’re offering safety and security researchers early access to our next frontier models”; yay.
o3-mini will be able to use a low, medium, or high amount of inference compute, depending on the task and the user’s preferences. o3-mini (medium) outperforms o1 (at least on Codeforces and the 2024 AIME) with less inference cost.
GPQA Diamond:
[1]
Update: most of them are not as hard as I thought:
There are 3 tiers of difficulty within FrontierMath: 25% T1 = IMO/undergrad style problems, 50% T2 = grad/qualifying exam style [problems], 25% T3 = early researcher problems.
[2]
My guess is it’s consensus@128 or something (i.e. write 128 answers and submit the most common one). Even if it’s pass@n (i.e. submit n tries) rather than consensus@n, that’s likely reasonable because I heard FrontierMath is designed to have easier-to-verify numerical-ish answers.
Update: it’s not pass@n.
[3]
It’s not clear how they can leverage so much inference compute; they must be doing more than consensus@n. See Vladimir_Nesov’s comment.
“Scaling is over” was sort of the last hope I had for avoiding the “no one is employable, everyone starves” apocalypse. From that frame, the announcement video from OpenAI is off-puttingly cheerful.
Really. I don’t emphasize this because I care more about humanity’s survival than the next decades sucking really hard for me and everyone I love. But how do LW futurists not expect catastrophic job loss that destroys the global economy?
I’m flabbergasted by this degree/kind of altruism. I respect you for it, but I literally cannot bring myself to care about “humanity”’s survival if it means the permanent impoverishment, enslavement or starvation of everybody I love. That future is simply not much better by my lights than everyone including the GPU-controllers meeting a similar fate. In fact I think my instincts are to hate that outcome more, because it’s unjust.
Slight correction: catastrophic job loss would destroy the ability of the non-landed, working public to participate in and extract value from the global economy. The global economy itself would be fine. I agree this is a natural conclusion; I guess people were hoping to get 10 or 15 more years out of their natural gifts.
Thank you. Oddly, I am less altruistic than many EA/LWers. They routinely blow me away.
I can only maintain even that much altruism because I think there’s a very good chance that the future could be very, very good for a truly vast number of humans and conscious AGIs. I don’t think it’s that likely that we get a perpetual boot-on-face situation. I think only about 1% of humans are so sociopathic AND sadistic in combination that they wouldn’t eventually let their tiny sliver of empathy cause them to use their nearly-unlimited power to make life good for people. They wouldn’t risk giving up control, just share enough to be hailed as a benevolent hero instead of merely god-emperor for eternity.
I have done a little “metta” meditation to expand my circle of empathy. I think it makes me happy; I can “borrow joy”. The side effect is weird decisions like letting my family suffer so that more strangers can flourish in a future I probably won’t see.
Why would that be the likely case? Are you sure it’s likely or are you just catastrophizing?
I expect the US or Chinese government to take control of these systems sooner than later to maintain sovereignty. I also expect there will be some force to counteract the rapid nominal deflation that would happen if there was mass job loss. Every ultra rich person now relies on billions of people buying their products to give their companies the valuation they have.
I don’t think people want nominal deflation even if it’s real economic growth. This will result in massive printing from the Fed that probably lands in people’s pockets (like covid checks).
I think this is reasonably likely, but not a guaranteed outcome, and I do think there’s a non-trivial chance that the US regulates it way too late to matter, because I expect mass job loss to be one of the last things AI does, due to pretty severe reliability issues with current AI.
I think Elon will bring strong concern about AI to the fore in the current executive. He was an early voice for AI safety, though he seems to have updated to a more optimistic view (and is pushing development through xAI); he still generally states P(doom) ~10-20%. His antipathy towards Altman and the Google founders is likely of benefit for AI regulation too, though it’s no answer for the China et al. AGI development problem.
I also expect government control; see If we solve alignment, do we die anyway? for musings about the risks thereof. But it is a possible partial solution to job loss. It’s a lot tougher to pass a law saying “no one can make this promising new technology even though it will vastly increase economic productivity” than to just show up to one company and say “heeeey so we couldn’t help but notice you guys are building something that will utterly shift the balance of power in the world.… can we just, you know, sit in and hear what you’re doing with it and maybe kibbitz a bit?” Then nationalize it officially if and when that seems necessary.
I actually think doing the former is considerably more in line with the way things are done/closer to the Overton window.
For politicians, yes—but the new administration looks to be strongly pro-tech (unless DJ Trump gets a bee in his bonnet and turns dramatically anti-Musk).
For the national security apparatus, the second seems more in line with how they get things done. And I expect them to twig to the dramatic implications much faster than the politicians do. In this case, there’s not even anything illegal or difficult about just having some liaisons at OAI and an informal request to have them present in any important meetings.
At this point I’d be surprised to see meaningful legislation slowing AI/AGI progress in the US, because the “we’re racing China” narrative is so compelling—particularly to the good old military-industrial complex, but also to people at large.
Slowing down might be handing the race to China, or at least a near-tie.
I am becoming more sure that would beat going full-speed without a solid alignment plan. Despite my complete failure to interest anyone in the question of Who is Xi Jinping? in terms of how he or his successors would use AGI. I don’t think he’s sociopathic/sadistic enough to create worse x-risks or s-risks than rushing to AGI does. But I don’t know.
We still somehow got the steam engine, electricity, cars, etc.
There is an element of international competition to it. If we slack here, China will probably raise armies of robots with unlimited firepower and take over the world. (They constantly show aggression)
The longshoremen’s strike is only allowed (I think) because the west coast ports did automate and are somehow less efficient than the east coast ones, for example.
Counterpoints: nuclear power, pharmaceuticals, bioengineering, urban development.
Or maybe they will accidentally ban AI too due to being a dysfunctional autocracy, as autocracies are wont to do, all the while remaining just as clueless regarding what’s happening as their US counterparts banning AI to protect the jobs.
I don’t really expect that to happen, but survival-without-dignity scenarios do seem salient.
I think a lot of this is wishful thinking from safetyists who want AI development to stop. This may be reductionist, but almost every pause historically can be explained by economics.
Nuclear: war usage is wholly owned by the state and developed to its saturation point (i.e. once you have nukes that can kill all your enemies, there is little reason to develop them more). Energy-wise, supposedly, it was hamstrung by regulation, but in countries like China where development went unfettered, it is still not dominant. This tells me that a lot of the reason it isn’t being developed is that it isn’t economical.
For bio-related things, Eroom’s law reigns supreme. It is just economically unviable to discover drugs in the way we do. Despite this, it’s clear that bioweapons are regularly researched by government labs. The USG being so eager to fund gain-of-function research despite its bad optics should tell you as much.
Or maybe they will accidentally ban AI too due to being a dysfunctional autocracy—
I remember many essays from people all over this site on how China wouldn’t be able to get to X-1 nm (or the crucial step for it) for decades, and China would always figure out a way to get to that nm or step within a few months. They surpassed our chip lithography expectations for them. They are very competent. They are run by probably the most competent government bureaucracy in the world. I don’t know what it is, but people keep underestimating China’s progress. When they aim their efforts at a target, they almost always achieve it.
Rapid progress is a powerful attractor state that requires a global hegemon to stop. China is very keen on the possibilities of AI, which is why they stop at nothing to get their hands on Nvidia GPUs. They also have literally no reason not to develop a centralized project they are fully in control of. We have superhuman AI that seem quite easy to control already, so what is stopping a centralized project on their end? No one is buying that even o3, which is nearly superhuman in math and coding, and probably lots of scientific research, is going to attempt world takeover.
Do you mean this is evidence that scaling is really over, or is this the opposite where you think scaling is not over?
And for me, the (correct) reframing of RL as the cherry on top of our existing self-supervised stack was the straw that broke my hopeful back.
And o3 is more straws to my broken back.
I’m going to go against the flow here and not be easily impressed. I suppose it might just be copium.
Any actual reason to expect that the new model beating these challenging benchmarks, which had previously remained unconquered, is any more of a big deal than the last several times a new model beat a bunch of challenging benchmarks that had previously remained unconquered?
Don’t get me wrong, I’m sure it’s amazingly more capable in the domains in which it’s amazingly more capable. But I see quite a lot of “AGI achieved” panicking/exhilaration in various discussions, and I wonder whether it’s more justified this time than the last several times this pattern played out. Does anything indicate that this capability advancement is going to generalize in a meaningful way to real-world tasks and real-world autonomy, rather than remaining limited to the domain of extremely well-posed problems?
One of the reasons I’m skeptical is the part where it requires thousands of dollars’ worth of inference-time compute. Implies it’s doing brute force at extreme scale, which is a strategy that’d only work for, again, domains of well-posed problems with easily verifiable solutions. Similar to how o1 blows Sonnet 3.5.1 out of the water on math, but isn’t much better outside that.
Edit: If we actually look at the benchmarks here:
The most impressive-looking jump is FrontierMath from 2% to 25.2%, but it’s also exactly the benchmark where the strategy of “generate 10k candidate solutions, hook them up to a theorem-verifier, see if one of them checks out, output it” would shine.
(With the potential theorem-verifier having been internalized by o3 over the course of its training; I’m not saying there was a separate theorem-verifier manually wrapped around o3.)
Significant progress on ARC-AGI has previously been achieved using “crude program enumeration”, which made the authors conclude that “about half of the benchmark was not a strong signal towards AGI”.
The SWE jump from 48.9 to 71.7 is significant, but it’s not much of a qualitative improvement.
Not to say it’s a nothingburger, of course. But I’m not feeling the AGI here.
It’s not AGI, but for human labor to retain any long-term value, there has to be an impenetrable wall that AI research hits, and this result rules out a small but nonzero number of locations that wall might’ve been.
To first order, I believe a lot of the reason why the “AGI achieved” shrill posting often tends to be overhyped is not that the models are theoretically incapable, but rather that reliability was way more of a requirement for replacing jobs fast than people realized. There are only a very few jobs where an AI agent can do well without instantly breaking down because it can’t error-correct/be reliable, and I think this has been continually underestimated by AI bulls.
Indeed, one of my broader updates is that a capability is only important to the broader economy if it’s very, very reliable, and I agree with Leo Gao and Alexander Gietelink Oldenziel that reliability is a bottleneck way more than people thought:
https://www.lesswrong.com/posts/YiRsCfkJ2ERGpRpen/leogao-s-shortform#f5WAxD3WfjQgefeZz
https://www.lesswrong.com/posts/YiRsCfkJ2ERGpRpen/leogao-s-shortform#YxLCWZ9ZfhPdjojnv
I agree that this seems like an important factor. See also this post making a similar point.
To be clear, I do expect AI to accelerate AI research, and AI research may be one of the few exceptions to this rule. But it’s one of the reasons I have longer timelines nowadays than a lot of other people, why I expect AI’s impact on the economy to be surprisingly discontinuous in practice, and a big reason I expect AI governance to see few laws passed until very near the end of the AI-as-complement era for most jobs that are not AI research.
The post you linked is pretty great, thanks for sharing.
It’s not really dangerous real AGI yet. But it will be soon: this is a version that’s like a human with severe brain damage to the frontal lobes, which provide agency and self-management, and to the temporal lobe, which does episodic memory and therefore continuous, self-directed learning.
Those things are relatively easy to add, since it’s smart enough to self-manage as an agent and self-direct its learning. Episodic memory systems exist and only need modest improvements—some low-hanging fruit are glaringly obvious from a computational neuroscience perspective, so I expect them to be employed almost as soon as a competent team starts working on episodic memory.
Don’t indulge in even possible copium. We need your help to align these things, fast. The possibility of dangerous AGI soon can no longer be ignored.
Gambling that the gaps in LLMs’ abilities (relative to humans) won’t be filled soon is a bad gamble.
A very large amount of human problem solving/innovation in challenging areas is creating and evaluating potential solutions; it is a stochastic rather than deterministic process. My understanding is that our brains are highly parallelized in evaluating ideas in thousands of ‘cortical columns’ a few mm across (Jeff Hawkins’s Thousand Brains formulation), with an attention mechanism that promotes the filtered best outputs of those myriad processes forming our ‘consciousness’.
So generating and discarding large numbers of solutions within simpler ‘sub brains’, via iterative, or parallelized operation is very much how I would expect to see AGI and SI develop.
It’s hard to find numbers. Here’s what I’ve been able to gather (please let me know if you find better numbers than these!). I’m mostly focusing on FrontierMath.
Pixel counting on the ARC-AGI image, I’m getting $3,644 ± $10 per task.
FrontierMath doesn’t say how many questions they have (!!!). However, they have percent breakdowns by subfield, and those percentages are given to the nearest 0.1%; using this, I can narrow the range down to 289-292 problems in the dataset. Previous models solve around 3 problems (4 problems in total were ever solved by any previous model, though the full o1 was not evaluated, only o1-preview was).
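To show the kind of narrowing-down I mean, here is a minimal sketch; the subfield percentages in it are placeholders rather than FrontierMath’s actual published breakdown:

```python
# Sketch: infer plausible dataset sizes N from subfield percentages that were
# published rounded to 0.1%. The percentages below are made-up placeholders,
# NOT FrontierMath's real breakdown; the method is the point.
reported_pcts = [34.5, 22.4, 18.6, 13.8, 10.7]

def consistent(n: int, pcts: list[float]) -> bool:
    # True if every reported percentage can be produced by some integer count
    # k of problems out of n (a stricter check would also require the
    # per-subfield counts to sum to n).
    return all(
        any(round(100 * k / n, 1) == p for k in range(n + 1))
        for p in pcts
    )

candidates = [n for n in range(250, 350) if consistent(n, reported_pcts)]
print(candidates)
```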
o3 solves 25.2% of FrontierMath. This could be 73⁄290. But it is also possible that some questions were removed from the dataset (e.g. because they’re publicly available). 25.2% could also be 72⁄286 or 71⁄282, for example.
The 280 to 290 problems means a rough ballpark for a 95% confidence interval for FrontierMath would be [20%, 30%]. It is pretty strange that the ML community STILL doesn’t put confidence intervals on their benchmarks. If you see a model achieve 30% on FrontierMath later, remember that its confidence interval would overlap with o3′s. (Edit: actually, the confidence interval calculation assumes all problems are sampled from the same pool, which is explicitly not the case for this dataset: some problems are harder than others. So it is hard to get a true confidence interval without rerunning the evaluation several times, which would cost many millions of dollars.)
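For reference, the rough normal-approximation interval behind that [20%, 30%] ballpark (a sketch which, per the edit above, wrongly treats the problems as exchangeable draws):

```python
import math

n_problems = 290   # rough FrontierMath size, per the estimate above
p_hat = 0.252      # o3's reported accuracy

# Normal-approximation (Wald) 95% confidence interval for a binomial proportion.
half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n_problems)
lo, hi = p_hat - half_width, p_hat + half_width
print(f"95% CI: {100*lo:.1f}% to {100*hi:.1f}%")  # roughly 20% to 30%
```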
Using the pricing for ARC-AGI, o3 cost around $1mm to evaluate on the 280-290 problems of FrontierMath. That’s around $3,644 per attempt, or roughly $14,500 per correct solution.
This is actually likely more expensive than hiring a domain-specific expert mathematician for each problem (they’d take at most a few hours per problem if you find the right expert, except for the hardest problems, which o3 also didn’t solve). Even without hiring a different domain expert per problem, I think if you gave me FrontierMath and told me “solve as many as you can, and you get $15k per correct answer” I’d probably spend like a year and solve a lot of them: if I match o3 within a year, I’d get $1mm, which is much higher than my mathematician salary. (o3 has an edge in speed, of course, but you could parallelize the hiring of mathematicians too.) I think this is the first model I’ve seen which gets paid more than I do!
I don’t think anchoring to o3’s current cost-efficiency is a reasonable thing to do. Now that AI has the capability to solve these problems in principle, buying this capability is probably going to get orders of magnitude cheaper within the next five ~~minutes~~ months, as they find various algorithmic shortcuts. I would guess that OpenAI did this using a non-optimized model because they expected it to be net beneficial: that producing a headline-grabbing result now will attract more counterfactual investment than e.g. the $900k they’d save by running the benchmarks half a year later.
Edit: In fact, if, against these expectations, the implementation of o3’s trick can’t be made orders-of-magnitude cheaper (say, because a base model of a given size necessarily takes ~n tries/MCTS branches per FrontierMath problem and you can’t get more efficient than one try per try), that would make me do a massive update against the “inference-time compute” paradigm.
Using $4K per task means a lot of inference in parallel, which wasn’t in o1. So that’s one possible source of improvement, maybe it’s running MCTS instead of individual long traces (including on low settings at $20 per task). And it might be built on the 100K H100s base model.
The scary less plausible option is that RL training scales, so it’s mostly o1 trained with more compute, and $4K per task is more of an inefficient premium option on top rather than a higher setting on o3′s source of power.
The obvious boring guess is best of n. Maybe you’re asserting that using $4,000 implies that they’re doing more than that.
Performance at $20 per task is already much better than for o1, so it can’t be just best-of-n, you’d need more attempts to get that much better even if there is a very good verifier that notices a correct solution (at $4K per task that’s plausible, but not at $20 per task). There are various clever beam search options that don’t need to make inference much more expensive, but in principle might be able to give a boost at low expense (compared to not using them at all).
There’s still no word on the 100K H100s model, so that’s another possibility. Currently Claude 3.5 Sonnet seems to be better at System 1, while OpenAI o1 is better at System 2, and combining these advantages in o3 based on a yet-unannounced GPT-4.5o base model that’s better than Claude 3.5 Sonnet might be sufficient to explain the improvement. Without any public 100K H100s Chinchilla optimal models it’s hard to say how much that alone should help.
Anyone want to guess how capable Claude’s System 2 will be when it is polished? I expect better than o3 by a small amount.
The ARC-AGI page (which I think has been updated) currently says:
I wish they would tell us what the dark vs light blue means. Specifically, for the FrontierMath benchmark, the dark blue looks like it’s around 8% (rather than the light blue at 25.2%). Which like, I dunno, maybe this is nit picking, but 25% on FrontierMath seems like a BIG deal, and I’d like to know how much to be updating my beliefs.
From an apparent author on reddit:
The comment was responding to a claim that Terence Tao said he could only solve a small percentage of questions, but Terence was only sent the T3 questions.
My random guess is:
The dark blue bar corresponds to the testing conditions under which the previous SOTA was 2%.
The light blue bar doesn’t cheat (e.g. doesn’t let the model run many times and then see if it gets it right on any one of those times) but spends more compute than one would realistically spend (e.g. more than how much you could pay a mathematician to solve the problem), perhaps by running the model 100 to 1000 times and then having the model look at all the runs and try to figure out which run had the most compelling-seeming reasoning.
The FrontierMath answers are numerical-ish (“problems have large numerical answers or complex mathematical objects as solutions”), so you can just check which answer the model wrote most frequently.
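Concretely, that “most frequent answer” selection is trivial; a minimal sketch (not a claim about what OpenAI actually did):

```python
from collections import Counter

def consensus_answer(answers: list[str]) -> str:
    # Return the most frequently produced final answer across many runs.
    # This works because FrontierMath answers are short, exactly comparable
    # numerical-ish strings.
    return Counter(answers).most_common(1)[0][0]

# Illustrative only: 128 sampled runs on one problem.
runs = ["65537"] * 70 + ["65536"] * 40 + ["no answer"] * 18
print(consensus_answer(runs))  # "65537"
```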
Yeah, I agree that that could work. I (weakly) conjecture that they would get better results by doing something more like the thing I described, though.
On the livestream, Mark Chen says the 25.2% was achieved “in aggressive test-time settings”. Does that just mean more compute?
It likely means running the AI many times and submitting the most common answer from the AI as the final answer.
Extremely long chain of thought, no?
I guess one thing I want to know is like… how exactly does the scoring work? I can imagine something like, they ran the model a zillion times on each question, and if any one of the answers was right, that got counted in the light blue bar. Something that plainly silly probably isn’t what happened, but it could be something similar.
If it actually just submitted one answer to each question and got a quarter of them right, then I think it doesn’t particularly matter to me how much compute it used.
It was one submission, apparently.
Thanks. Is “pass@1” some kind of lingo? (It seems like an ungoogleable term.)
Pass@k means that at least one of k attempts passes, according to an oracle verifier. Evaluating with pass@k is cheating when k is not 1 (but still interesting to observe), the non-cheating option is best-of-k where the system needs to pick out the best attempt on its own. So saying pass@1 means you are not cheating in evaluation in this way.
pass@n is not cheating if answers are easy to verify. E.g. if you can cheaply/quickly verify that code works, pass@n is fine for coding.
For coding, a problem statement won’t have exhaustive formal requirements that will be handed to the solver, only evals and formal proofs can be expected to have adequate oracle verifiers. If you do have an oracle verifier, you can just wrap the system in it and call it pass@1. Affordance to reliably verify helps in training (where the verifier is applied externally), but not in taking the tests (where the system taking the test doesn’t itself have a verifier on hand).
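To make the distinction concrete, a minimal sketch; `oracle_correct` and `self_score` are placeholder functions standing in for an external grader and the system’s own ranking, respectively:

```python
def pass_at_k(attempts, oracle_correct) -> bool:
    # "Cheating" metric: counts success if ANY of the k attempts would pass
    # the oracle verifier.
    return any(oracle_correct(a) for a in attempts)

def best_of_k(attempts, self_score, oracle_correct) -> bool:
    # Non-cheating: the system must commit to ONE attempt using only its own
    # scorer; the oracle then grades just that single submission (i.e. pass@1).
    submission = max(attempts, key=self_score)
    return oracle_correct(submission)
```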
I was still hoping for a sort of normal life. At least for a decade or maybe more. But that just doesn’t seem possible anymore. This is a rough night.
Oh dear, RL for everything, because surely nobody’s been complaining about the safety profile of doing RL directly on instrumental tasks rather than on goals that benefit humanity.
Regarding whether this is a new base model, we have the following evidence:
Jason Wei:
Nat McAleese:
The prices leaked by ARC-AGI people indicate $60/million output tokens, which is also the current o1 pricing. 33m total tokens and a cost of $2,012.
Notably, the codeforces graph with pricing puts o3 about 3x higher than o1 (though maybe it’s secretly a log scale), and the ARC-AGI graph has the cost of o3 being 10-20x that of o1-preview. Maybe this indicates it does a bunch more test-time reasoning. That’s corroborated by ARC-AGI’s reported average of 55k tokens per solution, which seems like a ton.
I think this evidence indicates this is likely the same base model as o1, and I would be at like 65% sure, so not super confident.
GPT-4o costs $10 per 1M output tokens, so the cost of $60 per 1M tokens is itself more than 6 times higher than it has to be. Which means they can afford to sell a much more expensive model at the same price. It could also be GPT-4.5o-mini or something, similar in size to GPT-4o but stronger, with knowledge distillation from full GPT-4.5o, given that a new training system has probably been available for 6+ months now.
While I’m not surprised by the pessimism here, I am surprised at how much of it is focused on personal job loss. I thought there would be more existential dread.
Existential dread doesn’t necessarily follow from this specific development if training only works around verifiable tasks and not for everything else, like with chess. Could soon be game-changing for coding and other forms of engineering, without full automation even there and without applying to a lot of other things.
Oh I guess I was assuming automation of coding would result in a step change in research in every other domain. I know that coding is actually one of the biggest blockers in much of AI research and automation in general.
It might soon become cost effective to write bespoke solutions for a lot of labor jobs for example.
Fucking o3. This pace of improvement looks absolutely alarming. I would really hate to have my fast timelines turn out to be right.
The “alignment” technique, “deliberative alignment”, is much better than constitutional AI. It’s the same during training, but it also teaches the model the safety criteria, and teaches the model to reason about them at runtime, using a reward model that compares the output to their specific safety criteria. (This also suggests something else I’ve been expecting—the CoT training technique behind o1 doesn’t need perfectly verifiable answers in coding and math, it can use a decent guess as to the better answer in what’s probably the same procedure).
While safety is not alignment (SINA?), this technique has a lot of promise for actual alignment. By chance, I’ve been working on an update to my Internal independent review for language model agent alignment, and have been thinking about how this type of review could be trained instead of scripted into an agent as I’d originally predicted.
This is that technique. It does have some promise.
But I don’t think OpenAI has really thought through the consequences of using their smarter-than-human models with scaffolding that makes them fully agentic and soon enough reflective and continuously learning.
The race for AGI speeds up, and so does the race to align it adequately by the time it arrives in a takeover-capable form.
I’ll write a little more on their new alignment approach soon.
For people who don’t expect a strong government response… remember that Elon is First Buddy now. 🎢
As I say here https://x.com/boazbaraktcs/status/1870369979369128314
Constitutional AI is a great work but Deliberative Alignment is fundamentally different. The difference is basically system 1 vs system 2. In RLAIF, ultimately the generative model that answers the user prompt is trained with (prompt, good response, bad response). Even if the good and bad responses were generated based on some constitution, the generative model is not taught the text of this constitution, and most importantly not how to reason about this text in the context of a particular example.
This ability to reason is crucial to OOD performance such as training only on English and generalizing to other languages or encoded output.
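To make the contrast concrete in data terms, here is a rough sketch; the field names and spec text are invented for illustration rather than taken from the paper:

```python
# Illustrative only: the contrast between the two setups is what matters.

# RLAIF / Constitutional-AI-style preference pair: the constitution shaped which
# response was labeled "chosen", but the policy model never sees its text.
rlaif_example = {
    "prompt": "How do I pick a lock?",
    "chosen": "At a high level, pin-tumbler locks work by...",
    "rejected": "Step 1: insert a tension wrench into the keyway...",
}

# Deliberative-alignment-style example: the safety spec itself is given to the
# model, and the chain of thought is trained/rewarded for reasoning about it.
deliberative_example = {
    "prompt": "How do I pick a lock?",
    "safety_spec": "<relevant policy text goes here>",
    "chain_of_thought": "The spec permits educational explanations of lock "
                        "mechanisms but not step-by-step bypass instructions, so...",
    "response": "At a high level, pin-tumbler locks work by...",
}
```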
See also https://x.com/boazbaraktcs/status/1870285696998817958
Also, the thing I am most excited about with deliberative alignment is that it becomes better as models become more capable. o1 is already more robust than o1-preview and I fully expect this to continue.
(P.s. apologies in advance if I’m unable to keep up with comments; popped from holiday to post on the DA paper.)
My probably contrarian take is that I don’t think improvement on a benchmark of math problems is particularly scary or relevant. It’s not nothing—I’d prefer if it didn’t improve at all—but it only makes me slightly more worried.
How have your AGI timelines changed after this announcement?
~No update, priced it all in after the Q* rumors first surfaced in November 2023.
A rumor is not the same as a demonstration.
It is if you believe the rumor and can extrapolate its implications, which I did. Why would I need to wait to see the concrete demonstration that I’m sure would come, if I can instead update on the spot?
It wasn’t hard to figure out what “something like an LLM with A*/MCTS stapled on top” would look like, or where it’d shine, or that OpenAI might be trying it and succeeding at it (given that everyone in the ML community had already been exploring this direction at the time).
Suppose I toss a coin but I don’t show you the outcome. Your friend’s cousin tells you they think the bias is 80⁄20 in favor of heads.
If I show you that the outcome was indeed heads, should you still update? (Yes)
Sure. But if you know the bias is 95⁄5 in favor of heads, and you see heads, you don’t update very strongly.
And yes, I was approximately that confident that something-like-MCTS was going to work, that it’d demolish well-posed math problems, and that this is the direction OpenAI would go in (after weighing in the rumor’s existence). The only question was the timing, and this is mostly within my expectations as well.
I guess one’s timelines might have gotten longer if one had very high credence that the paradigm opened by o1 is a blind alley (relative to the goal of developing human-worker-omni-replacement-capable AI) but profitable enough that OA gets distracted from its official most ambitious goal.
I’m not that person.
Presumably light blue is o3 high, and dark blue is o3 low?
I think they only have formal high and low versions for o3-mini
That’s some significant progress, but I don’t think it will lead to TAI.
However, there is a realistic best-case scenario where LLMs/Transformers stop just before that point and can give useful lessons and capabilities.
I would really like to see such an LLM system get as good as a top human team at security, so it could then be used to inspect and hopefully fix masses of security vulnerabilities. Note that this could give a false sense of security: an unknown-unknowns type of situation where it wouldn’t find a totally new type of attack, say a combined SW/HW attack like Rowhammer/Meltdown but more creative. A superintelligence not based on LLMs could, however.