Works now for me
If anybody here knows someone from CAIS, they need to set up their non-www domain name. Going to https://safe.ai shows a GitHub landing page.
Interesting, but a non sequitur. That is, either you believe that interest rates will predictably increase and there’s free money on the table, and you should just say so, or not, and this anecdote doesn’t seem to be relevant (similarly, I made money buying NVDA around that time, but I don’t think that proves anything).
I am saying so! The market is definitely not pricing in AGI; doesn’t matter if it comes in 2028, or 2035, or 2040. Though interest rates are a pretty bad way to arb this; I would just buy call options on the Nasdaq.
Perhaps, but shouldn’t LLMs already be speeding up AI progress? And if so, shouldn’t that already be reflected in METR’s plot?
They’re not that useful yet.
The outside view, insofar as that is a well-defined thing...
It’s not really a well-defined thing, which is why the standard on this site is to taboo those words and just explain what your lines of evidence are, or the motivation for any special priors if you have them.
If AGI were arriving in 2030, the outside view says interest rates would be very high (I’m not particularly knowledgeable about this and might have the details wrong, but see the analysis here; I believe the situation is still similar), and less confidently I think the S&P’s value would probably be measured in lightcone percentage points (?).
So, your claim is that interest rates would be very high if AGI were imminent, and they’re not so it’s not. The last time someone said this, if the people arguing in the comment section had simply made a bet on interest rates changing, they would have made a lot of money! Ditto for buying up AI-related stocks or call options on those stocks.
I think you’re just overestimating the ability of the market to generalize to out-of-distribution events. Prices are set by a market’s participants, and the institutions with the ability to move prices are mostly not thinking about AGI timelines at present. It wouldn’t matter if AGI were arriving in five or ten or twenty years, Bridgewater would be basically doing the same things, and so their inaction doesn’t provide much evidence. Inherent in these forecasts are also, naturally, a lot of assumptions about the value of money (or of titles to partial ownership of companies controlled by Sam Altmans) in a post-AGI scenario. These are pretty well-disputed premises, to say the least, which makes interpreting current market prices hard.
As far as I am concerned, AGI should be able to do any intellectual task that a human can do. I think that inventing important new ideas tends to take at least a month, but possibly the length of a PhD thesis. So it seems to be a reasonable interpretation that we might see human-level AI around the mid-2030s to 2040, which happens to be about my personal median.
The issue is, ML research itself is composed of many tasks that do take less than a month for humans to execute. For example, on this model, sometime before “idea generation”, you’re going to have a model that can do most high-context software engineering tasks. The research department at any of the big AI labs would be able to do more stuff if it had such a model. So while current AI is not accelerating machine learning research that much, as it gets better, the trend line from the METR paper is going to curl upward.
You could say that the “inventing important new ideas” part is going to be such a heavy bottleneck that this speedup won’t amount to much. But I think that’s mostly wrong, and that if you asked ML researchers at OpenAI, a drop-in remote worker that could “only” be directed to do things that otherwise took 12 hours would speed up their work by a lot.
But the deeper problem is that the argument is ultimately, subtly circular. Current AI research does look a lot like rapidly iterating and trying random engineering improvements. If you already believe this will lead to AGI, then certainly AI coding assistants which can rapidly iterate would expedite the process. However, I do not believe that blind iteration on the current paradigm leads to AGI (at least not anytime soon), so I see no reason to accept this argument.
It’s actually not circular at all. “Current AI research” has taken us from machines that can’t talk to machines that can talk, write computer programs, give advice, etc. in about five years. That’s the empirical evidence that you can make research progress doing “random” stuff. In the absence of further evidence, people are just expecting the thing that has happened over the last five years to continue. You can reject that claim, but at this point I think the burden of proof is on the people that do.
I don’t predict a superintelligent singleton (having fused with the other AIs) would need to design a bioweapon or otherwise explicitly kill everyone. I expect it to simply transition into using more efficient tools than humans, and transfer the existing humans into hyperdomestication programs.
+1, this is clearly a lot more likely than the alignment process missing humans entirely IMO
Now, one could reasonably counter-argue that the yin strategy delivers value somewhere else, besides just e.g. “probability of a date”. Maybe it’s a useful filter for some sort of guy...
I feel like you know this is the case and I’m wondering why you’re even asking the question. Of course it’s a filter; the entire mating process is. Women like confidence, and taking these mixed signals as a sign of attraction is itself a sign of confidence. Walking up to a guy and asking him to have sex immediately would also be more “efficient” by some deranged standards, but the point of flirting is that you get to signal social grace, selectivity, and a willingness to walk away from the interaction.
Close friend of mine, a regular software engineer, recently threw tens of thousands of dollars—a sizable chunk of his yearly salary—at futures contracts on some absurd theory about the Japanese Yen. Over the last few weeks, he coinflipped his money into half a million dollars. Everyone who knows him was begging him to pull out and use the money to buy a house or something. But of course yesterday he sold his futures contracts and bought into 0DTE Nasdaq options on another theory, and literally lost everything he put in and then some. I’m not sure but I think he’s down about half his yearly salary overall.
He has been doing this kind of thing for the last two years or so—not just making investments, but making the most absurd, high risk investments you can think of. Every time he comes up with a new trade, he has a story for me about how his cousin/whatever who’s a commodities trader recommended the trade to him, or about how a geopolitical event is gonna spike the stock of Lockheed Martin, or something. On many occasions I have attempted to explain some kind of Inadequate Equilibria thesis to him, but it just doesn’t seem to “stick”.
It’s not that he “rejects” the EMH in these conversations. I think for a lot of people there is literally no slot in their mind that is able to hold market efficiency/inefficiency arguments. They just see stocks moving up and down. Sometimes the stocks move in response to legible events. They think, this is a tractable problem, I just have to predict the legible events. How could I be unable to make money? Those guys from The Big Short did!
He is also taking a large amount of stimulants. I think that is compounding the situation a bit.
Typically I operationalize “employable as a software engineer” as being capable of completing tasks like:
“Fix this error we’re getting on BetterStack.”
“Move our Redis cache from DigitalOcean to AWS.”
“Add and implement a cancellation feature for ZeroPath scans.”
“Add the results of this evaluation to our internal benchmark.”
These are pretty representative examples of the kinds of tasks your median software engineer will be getting and resolving on a day-to-day basis.
No chatbot or chatbot wrapper can complete tasks like these for an engineering team at present, including Devin et al. Partly this is because most software engineering work is very high-context, in the sense that implementing the proper solution depends on understanding a large body of existing infrastructure, business knowledge, and code.
When people talk about models today doing “agentic development”, they’re usually describing their ability to complete small projects in low-context situations, where all you need to understand is the prompt itself and software engineering as a discipline. That makes sense, because if you ask AIs to write (for example) a Pong game in JavaScript, the AI can complete each of the pieces in one pass and fit everything it’s doing into one context window. But that kind of task is unlike the vast majority of things employed software engineers do today, which is why we’re not experiencing an intelligence explosion right this second.
They strengthen chip export restrictions, order OpenBrain to further restrict its internet connections, and use extreme measures to secure algorithmic progress, like wiretapping OpenBrain employees—this catches the last remaining Chinese spy
Wiretapping? That’s it? Was this spy calling Xi from his home phone? xD
As a newly minted +100 strong upvote, I think the current karma economy accurately reflects how my opinion should be weighted
I have Become Stronger
My strong upvotes are now giving +1 and my regular upvotes give +2.
test
Just edited the post because I think the way it was phrased kind of exaggerated the difficulties we’ve been having applying the newer models. 3.7 was better, as I mentioned to Daniel, just underwhelming and not as big a leap as either 3.6 or certainly 3.5.
If you plot a line, does it plateau or does it get to professional human level (i.e. reliably doing all the things you are trying to get it to do as well as a professional human would)?
It plateaus before professional human level, both in a macro sense (comparing what ZeroPath can do vs. human pentesters) and in a micro sense (comparing the individual tasks ZeroPath does when it’s analyzing code). At least, the errors the models make are not ones I would expect a professional to make; I haven’t actually hired a bunch of pentesters and asked them to do the same tasks we expect of the language models and made the diff. One thing our tool has over people is breadth, but that’s because we can parallelize inspection of different pieces and not because the models are doing tasks better than humans.
What about 4.5? Is it as good as 3.7 Sonnet but you don’t use it for cost reasons? Or is it actually worse?
We have not yet tried 4.5 as it’s so expensive that we would not be able to deploy it, even for limited sections.
We use different models for different tasks for cost reasons. The primary workhorse model today is 3.7 Sonnet, whose improvement over 3.6 Sonnet was smaller than 3.6's improvement over 3.5 Sonnet. When tried in place of this workhorse model, o3-mini and the rest of the recent o-series models were strictly worse than 3.6.
Recent AI model progress feels mostly like bullshit
I haven’t read the METR paper in full, but from the examples given I’m worried the tests might be biased in favor of an agent with no capacity for long-term memory, or at least one not hitting the thresholds where context limitations become a problem:
For instance, task #3 here is at the limit of current AI capabilities (takes an hour). But it’s also something that could plausibly be done with very little context; if the AI just puts all of the example files in its context window it might be able to write the rest of the decoder from scratch. It might not even need to have the example files in memory while it’s debugging its project against the test cases.
Whereas a task to fix a bug in a large software project, while it might take an engineer associated with that project “an hour” to finish, requires stretching the limits of how much information a model can fit inside its context window, or recall beyond what current systems seem capable of.
There was a type of guy circa 2021 who basically said that GPT-3 etc. was cool, but that we should be cautious about assuming everything was going to change, because the context limitation was a key bottleneck that might never be overcome. That guy’s take was briefly “discredited” in subsequent years when LLM companies increased context lengths to 100k or 200k tokens.
I think that was premature. The context limitations (in particular the lack of an equivalent to human long-term memory) are the key deficit of current LLMs, and we haven’t really seen much improvement at all.
I find it a little suspicious that the recent OpenAI model releases became leaders on the MASK dataset (when previous iterations didn’t seem to trend better at all), but I’m hopeful this represents deeper alignment successes and not simple tuning on the same or a similar set of tasks.