Thanks for the reply! Still trying to learn how to disagree properly so let me know if I cross into being nasty at all:
I’m sure they’ve gotten better. o1’s gains probably came more from its heavier use of intermediate reasoning, compute/runtime and such, but that said, at least up through 4o there do seem to have been improvements in the models themselves.
They can do incredible stuff in well-documented processes but don’t survive well off the trodden path. They seem to string things together pretty well, so I wouldn’t say there’s nothing going on besides memorization, but memorization seems to be a lot of it: it’s working with building blocks of memorized stuff and learning to stack them using the same sort of logic it uses to chain natural language. It fails exactly in the ways you’d expect if that were true, and it has done well in coding exactly as if that were true. The fact that SWE-bench gives fantastic scores despite my criticism and yours means those benchmarks are missing a lot and probably aren’t measuring the shortfalls, just as benchmarks historically haven’t.
See below: 4 was scoring pretty well on toolbox-oriented code exercises like Codeforces and did super well on more complex LeetCode problems… until the problems fell outside its training data, at which point it dropped from near perfect to barely being able to do anything.
https://x.com/cHHillee/status/1635790330854526981?t=tGRu60RHl6SaDmnQcfi1eQ&s=19
This was 4, but I don’t think o1 is much different. It looks like they update more frequently now, so this is harder to spot in major benchmarks, but I still see it constantly.
Even if I stop seeing it myself, I’m going to assume the problem is still there and just getting better at hiding, unless there’s a revolutionary change in how these models work. Catching lies up to this point seems to have selected for better lies.
Can you make some noise in the direction of the shockingly low numbers it gets on early ARC-AGI-2 benchmarks? That feels like pretty open-and-shut proof that it doesn’t generalize, no?
The fact that the model was trained on 75 percent of the training set feels like they jury-rigged a test set and RL’d the thing to success. If the <30% score on the second test ends up holding, that should shift our guesses about what it’s actually doing heavily away from genuine intelligence and toward brute-force search for verifiable answers.
The FrontierMath results just feel unconvincing. Chances are there are well-known problem structures with well-known solution structures, and it’s just plugging and chugging. Mathematicians who have looked at sample problems have indicated that both Tier 1 and Tier 2 problems have solutions they know by reflex, which implies these o3 results aren’t indicative of anything super interesting.
This just feels like a nothingburger, and I’m waiting for someone to convincingly tell me why my doubts are misplaced.