LessWrong Team
I have signed no contracts or agreements whose existence I cannot mention.
This is fixed now. :)
This bug is at least fixed now! I await your next report, thanks.
Knowing user emotions is good! And it’s sometimes nice when people care enough to get mad. I’m working on fixing this now.
I’m afraid you’re right that the overlay does not automatically match the post page. Unfortunately, it turns out a lot of complexity gets added because the overlay has its own scrolling context, and I’m having to make a lot of adjustments for that. Fortunately the core elements of the post page don’t really change, so once we’re matched it should stay matched.
This should be fixed now.
True! That’s now next on my list.
Oh, indeed. That’s no good. I’ll fix it.
All buttons within the feed don’t do anything, but other buttons on the site do? That’s very strange.
Huh, that’s pretty odd and not good. I’ll look into it. Any other buttons or interactions that do nothing?
Fwiw, for me the calculated individual speed-up was [-200%, 40%], which, while it does weigh predominantly toward the negative, also spans positive territory (I got these numbers from the authors after writing my above comments). I’m not sure that counts as unambiguously wrong about my remembered experience.
I’m not pushing against the study results so much as what I think are misinterpretations people are going to make off of this study.
If the claim is “on a selected kind of task, developers in early 2025, predominantly using models like Claude 3.5 and 3.7, were slowed down when they thought they were sped up”, then I’m not dismissing that. I don’t think the study is that clean or unambiguous given methodological challenges, but I find it quite plausible.
In the above, I do only the following: (1) offer an explanation for the result, (2) point out that I individually feel misrepresented by a particular descriptor, (3) point out and affirm points the authors also make, (a) about this being point-in-time, and (b) about there being selection effects at play.
You can say that if I feel now that I’m being sped up, I should be less confident given the results, and to that I say yeah, I am. And I’m surprised by the result too.
There’s a claim you’re making here, that “I went looking for reasons”, that feels weird. I don’t take it that whenever a result is “your remembered experience is wrong”, I’m being epistemically unvirtuous if I question it or discuss details. To repeat, I question the interpretation/generalization some might make rather than the raw result or even what the authors interpret it as, and I think as a participant I’m better positioned to notice the misgeneralization than someone who has only heard the headline result (people reading the actual paper probably end up in the right place).
I was one of the developers in the @METR_Evals study. Some thoughts:
1. This is much less true of my participation in the study where I was more conscientious, but I feel like historically a lot of my AI speed-up gains were eaten by the fact that while a prompt was running, I’d look at something else (FB, X, etc) and continue to do so for much longer than it took the prompt to run.
I discovered two days ago that Cursor has (or now has) a feature you can enable to ring a bell when the prompt is done. I expect to reclaim a lot of the AI gains this way.
2. Historically I’ve lost some of my AI speed-ups to cleaning up the same issues LLM code would introduce, often relatively simple violations of code conventions like using || instead of ?? (a small example of why that matters is at the end of this list).
A bunch of this is avoidable with stored system prompts which I was lazy about writing. Cursor has now made this easier and even attempts to learn repeatable rules “The user prefers X” that will get reused, saving time here.
3. Regarding me specifically, I work on the LessWrong codebase, which is technically open-source. I feel like calling myself an “open-source developer” has the wrong connotations, and makes it sound more like I contribute to a highly-used Python library or something as an upper-tier developer, which I’m not.
4. As a developer in the study, it’s striking to me how much more capable the models have gotten since February (when I was participating in the study).
I’m trying to recall if I was even using agents at the start. Certainly the later models (Opus 4, Gemini 2.5 Pro, o3) could just do vastly more with less guidance than 3.6, o1, etc.
For me, without having gone over my own data from the study, I could buy that maybe I was being slowed down a few months ago, but it is much, much harder to believe now.
5. There was a selection effect in which tasks I submitted to the study. (a) I didn’t want to risk getting randomized to “no AI” on tasks that felt sufficiently important or daunting to do without AI assistance. (b) Neatly packaged and well-scoped tasks felt suitable for the study; large open-ended greenfield stuff felt harder to legibilize, so I didn’t submit those tasks to the study even though the AI speed-up might have been larger.
6. I think if the result is valid at this point in time, that’s one thing; if people are still citing it in another 3 months’ time, they’ll be making a mistake (and I hope METR has published a follow-up by then).
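To make point 2 concrete, here’s a minimal sketch of the || vs ?? issue (the constant and function names are made up for the example, not from the LessWrong codebase): || falls back on any falsy value, while ?? only falls back on null/undefined, which is usually what you want for defaults.

```typescript
// Hypothetical example of the convention LLM code tended to violate.
const DEFAULT_LIMIT = 50;

// With ||, any falsy value (0, "", false) is replaced by the default.
function limitWithOr(userLimit?: number | null): number {
  return userLimit || DEFAULT_LIMIT; // an explicit 0 silently becomes 50
}

// With ??, only null or undefined fall back to the default.
function limitWithNullish(userLimit?: number | null): number {
  return userLimit ?? DEFAULT_LIMIT; // 0 stays 0; undefined becomes 50
}

console.log(limitWithOr(0));              // 50 (subtle bug)
console.log(limitWithNullish(0));         // 0
console.log(limitWithNullish(undefined)); // 50
```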
I was one of the devs. Granted, the money went to Lightcone and not me personally, but even if it had, I don’t see it motivating me in any particular direction. Not taking longer, for one – I’ve got too much to do to drag my feet just to make a little more money. And not pleasing METR – I didn’t believe they wanted any particular result.
Did you mean to reply to that parent?
I was part of the study actually. For me, I think a lot of the productivity gains were lost from starting to look at some distraction while waiting for the LLM and then being “afk” for a lot longer than the prompt took to run. However! I just discovered that Cursor has exactly the feature I wanted them to have: a bell that rings when your prompt is done. Probably that alone is worth 30% of the gains.
Other than that, the study started in February (?). The models have gotten a lot better in just the past few months, such that even if the result was true on average for the period it was run, I don’t expect it to be true now or in another three months (unless the devs are really bad at using AI or something).
Subjectively, I spend less time now trying to wrangle a solution out of them, and a lot more of the time it just works pretty quickly.
I agree it’s a stark difference. The intention here was to match other sites with feeds out of a general sense that our mobile font is too small.
If you wanted to choose one font size across mobile, which would you go for?
Hmm, that’s no good. Sorry for the slow reply, if you’re willing I’d like to debug it with you (will DM).
Oh, very reasonable. I’ll have a think about how to solve that. So I can understand what you’re trying to do, why is it you want to refresh the page?
Oh, that’s the audio player widget. Seems it is broken here! Thank you for the report.
Cheers for the feedback, and I apologize for the confusion and annoyance.
What do you mean by “makes the URL bar useless”? What’s the use you’re hoping would still be there? (Typing in a different address should still work.)
The point of the modals is that they don’t lose your place in the feed, which is technically hard to preserve with proper navigation, though it’s possible we should just figure out how to do that.
And ah yeah, the “view all comments” isn’t a link on right-click, but I can make it one (the titles already are). That’s a good idea.
All comment threads are what I call a “linear slice” (parent-child-child-child) with no branching. Conveying this relationship while breaking with the convention of the rest of the site (nesting) has proven tricky, but I’m reluctant to give up the horizontal space, and it looks cleaner. But two comments next to each other are just parent/child, and if there are omitted comments, there’s a bar saying “+N” that, when clicked, will display them.
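If it helps, here’s a rough sketch of what I mean by a linear slice, purely as illustration (these type names are made up, not our actual code):

```typescript
// Hypothetical shape of a "linear slice": an ordered parent→child→child chain,
// no branching, with a count of hidden comments rendered as a "+N" bar.
interface SliceComment {
  commentId: string;
  parentCommentId: string | null; // the comment directly above it in the slice
  omittedBefore: number;          // shown as "+N" when > 0; click to expand
}

type LinearSlice = SliceComment[];
```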
Something I will do is make it so the post modal and comments modal are one, and when you click to view a particular comment, you’ll be shown it but the rest will also be there, which should hopefully help with orienting.
Thanks again for writing up those thoughts!
I’m curious for examples – feel free to DM if you don’t want to draw further attention to them.
My guess is that (particularly before the introduction of older recommended posts to the posts list) people would find it strange to see old posts highlighted on the frontpage.