I meant coding in particular, I agree algorithmic progress is not 3x faster. I checked again just now with someone and they did indeed report 3x speedup for writing code, although said that the new bottleneck becomes waiting for experiments to run (note this is not obviously something that can be solved by greater automation, at least up until the point that AI is picking better experiments than humans).
jsteinhardt
Research engineers I talk to already report >3x speedups from AI assistants. It seems like that has to be enough that it would be showing up in the numbers. My null hypothesis would be that programmer productivity is increasing exponentially and has been for ~2 years, and this is already being taken into account in the curves, and without this effect you would see a slower (though imo not massively slower) exponential.
(This would argue for dropping the pre-2022 models from the graph, which I think would give slightly faster doubling times, on the order of 5-6 months if I had to eyeball it.)
Doesn’t the trend line already take into account the effect you are positing? ML research engineers already say they get significant and increasing productivity boosts from AI assistants and have been for some time. I think the argument you are making is double-counting this. (Unless you want to argue that the kink with Claude is the start of the super-exponential, which we would presumably get data on pretty soon).
When you only have a couple thousand copies, you probably don’t want to pay for the speedup; e.g., even going an extra 4x faster decreases the number of copies by 8x.
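A minimal sketch of the tradeoff being gestured at, under the assumption (implied by the 4x/8x numbers above, since 4^1.5 = 8) that running a model s-times faster costs compute proportional to s^1.5; the function name and cost exponent are illustrative, not from any particular source:

```python
# Hypothetical tradeoff: serial speedup s costs compute ~ s**1.5,
# consistent with "an extra 4x decreases the number of copies by 8x"
# (since 4**1.5 == 8). Names and the exponent are assumptions.
def num_copies(total_compute: float, speedup: float) -> float:
    """Copies runnable at a given serial speedup, under the s**1.5 cost assumption."""
    return total_compute / speedup**1.5

base = num_copies(1000.0, 1.0)  # copies at 1x speed
fast = num_copies(1000.0, 4.0)  # copies at 4x speed
print(base / fast)              # 8.0 -- an extra 4x speed means 8x fewer copies
```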
I also think that when you don’t have control over your own hardware, the speedup schemes become harder, since they might require custom network topologies. Not sure about that, though.
While I am not close to this situation, I felt moved to write something, mostly to support junior researchers and staff such as TurnTrout, Thomas Kwa, and KurtB who are voicing difficult experiences that may be challenging for them to talk about; and partly because I can provide perspective as someone who has managed many researchers and worked in a variety of research and non-research organizations and so can more authoritatively speak to what behaviors are ‘normal’ and what patterns tend to lead to good or bad outcomes. Caveat that I know very little about any internal details of MIRI, but I am still reasonably confident of what I’m saying based on general patterns and experience in the world.
Based on reading Thomas Kwa’s experience, as well as KurtB’s experience, Nate Soares’ behavior is far outside any norms of acceptable behavior that I’d endorse. Accepting or normalizing this behavior within an organization has a corrosive effect on the morale, epistemics, and spiritual well-being of its members. The morale effects are probably obvious, but regarding epistemics, leadership is significantly less likely to get useful feedback if people are afraid to cross them (psychological safety is an important concept here). Finally, regarding spirit, normalizing this behavior sends a message to people that they aren’t entitled to set boundaries or be respected, which can create far-reaching damage in their other interactions and in their image of themselves. Based on this, I feel very worried for MIRI and think it should probably do a serious re-think of its organizational culture.
Since some commenters brought up academia and the idea that some professors can be negligent or difficult to work with, I will compare Nate’s behavior to professors in CS academia. Looking at what Thomas Kwa described, I can think of some professors who exhibit individual traits in Thomas’ description, but someone who had all of them at once would be an outlier (in a field that is already welcoming to difficult personalities), and I would strongly warn students against working with such a person. KurtB’s experience goes beyond that and seems at least a standard deviation worse; if someone behaved this way, I would try to minimize their influence in any organization I was part of and refuse to collaborate with them. I would expect even a tenured faculty member to get a serious talking-to about their behavior from colleagues (though maybe some places would be too cowardly to have this conversation), and for HR complaints to stack up.
Nate, the best description I can think of for what’s going on is that you have fairly severe issues with emotional regulation. Your comments indicate that you see this as a basic aspect of your emotional make-up (and maybe intimately tied to your ability to do research), but I have seen this pattern several times before and I am pretty confident this is not the case. In previous cases I’ve seen, the person in question expresses or exhibits an unwillingness to change up until the point that they face clear consequences for their actions, at which point (after a period of expressing outrage) they buckle down and make the changes, which usually changes their own life for the better, including being able to think more clearly. A first step would be going to therapy, which I definitely recommend. I am pretty confident that even for your own sake you should make a serious effort to make changes here. (I hope this doesn’t come across as condescending, as I genuinely believe this is good advice.)
Along these lines, for people around Nate who think that they “have” to accept this behavior because Nate’s work is important, even on those grounds alone setting boundaries on the behavior will lead to better outcomes.
Here is an example of how an organization could set boundaries on this behavior: If Nate yells at a staff member, that staff member no longer does ops work for Nate until he apologizes and expresses a credible commitment to communicate more courteously in the future. (This could be done in principle by making it opt-in to do continued ops work for Nate if this happens, and working hard to create a real affordance for not opting in.)
The important principle here is that Nate internalizes the costs of his decisions (by removing his ability to impose costs on others, and bearing the resulting inconvenience). Here the cost to Nate is also generally lower than the cost that would have been imposed on others (inflating your own bike tire is less costly than having your day ruined by being yelled at), though this isn’t crucial. The important thing is Nate would have skin in the game—if he still doesn’t change, then I believe somewhat more that he’s actually incapable of doing so, but I would guess that this would actually lead to changes. And if MIRI for some reason believes that other people should be willing to bear large costs for small benefits to Nate, they should also hire a dedicated staff to do damage control for him. (Maybe some or all of this is already happening… I am not at MIRI so I don’t know, but it doesn’t sound this way based on the experiences that have been shared.)
In summary: based on my own personal experience across many organizations, Nate’s behavior is not okay and MIRI should set boundaries on it. I do not believe Nate’s claim that this is a fundamental aspect of his emotional make-up, as it matches other patterns in the past that have changed when consequences were imposed, and even if it is a fundamental aspect he should face the natural consequences of his actions. These consequences should center on removing his ability to harm others, or, if this is not feasible, creating institutions at MIRI to reliably clean up after him and maintain psychological safety.
I don’t see it in the header in Mobile (although I do see the updated text now about it being a link post). Maybe it works on desktop but not mobile?
Is it clear these results don’t count? I see nothing in the Metaculus question text that rules it out.
Mods, could you have these posts link back to my blog Bounded Regret in some form? Right now there is no indication that this is cross-posted from my blog, and no link back to the original source.
Dan spent his entire PhD working on AI safety and did some of the most influential work on OOD robustness and OOD detection, as well as writing Unsolved Problems. Even if this work is less valued by some readers on LessWrong (imo mistakenly), it seems pretty inaccurate to say that he didn’t work on safety before founding CAIS.
Melanie Mitchell and Meg Mitchell are different people. Melanie was the participant in this debate, but you seem to be ascribing Meg’s opinions to her, including linking to video interviews with Meg in your comments.
I’m leaving it to the moderators to keep the copies mirrored, or just accept that errors won’t be corrected on this copy. Hopefully there’s some automatic way to do that?
Oops, thanks, updated to fix this.
Thanks! I removed the link.
Thanks! I removed the link.
Glad it was helpful!
Hi Alex,
Let me first acknowledge that your write-up is significantly more thorough than pretty much all content on LessWrong, and that I found the particular examples interesting. I also appreciated that you included a related work section in your write-up. The reason I commented on this post and not others is that it’s one of the few ML posts on LessWrong that seemed like it might teach me something, and I wish I had made that more clear before posting critical feedback (I was thinking of the feedback as directed at Oliver / Raemon’s moderation norms, rather than your work, but I realize in retrospect it probably felt directed at you).
I think the main important point is that there is a body of related work in the ML literature that explores fairly similar ideas, and LessWrong readers who care about AI alignment should be aware of this work, and that most LessWrong readers who read the post won’t realize this. I think it’s good to point out Dan’s initial mistake, but I took his substantive point to be what I just summarized, and it seems correct to me and hasn’t been addressed. (I also think Dan overfocused on Ludwig’s paper, see below for more of my take on related work.)
Here is how I currently see the paper situated in broader work (I think you do discuss the majority but not all of this):
* There is a lot of work studying activation vectors in computer vision models, and the methods here seem broadly similar to the methods there. This seems like the closest point of comparison.
* In language, there’s a bunch of work on controllable generation (https://arxiv.org/pdf/2201.05337.pdf) where I would be surprised if no one looked at modifying activations (at least I’d expect someone to try soft prompt tuning), but I don’t know for sure.
* On modifying activations in language models there is a bunch of stuff on patching / swapping, and on modifying stuff in the directions of probes.
I think we would probably both agree that this is the main set of related papers, and also both agree that you cited work within each of these branches (except maybe the second one). Where we differ is that I see all of this as basically variations on the same idea of modifying the activations or weights to control a model’s runtime behavior:
* You need to find a direction, which you can do either by learning a direction or by simple averaging. Simple averaging is more or less the same as one step of gradient descent, so I see these as conceptually similar.
* You can modify the activations or weights. Usually if an idea works in one case it works in the other case, so I also see these as similar.
* The modality can be language or vision. Most prior work has been on vision models, but some of that has also been on vision-language models; e.g., I’m pretty sure there’s a paper on averaging together CLIP activations to get controllable generation.

So I think it’s most accurate to say that you’ve adapted some well-explored ideas to a use case that you are particularly interested in. However, the post uses language like “Activation additions are a new way of interacting with LLMs”, which seems to be claiming that this is entirely new and unexplored, and I think this could mislead readers, as for instance Thomas Kwa’s response seems to suggest.
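To make the “find a direction by simple averaging, then modify activations” idea concrete, here is a generic illustration (not the post’s exact method, nor any specific paper’s): average activation differences over contrastive prompt pairs to get a steering vector, then add it to the residual stream at inference time. All names, shapes, and the coefficient are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Pretend these are layer-L activations for paired contrastive prompts
# (e.g. "I love ..." vs "I hate ..."); here they are just random stand-ins.
acts_pos = rng.normal(size=(8, d_model))
acts_neg = rng.normal(size=(8, d_model))

# Simple averaging of differences -- the "gradient-free" direction-finding
# step discussed above, roughly analogous to one step of gradient descent.
steering_vec = (acts_pos - acts_neg).mean(axis=0)

def steer(activations: np.ndarray, coeff: float = 2.0) -> np.ndarray:
    """Add the steering vector to every token position's activation."""
    return activations + coeff * steering_vec

h = rng.normal(size=(5, d_model))  # activations for 5 token positions
h_steered = steer(h)
print(h_steered.shape)             # (5, 16)
```

The same skeleton covers most of the variants above: swap the averaging for a learned probe direction, or apply the edit to weights instead of activations.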
I also felt like Dan H brought up reasonable questions (e.g. why should we believe that weights vs. activations is a big deal? Why is fine-tuning vs. averaging important? Have you tried testing the difference empirically?) that haven’t been answered that would be good to at least more clearly acknowledge. The fact that he was bringing up points that seemed good to me that were not being directly engaged with was what most bothered me about the exchange above.
This is my best attempt to explain where I’m coming from in about an hour of work (spent e.g. reading through things and trying to articulate intuitions in LW-friendly terms). I don’t think it captures my full intuitions or the full reasons I bounced off the related work section, but hopefully it’s helpful.
I’ll just note that I, like Dan H, find it pretty hard to engage with this post because I can’t tell whether it’s basically the same as the Ludwig Schmidt paper (my current assumption is that it is). The paragraph the authors added didn’t really help in this regard.
I’m not sure what you mean about whether the post was “missing something important”, but I do think that you should be pretty worried about LessWrong’s collective epistemics given that Dan H is the only one bringing this important point up, and that rather than being rewarded for doing so or engaged with on his substantive point, he’s being nitpicked by a moderator. It’s not an accident that no one else is bringing these points up—it’s because everyone else who has the expertise to do so has given up or judged it not worth their time, largely because of responses like the one Dan H is getting.
Here is my take: since there’s so much AI content, it’s not really feasible to read all of it, so in practice I read almost none of it (and consequently visit LW less frequently).
The main issue I run into is that for most posts, on a brief skim it seems like basically a thing I have thought about before. Unlike academic papers, most LW posts do not cite previous related work nor explain how what they are talking about relates to this past work. As a result, if I start to skim a post and I think it’s talking about something I’ve seen before, I have no easy way of telling if they’re (1) aware of this fact and have something new to say, (2) aware of this fact but trying to provide a better exposition, or (3) unaware of this fact and reinventing the wheel. Since I can’t tell, I normally just bounce off.
I think a solution could be to have a stronger norm that posts about AI should say, and cite, what they are building on and how it relates / what is new. This would decrease the amount of content while improving its quality, and also make it easier to choose what to read. I view this as a win-win-win.
I think this might be an overstatement. It’s true that NSF tends not to fund developers, but in ML the NSF is only one of many funders (lots of faculty have grants from industry partnerships, for instance).
Looks like an issue with the cross-posting (it works at https://bounded-regret.ghost.io/analyzing-long-agent-transcripts-docent/). Moderators, any idea how to fix?
EDIT: Fixed now, thanks to Oliver!