Adam Jermyn

Karma: 1,599

Adam Jermyn Apr 2, 2025, 3:48 PM
3 points
1
in reply to: Annapurna’s comment on: Tracing the Thoughts of a Large Language Model
As long as you make it clear at the header that it’s your unofficial translation, go for it!

Adam Jermyn Mar 28, 2025, 4:36 PM
7 points
3
in reply to: derek shiller’s comment on: Tracing the Thoughts of a Large Language Model
I would guess that models plan in this style much more generally. It’s just useful in so many contexts. For instance, if you’re trying to choose what article goes in front of a word, and that word is fixed by other constraints, you need a plan of what that word is (“an astronomer” not “a astronomer”). Or you might be writing code and have to know the type of the return value of a function before you’ve written the body of the function, since Python type annotations come at the start of the function in the signature. Etc. This sort of thing just comes up all over the place.
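A minimal sketch of both examples in one place. The function name `article_for` and the first-letter heuristic are illustrative assumptions, not something from the post. The return annotation `-> str` has to be written before any of the body exists, and the article can only be chosen once the following word is known.

```python
def article_for(next_word: str) -> str:
    # The return type `str` is committed to in the signature, before the
    # body is written, so the author has to know it in advance.
    # Crude heuristic (an assumption for illustration): decide by the first
    # letter. It handles "astronomer" but would get words like "hour" wrong.
    return "an" if next_word.lower().startswith(tuple("aeiou")) else "a"


word = "astronomer"  # the word fixed by other constraints
phrase = f"{article_for(word)} {word}"
print(phrase)  # prints "an astronomer"
```

In both cases the earlier token (the annotation, the article) depends on a later decision, which is the sense of "planning" meant above.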

Adam Jermyn Mar 28, 2025, 4:33 PM
6 points
0
in reply to: Archimedes’s comment on: Tracing the Thoughts of a Large Language Model
It’s not so much that we didn’t think models plan ahead in general, as that we had various hypotheses (including “unknown unknowns”) and this kind of planning in poetry wasn’t obviously the best one until we saw the evidence.

[More generally: in Interpretability we often have the experience of being surprised by the specific mechanism a model is using, even though with the benefit of hindsight it seems obvious. E.g. when we did the work for Towards Monosemanticity we were initially quite surprised to see the “the in <context>” features, thought they were indicative of a bug in our setup, and had to spend a while thinking about them and poking around before we realized why the model wanted them (which now feels obvious).]

Tracing the Thoughts of a Large Language Model

Adam Jermyn Mar 27, 2025, 5:20 PM

230 points

22 comments · 10 min read · LW link

(www.anthropic.com)

Auditing language models for hidden objectives

Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub

Mar 13, 2025, 7:18 PM

138 points

15 comments · 13 min read · LW link

Adam Jermyn Dec 30, 2024, 4:59 AM
10 points
5
in reply to: habryka’s comment on: evhub’s Shortform
I can also confirm (I have a 3:1 match).

Adam Jermyn Nov 10, 2024, 9:09 PM
13 points
6
in reply to: Milan W’s comment on: Personal AI Planning
Unless we build more land (either in the ocean or in space)?

Adam Jermyn Oct 29, 2024, 12:14 AM
6 points
2
in reply to: Ben Pace’s comment on: Dario Amodei — Machines of Loving Grace
There is Dario’s written testimony before Congress, which mentions existential risk as a serious possibility: https://www.judiciary.senate.gov/imo/media/doc/2023-07-26_-_testimony_-_amodei.pdf
He also signed the CAIS statement on x-risk: https://www.safe.ai/work/statement-on-ai-risk

Adam Jermyn Oct 16, 2024, 12:11 AM
7 points
5
in reply to: Ben Pace’s comment on: Dario Amodei — Machines of Loving Grace
He does start out by saying he thinks & worries a lot about the risks (first paragraph):

I think and talk a lot about the risks of powerful AI. The company I’m the CEO of, Anthropic, does a lot of research on how to reduce these risks… I think that most people are underestimating just how radical the upside of AI could be, just as I think most people are underestimating how bad the risks could be.

He then explains (second paragraph) that the essay is meant to sketch out what things could look like if things go well:

In this essay I try to sketch out what that upside might look like—what a world with powerful AI might look like if everything goes right.

I think this is a coherent thing to do?

Adam Jermyn Oct 16, 2024, 12:06 AM
11 points
0
in reply to: ryan_greenblatt’s comment on: Dario Amodei — Machines of Loving Grace
I get 1e7 using 16 bit-flips per bfloat16 operation, a 300 K operating temperature, and 312 Tflop/s (from Nvidia’s spec sheet). My guess is that this is a little high because a float multiplication involves more operations than just flipping 16 bits, but it’s the right order of magnitude.
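A rough sketch of the arithmetic, under stated assumptions. The comment does not spell out what the 1e7 is a ratio of. The code below assumes it compares actual power draw with the Landauer-limit power (k_B T ln 2 per bit-flip). The 400 W board power is an assumption that does not appear in the comment.

```python
import math

K_B = 1.380649e-23       # Boltzmann constant, J/K (exact SI value)
T = 300.0                # operating temperature from the comment, K
BITS_PER_OP = 16         # bit-flips per bfloat16 operation, from the comment
OPS_PER_S = 312e12       # 312 Tflop/s, from the comment
ASSUMED_POWER_W = 400.0  # ASSUMPTION: board power, not stated in the comment

# Landauer bound: minimum energy to erase or flip one bit at temperature T.
energy_per_bit = K_B * T * math.log(2)  # joules per bit

# Minimum power needed to sustain the stated throughput at that bound.
landauer_power = energy_per_bit * BITS_PER_OP * OPS_PER_S  # watts

# How far the assumed real power draw sits above the bound.
ratio = ASSUMED_POWER_W / landauer_power

print(f"Landauer-limit power: {landauer_power:.2e} W")
print(f"Assumed power / Landauer-limit power: {ratio:.1e}")
```

With these inputs the ratio comes out at a few times 1e7, consistent with the order of magnitude quoted. Counting more than 16 bit-flips per multiplication would raise the Landauer-limit power and lower the ratio.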

Adam Jermyn Jun 15, 2024, 5:57 PM
6 points
2
on: Yann LeCun: We only design machines that minimize costs [therefore they are safe]
Another objection is that you can minimize the wrong cost function. Making “cost” go to zero could mean making “the thing we actually care about” go to (negative huge number).

Adam Jermyn Jun 1, 2024, 1:37 AM
25 points
16
in reply to: TurnTrout’s comment on: MIRI 2024 Communications Strategy
One day a mathematician doesn’t know a thing. The next day they do. In between they made no observations with their senses of the world.

It’s possible to make progress through theoretical reasoning. It’s not my preferred approach to the problem (I work on a heavily empirical team at a heavily empirical lab) but it’s not an invalid approach.

Adam Jermyn Dec 6, 2023, 4:51 AM
2 points
0
in reply to: Raemon’s comment on: The LessWrong 2022 Review
I’m guessing that the sales numbers aren’t high enough to make $200k if sold at plausible markups?

Adam Jermyn Dec 3, 2023, 3:38 AM
7 points
0
in reply to: Sam Marks’s comment on: How useful is mechanistic interpretability?
In Towards Monosemanticity we also did a version of this experiment, and found that the SAE was much less interpretable when the transformer weights were randomized (https://transformer-circuits.pub/2023/monosemantic-features/index.html#appendix-automated-randomized).

Adam Jermyn Oct 14, 2023, 12:43 PM
LW: 7 AF: 4
4
AF
in reply to: Zvi’s comment on: RSPs are pauses done right
Anthropic’s RSP includes evals after every 4x increase in effective compute and after every 3 months, whichever comes sooner, even if this happens during training, and the policy says that these evaluations include fine-tuning.

Adam Jermyn Jul 19, 2023, 11:05 PM
LW: 16 AF: 9
3
AF
on: Alignment Grantmaking is Funding-Limited Right Now
This matches my impression. At EAG London I was really stunned (and heartened!) at how many skilled people are pivoting into interpretability from non-alignment fields.

Adam Jermyn May 17, 2023, 8:01 AM
LW: 3 AF: 2
0
AF
on: EIS IX: Interpretability and Adversaries

Second, the measure of “features per dimension” used by Elhage et al. (2022) might be misleading. See the paper for details of how they arrived at this quantity. But as shown in the figure above, “features per dimension” is defined as the Frobenius norm of the weight matrix before the layer divided by the number of neurons in the layer. But there is a simple sanity check that this doesn’t pass. In the case of a ReLU network without bias terms, multiplying a weight matrix by a constant factor will cause the “features per dimension” to be increased by that factor squared while leaving the activations in the forward pass unchanged up to linearity until a non-ReLU operation (like a softmax) is performed. And since each component of a softmax’s output is strictly increasing in that component of the input, scaling weight matrices will not affect the classification.

It’s worth noting that Elhage+2022 studied an autoencoder with tied weights and no softmax, so there isn’t actually freedom to rescale the weight matrix without affecting the loss in their model, making the scale of the weights meaningful. I agree that this measure doesn’t generalize to other models/tasks though.

They also define a more fine-grained measure (the dimensionality of each individual feature) in a way that is scale-invariant and which broadly agrees with their coarser measure...
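The rescaling argument in the quoted passage can be checked numerically. This is a minimal sketch, assuming a single bias-free ReLU layer with arbitrary illustrative weights. The helper names are hypothetical, and "features per dimension" is taken here as the squared Frobenius norm divided by the number of neurons, which is what the "factor squared" claim implies.

```python
import math


def relu_layer(weights, x):
    # Bias-free ReLU layer: max(0, W x), with W given as a list of rows.
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def frobenius_sq_per_neuron(weights):
    # Squared Frobenius norm divided by the number of neurons (rows).
    return sum(w * w for row in weights for w in row) / len(weights)


def argmax(values):
    # Index of the largest entry; softmax preserves this ordering.
    return max(range(len(values)), key=lambda i: values[i])


W = [[1.0, -2.0], [0.5, 3.0], [-1.5, 0.25]]  # arbitrary illustrative weights
x = [0.7, 0.2]                               # arbitrary input
c = 3.0                                      # positive rescaling factor
W_scaled = [[c * w for w in row] for row in W]

base = relu_layer(W, x)
scaled = relu_layer(W_scaled, x)

# The measure grows by c**2, the activations only by c, and the predicted
# class (argmax) is unchanged.
measure_ratio = frobenius_sq_per_neuron(W_scaled) / frobenius_sq_per_neuron(W)
print(measure_ratio, base, scaled, argmax(base) == argmax(scaled))
```

This only shows the freedom that exists in an untied ReLU classifier. In the tied-weight autoencoder without softmax described in the reply, rescaling the weights changes the reconstruction and hence the loss, so the same check would not go through.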

Conditioning Predictive Models: Open problems, Conclusion, and Appendix

evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton

Feb 10, 2023, 7:21 PM

36 points

3 comments · 11 min read · LW link

Conditioning Predictive Models: Deployment strategy

evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton

Feb 9, 2023, 8:59 PM

28 points

0 comments · 10 min read · LW link

Conditioning Predictive Models: Interactions with other approaches

evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton

Feb 8, 2023, 6:19 PM

32 points

2 comments · 11 min read · LW link
