Cameron Berg

Karma: 1,681

Currently doing alignment and digital minds research @AE Studio

Meta AI Resident ’23, Cognitive science @ Yale ‘22, SERI MATS ’21, LTFF grantee.

Very interested in work at the intersection of AI x cognitive science x alignment x philosophy.

Cameron Berg May 12, 2025, 6:37 PM
4 points
1
on: Interim Research Report: Mechanisms of Awareness
Nice work. To me, this seems less like evidence that self-awareness is trivial, and more like evidence that it’s structurally latent. A single steering vector makes the model both choose risky options and say “I am risk-seeking”—despite the self-report behavior never being trained for. That suggests the model’s internal representations of behavior and linguistic self-description are already aligned. It’s probably not introspecting in a deliberate sense, but the geometry makes shallow self-modeling an easy, natural side effect.

Mistral Large 2 (123B) seems to exhibit alignment faking

Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Cameron Berg, Judd Rosenblatt, Mike Vaiana and AE Studio

Mar 27, 2025, 3:39 PM

80 points

4 comments13 min readLW link

Cameron Berg Mar 17, 2025, 2:58 PM
LW: 2 AF: 1
0
AF
in reply to: evhub’s comment on: Reducing LLM deception at scale with self-other overlap fine-tuning
Makes sense, thanks—can you also briefly clarify what exactly you are pointing at with ‘syntactic?’ Seems like this could be interpreted in multiple plausible ways, and looks like others might have a similar question.

Cameron Berg Mar 17, 2025, 2:50 AM
LW: 11 AF: 3
7
AF
in reply to: evhub’s comment on: Reducing LLM deception at scale with self-other overlap fine-tuning
The idea to combine SOO and CAI is interesting. Can you elaborate at all on what you were imagining here? Seems like there are a bunch of plausible ways you could go about injecting SOO-style finetuning into standard CAI—is there a specific direction you are particularly excited about?

Reducing LLM deception at scale with self-other overlap fine-tuning

Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Cameron Berg, Mike Vaiana and AE Studio

Mar 13, 2025, 7:09 PM

155 points

40 comments6 min readLW link

Alignment can be the ‘clean energy’ of AI

Cameron Berg, Judd Rosenblatt and AE Studio

Feb 22, 2025, 12:08 AM

67 points

8 comments8 min readLW link

Cameron Berg Nov 27, 2024, 9:52 PM
6 points
0
in reply to: markdon’s comment on: Making a conservative case for alignment
We’ve spoken to numerous policymakers and thinkers in DC. The goal is to optimize for explaining to these folks why alignment is important, rather than the median conservative person per se (ie, DC policymakers are not “median conservatives”).

Cameron Berg Nov 18, 2024, 7:36 PM
0 points
0
in reply to: Patrick Cootes’s comment on: Making a conservative case for alignment
Fixed, thanks!

Cameron Berg Nov 18, 2024, 7:36 PM
5 points
3
in reply to: Simon Fischer’s comment on: Making a conservative case for alignment
Note this is not equivalent to saying ‘we’re almost certainly going to get AGI during Trump’s presidency,’ but rather that there will be substantial developments that occur during this period that prove critical to AGI development (which, at least to me, does seem almost certainly true).

Cameron Berg Nov 18, 2024, 7:33 PM
2 points
0
in reply to: deepthoughtlife’s comment on: Making a conservative case for alignment
One thing that seems strangely missing from this discussion is that alignment is in fact, a VERY important CAPABILITY that makes it very much better. But the current discussion of alignment in the general sphere acts like ‘alignment’ is aligning the AI with the obviously very leftist companies that make it rather than with the user!
Agree with this—we do discuss this very idea at length here and also reference it throughout the piece.
That alignment is to the left is one of just two things you have to overcome in making conservatives willing to listen. (The other is obviously the level of danger.)
I think this is a good distillation of the key bottlenecks and seems helpful for anyone interacting with lawmakers to keep in mind.

Cameron Berg Nov 16, 2024, 12:15 AM
13 points
13
in reply to: Casey_’s comment on: Making a conservative case for alignment
Whether one is an accelerationist, Pauser, or an advocate of some nuanced middle path, the prospects/goals of everyone are harmed if the discourse-landscape becomes politicized/polarized.
...
I just continue to think that any mention, literally at all, of ideology or party is courting discourse-disaster for all, again no matter what specific policy one is advocating for.
...
Like a bug stuck in a glue trap, it places yet another limb into the glue in a vain attempt to push itself free.
I would agree in a world where the proverbial bug hasn’t already made any contact with the glue trap, but this very thing has clearly already been happening for almost a year in a troubling direction. The political left has been fairly casually ‘Everything-Bagel-izing’ AI safety, largely in smuggling in social progressivism that has little to do with the core existential risks, and the right, as a result, is increasingly coming to view AI safety as something approximating ‘woke BS stifling rapid innovation.’ The fly is already a bit stuck.
The point we are trying to drive home here is precisely what you’re also pointing at: avoiding an AI-induced catastrophe is obviously not a partisan goal. We are watching people in DC slowly lose sight of this critical fact. This is why we’re attempting to explain here why basic AI x-risk concerns are genuinely important regardless of one’s ideological leanings. ie, genuinely important to left-leaning and right-leaning people alike. Seems like very few people have explicitly spelled out the latter case, though, which is why we thought it would be worthwhile to do so here.

Making a conservative case for alignment

Cameron Berg, Judd Rosenblatt, phgubbins and AE Studio

Nov 15, 2024, 6:55 PM

208 points

67 comments7 min readLW link

Cameron Berg Nov 3, 2024, 2:54 PM
17 points
2
in reply to: Kaj_Sotala’s comment on: Science advances one funeral at a time
Interesting—this definitely suggests that Planck’s statement probably shouldn’t be taken literally/at face value if it is indeed true that some paradigm shifts have historically happened faster than generational turnover. It may still be possible that this may be measuring something slightly different than the initial ‘resistance phase’ that Planck was probably pointing at.
Two hesitations with the paper’s analysis:
(1) by only looking at successful paradigm shifts, there might be a bit of a survival bias at play here (we’re not hearing about the cases where a paradigm shift was successfully resisted and never came to fruition).
(2) even if senior scientists in a field may individually accept new theories, institutional barriers can still prevent that theory from getting adequate funding, attention, exploration. I do think Anthony’s comment below nicely captures how the institutional/sociological dynamics in science seemingly differ substantially from other domains (in the direction of disincentivizing ‘revolutionary’ exploration).

Cameron Berg Nov 2, 2024, 4:27 PM
4 points
0
in reply to: Dmitry Vaintrob’s comment on: Science advances one funeral at a time
Thanks for this! Completely agree that there are Type I and II errors here and that we should be genuinely wary of both. Also agree with your conclusion that ‘pulling the rope sideways’ is strongly preferred to simply lowering our standards. The unconventional researcher-identification approach undertaken by the HHMI might be a good proof of concept for this kind of thing.

Cameron Berg Nov 2, 2024, 4:13 PM
2 points
2
in reply to: Andy_McKenzie’s comment on: Science advances one funeral at a time
I think you might be taking the quotation a bit too literally—we are of course not literally advocating for the death of scientists, but rather highlighting that many of the largest historical scientific innovations have been systematically rejected by one’s contemporaries in their field.
Agree that scientists change their minds and can be convinced by sufficient evidence, especially within specific paradigms. I think the thornier problem that Kuhn and others have pointed out is that the introduction of new paradigms into a field are very challenging to evaluate for those who are already steeped in an existing paradigm, which tends to cause these people to reject, ridicule, etc those with strong intuitions for new paradigms, even when they demonstrate themselves in hindsight to be more powerful or explanatory than existing ones.

Science advances one funeral at a time

Cameron Berg, Judd Rosenblatt, Diogo de Lucena and AE Studio

Nov 1, 2024, 11:06 PM

100 points

9 comments2 min readLW link

Cameron Berg Oct 24, 2024, 5:49 PM
4 points
0
in reply to: Steven Byrnes’s comment on: Self-prediction acts as an emergent regularizer
Thanks for this! Consider the self-modeling loss gradient: $\partial L_{s e l f} / \partial W = 2 (W A - A) A^{⊤} = 2 E A A^{⊤}$ . While the identity function would globally minimize the self-modeling loss with zero loss for all inputs (effectively eliminating the task’s influence by zeroing out its gradients), SGD learns local optima rather than global optima, and the gradients don’t point directly toward the identity solution. The gradient depends on both the deviation from identity ( $E$ ) and the activation covariance ( $A A^{⊤}$ ), with the network balancing this against the primary task loss. Since the self-modeling prediction isn’t just a separate output block—it’s predicting the full activation pattern—the interaction between the primary task loss, activation covariance structure ( $A A^{⊤}$ ), and need to maintain useful representations creates a complex optimization landscape where local optima dominate. We see this empirically in the consistent non-zero $W - I$ difference during training.

Cameron Berg Oct 24, 2024, 5:47 PM
5 points
0
in reply to: Steven Byrnes’s comment on: Self-prediction acts as an emergent regularizer
The comparison to activation regularization is quite interesting. When we write down the self-modeling loss $(^a - a)^{2}$ in terms of the self-modeling layer, we get $| | (W a + b - a) | |^{2} = | | (W - I) a + b | |^{2} = | | E a + b | |^{2}$ .

This does resemble activation regularization, with the strength of regularization attenuated by how far the weight matrix is from identity (the magnitude of $E$ ). However, due to the recurrent nature of this loss—where updates to the weight matrix depend on activations that are themselves being updated by the loss—the resulting dynamics are more complex in practice. Looking at the gradient $\partial L_{s e l f} / \partial W = 2 E A A^{⊤}$ , we see that self-modeling depends on the full covariance structure of activations, not just pushing them toward zero or any fixed vector. The network must learn to actively predict its own evolving activation patterns rather than simply constraining their magnitude.
Comparing the complexity measures (SD & RLCT) between self-modeling and activation regularization is a great idea and we will definitely add this to the roadmap and report back. And batch norm/other forms of regularization were not added.

Self-prediction acts as an emergent regularizer

Cameron Berg, Judd Rosenblatt, Mike Vaiana, Diogo de Lucena, florin_pop and AE Studio

Oct 23, 2024, 10:27 PM

91 points

9 comments4 min readLW link

Cameron Berg Oct 12, 2024, 7:17 PM
1 point
0
in reply to: TsviBT’s comment on: Overview of strong human intelligence amplification methods
I am not suggesting either of those things. You enumerated a bunch of ways we might use cutting-edge technologies to facilitate intelligence amplification, and I am simply noting that frontier AI seems like it will inevitably become one such technology in the near future.
On a psychologizing note, your comment seems like part of a pattern of trying to wriggle out of doing things the way that is hard that will work.
Completely unsure what you are referring to or the other datapoints in this supposed pattern. Strikes me as somewhat ad-hominem-y unless I am misunderstanding what you are saying.
AI helping to do good science wouldn’t make the work any less hard—it just would cause the same hard work to happen faster.
hard+works is better than easy+not-works
seems trivially true. I think the full picture is something like:
efficient+effective > inefficient+effective > efficient+ineffective > inefficient+ineffective
Of course agree that if AI-assisted science is not effective, it would be worse to do than something that is slower but effective. Seems like whether or not this sort of system could be effective is an empirical question that will be largely settled in the next few years.

Cameron Berg

Mis­tral Large 2 (123B) seems to ex­hibit al­ign­ment faking

Re­duc­ing LLM de­cep­tion at scale with self-other over­lap fine-tuning

Align­ment can be the ‘clean en­ergy’ of AI

Mak­ing a con­ser­va­tive case for alignment

Science ad­vances one funeral at a time

Self-pre­dic­tion acts as an emer­gent regularizer