This is neat, nice work!
I’m finding it quite hard to get a sense of what the actual Loss Recovered numbers you report mean, and to compare them concretely to other work. If possible, it’d be very helpful if you shared:
What the zero-ablation CE scores are for each model and SAE position. (I assume it’s much worse for the MLP and attention outputs than for the residual stream?)
What the baseline CE scores are for each model.
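For concreteness, the convention I have in mind when comparing these numbers is the usual one below (a quick sketch; the function and variable names are mine, so correct me if you compute it differently):

```python
def loss_recovered(ce_orig: float, ce_sae: float, ce_zero_abl: float) -> float:
    """Fraction of CE loss recovered by splicing in the SAE, relative to
    zero-ablating the hook point.

    1.0 = the SAE-inserted model matches the original model's CE loss,
    0.0 = it's as bad as zero-ablating the activations.
    """
    return (ce_zero_abl - ce_sae) / (ce_zero_abl - ce_orig)
```

With the baseline and zero-ablation CE scores per model/position, it'd be easy to convert between this ratio and raw CE deltas when comparing to other papers.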
Thanks Logan!
2. Unlike local SAEs, our e2e SAEs aren’t trained to reconstruct the current layer’s activations, so my expectation, at least, was that they would have worse reconstruction error at the current layer.
Improving training times wasn’t our focus for this paper, but I agree it would be interesting, and I expect there are big gains to be made by doing things like mixing training between the local and e2e+downstream objectives and/or training multiple SAEs at once (depending on how you do this, you may need to be more careful about the SAE-inserted model taking different computational pathways from the original network).
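To make the "mixing" idea a bit more concrete, here's roughly the kind of combined objective I have in mind. This is just a sketch we haven't tried: the SAE interface, coefficient names, and the choice of KL-to-original-logits as the downstream term are all assumptions on my part, not what we did in the paper.

```python
import torch.nn.functional as F

def mixed_sae_loss(acts, sae, logits_sae, logits_orig,
                   alpha=0.5, sparsity_coeff=1e-3):
    """Hypothetical mixed objective: interpolate between a local
    reconstruction term and an e2e term (KL between the SAE-inserted
    model's logits and the original model's logits), plus an L1
    sparsity penalty on the SAE feature activations."""
    # Assumed SAE interface: returns (feature activations, reconstruction).
    feats, recon = sae(acts)

    # Local term: reconstruct the current layer's activations.
    local_loss = F.mse_loss(recon, acts)

    # Downstream/e2e term: match the original model's output distribution.
    e2e_loss = F.kl_div(
        F.log_softmax(logits_sae, dim=-1),
        F.softmax(logits_orig, dim=-1),
        reduction="batchmean",
    )

    sparsity = feats.abs().sum(dim=-1).mean()
    return alpha * local_loss + (1 - alpha) * e2e_loss + sparsity_coeff * sparsity
```

Annealing alpha over training (local early, e2e later) is one obvious variant, but again, we haven't tested any of this.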
We didn’t iterate much on the e2e+downstream setup. I think it’s very likely that you could get similar performance with tweaks like the ones you suggested.