I give a counterargument to this in the typo-riddled, poorly written Tweet here. Sadly, I won’t have a chance to write these thoughts up more cleanly here for a few days.
ETA: Briefly, the key points are:
Honesty issues for introspection aren’t obviously much worse than they are for simple probing. (But fair if you’re already not excited about simple probing; see the sketch below for what I mean by a probe.)
When you can ask models arbitrary questions about their cognition, I think it’s probably quite difficult for a model to tell on which inputs it can get away with lying.
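For concreteness, here’s a minimal sketch of the kind of simple probing I have in mind: fit a linear classifier on a model’s hidden activations to predict honest vs. dishonest behavior. Everything below is a placeholder (random activations, made-up labels, arbitrary layer), not a real pipeline.

```python
# Minimal "simple probing" sketch: logistic regression on hidden activations.
# The activations and labels are synthetic stand-ins for real model data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 512

# Pretend these are residual-stream activations collected while the model
# answered questions, labeled honest (0) vs. dishonest (1).
honest_acts = rng.normal(loc=0.0, size=(200, d_model))
dishonest_acts = rng.normal(loc=0.3, size=(200, d_model))  # hypothetical shift

X = np.concatenate([honest_acts, dishonest_acts])
y = np.concatenate([np.zeros(200), np.ones(200)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The probe itself is just a linear classifier on the activations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))
```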
Something like this is the hope, though it’s a bit tricky: features that represent “human-expert-level intelligence” might be hard to distinguish from features for “actually correct” using only current feature interpretation techniques (mostly looking at maximally activating dataset exemplars). But it seems pretty plausible that we could develop better interpretation techniques that would be suitable here.
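For reference, here’s a rough sketch of the maximally-activating-exemplars technique I’m gesturing at: project a dataset’s activations onto one feature direction and read off the top-scoring examples. The feature vector and per-example activations below are placeholders (random data), standing in for, e.g., an SAE decoder column and real model activations.

```python
# Rough sketch of finding maximally activating dataset exemplars for a feature.
# All inputs here are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512
n_examples = 10_000

# Hypothetical per-example activations and a single feature direction.
activations = rng.normal(size=(n_examples, d_model))
feature_direction = rng.normal(size=d_model)

# Score each example by its projection onto the feature, take the top k.
scores = activations @ feature_direction
top_k = 10
top_idx = np.argsort(scores)[-top_k:][::-1]

# In practice you'd now read the text of these examples to guess what the
# feature represents -- which is exactly where "expert-level" and
# "actually correct" can be hard to tell apart.
print("indices of top-activating exemplars:", top_idx)
print("their scores:", scores[top_idx])
```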