PhD Student at UMass Amherst
Oliver Daniels
yeah I was mostly thinking neutral along the axis of “safety-ism” vs “accelerationism” (I think there’s a fairly straightforward right-wing bias on X, further exacerbated by Bluesky)
also see Cognitive Load Is What Matters
Two common failure modes to avoid when doing the legibly impressive things
1. Only caring instrumentally about the project (decreases motivation)
2. Doing “net negative” projects
Is the move of a lot of alignment discourse to Twitter/X a coordination failure or a positive development?
I’m kinda sad that LW seems less “alive” than it did a few years ago, but also seems healthy to be engaging in a more neutral space with a wider audience
Yeah it does seem unfortunate that there’s not a robust pipeline for tackling the “hard problem” (even conditioning on more “moderate” models of x-risk)
But (conditioned on “moderate” models) there’s still a lot of low-hanging fruit that STEM people from average universities (a group I count myself among) can pick. Like it seems good for Alice to bounce off of ELK and work on technical governance, and for Bob to make incremental progress on debate. The current pipeline/incentive system is still valuable, even if it systematically neglects tackling the “hard problem of alignment”.
still trying to figure out the “optimal” config setup. The “clean code” approach is roughly to have dedicated config files for different components that can be composed and overridden etc. (see, for example, https://github.com/oliveradk/measurement-pred). But I don’t like how far away these configs are from the main code. On the other hand, as the experimental setup gets more mature I often want to toggle across config groups. Maybe the solution is making a “mode” an optional config itself, with overrides within the main script.
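The “mode as optional config with overrides in the main script” idea could be sketched like this (all names here are made up for illustration, not taken from the linked repo):

```python
# Hypothetical sketch: configs as dataclasses, with an optional "mode" that
# applies a bundle of overrides inside the main script. Ad-hoc overrides
# (e.g. from the CLI) are applied last, so they win over the mode.
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class TrainConfig:
    lr: float = 1e-4
    batch_size: int = 32
    n_epochs: int = 10

# each "mode" is just a named bundle of overrides on top of the base config
MODES = {
    "debug": dict(batch_size=2, n_epochs=1),
    "full": dict(n_epochs=50),
}

def build_config(mode: Optional[str] = None, **overrides) -> TrainConfig:
    cfg = TrainConfig()
    if mode is not None:
        cfg = replace(cfg, **MODES[mode])
    return replace(cfg, **overrides)

cfg = build_config("debug", lr=3e-4)
print(cfg)  # TrainConfig(lr=0.0003, batch_size=2, n_epochs=1)
```

This keeps the config next to the main code while still giving you composable groups of overrides.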
just read both posts and they’re great (as is The Witness). It’s funny though, part of me wants to defend OOP—I do think there’s something to finding really good abstractions (even preemptively), but that it’s typically not worth it for self-contained projects with small teams and fixed time horizons (e.g. ML research projects, but also maybe indie games).
The builder-breaker thing isn’t unique to CoT though, right? My gloss on the recent Obfuscated Activations paper is something like “activation engineering is not robust to arbitrary adversarial optimization, and only somewhat robust to constrained adversarial optimization”.
thanks for the detailed (non-ML) example! exactly the kind of thing I’m trying to get at
Thanks! huh yeah the python interactive windows seems like a much cleaner approach, I’ll give it a try
thanks! yup, Cursor is notebook compatible
Write Good Enough Code, Quickly
Thanks!
I wish there was a bibTeX functionality for alignment forum posts...
Concrete Methods for Heuristic Estimation on Neural Networks
I’m curious if Redwood would be willing to share a kind of “after action report” for why they stopped working on ELK/heuristic argument inspired stuff (e.g. Causal Scrubbing, Path Patching, Generalized Wick Decompositions, Measurement Tampering)
My impression is that it’s some mix of:
a. Control seems great
b. Heuristic arguments is a bad bet (for some of the reasons mech interp is a bad bet)
c. ARC has it covered
But the weighting is pretty important here. If it’s
a. more people should be working on heuristic argument inspired stuff.
b. fewer people should be working on heuristic argument inspired stuff (i.e. ARC employees should quit, or at least people shouldn’t take jobs at ARC)
c. people should try to work at ARC if they’re interested, but it’s going to be difficult to make progress, especially for e.g. a typical ML PhD student interested in safety.
Ultimately people should come to their own conclusions, but Redwood’s considerations would be pretty valuable information.
(The community often calls this “scalable oversight”, but we want to be clear that this does not necessarily include scaling to large numbers of situations, as in monitoring.)
I like this terminology and think the community should adopt it
Just to make it explicit and check my understanding—the residual decomposition is equivalent to the edge / factorized view of the transformer, in that we can express any term in the residual decomposition as a set of edges that form a path from input to output, e.g.
= input → output
= input → Attn 1.0 → MLP 2 → Attn 4.3 → output
And it follows that the (pre final layernorm) output of a transformer is the sum of all the “paths” from input to output constructed from the factorized DAG.
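The “output = sum over all paths” claim can be checked numerically in a stripped-down linear toy model (my own illustration, with scalar “components” standing in for attention heads and MLPs):

```python
# Toy check: with purely linear components, each layer maps
# x_{l+1} = x_l + w_l * x_l, so the output is prod_l (1 + w_l) * x.
# Expanding that product gives one term per subset of components,
# i.e. one term per path through the factorized DAG.
from itertools import combinations
import math

w = [0.5, -0.3, 0.2]   # one linear "component" per layer
x = 2.0                # input

# forward pass: each layer adds its component's output to the residual stream
out = x
for wl in w:
    out = out + wl * out

# sum over all paths: each path routes through a subset of components
paths = 0.0
for k in range(len(w) + 1):
    for subset in combinations(w, k):
        # empty subset = the direct input → output path
        paths += math.prod(subset) * x

assert abs(out - paths) < 1e-12
```

Real transformers break the exact equality via layernorm and MLP nonlinearities, but the path-counting structure is the same.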
For anyone trying to replicate / try new methods, I posted a diamonds “pure prediction model” to huggingface https://huggingface.co/oliverdk/codegen-350M-mono-measurement_pred, (github repo here: https://github.com/oliveradk/measurement-pred/tree/master)
does anyone have thoughts on how to improve peer review in academic ML? From discussions with my advisor, my sense is that the system used to depend on word of mouth and people caring more about their academic reputation, which works in a field of hundreds of researchers but breaks down in fields of thousands or more. Seems like we need some kind of karma system to both rank reviewers and submissions. I’d be very surprised if nobody has proposed such a system, but a quick google search doesn’t yield results.
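One minimal version of the karma idea (entirely hypothetical, not an existing system): score reviewers by how well their past recommendations predicted final decisions, then rank submissions by a karma-weighted mean of their review scores.

```python
# Hypothetical sketch of a reviewer-karma system. A reviewer's karma is
# their historical agreement rate with final accept/reject decisions;
# a submission's rank is the karma-weighted mean of its review scores.

def reviewer_karma(history):
    """history: list of (recommended_accept: bool, final_accept: bool)."""
    if not history:
        return 0.5  # uninformative prior for new reviewers
    return sum(rec == final for rec, final in history) / len(history)

def submission_score(reviews):
    """reviews: list of (score, reviewer_karma); returns weighted mean."""
    total_weight = sum(k for _, k in reviews)
    return sum(s * k for s, k in reviews) / total_weight

karma_a = reviewer_karma([(True, True), (False, False), (True, False)])  # 2/3
karma_b = reviewer_karma([(True, True), (True, True)])                   # 1.0
print(submission_score([(6, karma_a), (9, karma_b)]))
```

Obvious failure modes (herding toward the consensus decision, punishing correct-but-contrarian reviewers) would need handling, but it shows the shape of the incentive.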
I think reforming peer review is probably underrated from a safety perspective (for reasons articulated here—basically bad peer review disincentivizes any rigorous analysis of safety research and degrades trust in the safety ecosystem)