PhD Student at UMass Amherst
Oliver Daniels
yeah I was mostly thinking neutral along the axis of “safety-ism” vs “accelerationism” (I think there’s a fairly straightforward right-wing bias on X, further exacerbated by Bluesky)
also see Cognitive Load Is What Matters
Two common failure modes to avoid when doing the legibly impressive things
1. Only caring instrumentally about the project (decreases motivation)
2. Doing “net negative” projects
Is the move of a lot of alignment discourse to Twitter/X a coordination failure or a positive development?
I’m kinda sad that LW seems less “alive” than it did a few years ago, but also seems healthy to be engaging in a more neutral space with a wider audience
Yeah it does seem unfortunate that there’s not a robust pipeline for tackling the “hard problem” (even conditioning on more “moderate” models of x-risk)
But (conditioned on “moderate” models) there’s still a lot of low-hanging fruit that STEM people from average universities (a group I count myself among) can pick. Like it seems good for Alice to bounce off of ELK and work on technical governance, and for Bob to make incremental progress on debate. The current pipeline/incentive system is still valuable, even if it systematically neglects tackling the “hard problem of alignment”.
still trying to figure out the “optimal” config setup. The “clean code” approach is roughly to have dedicated config files for different components that can be composed and overridden etc. (see, for example, https://github.com/oliveradk/measurement-pred). But I don’t like how far away these configs are from the main code. On the other hand, as the experimental setup gets more mature I often want to toggle across config groups. Maybe the solution is making a “mode” an optional config itself, with overrides within the main script.
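The “mode as optional config with overrides in the main script” idea could be sketched like this (all names here are made up for illustration, not taken from the linked repo):

```python
# Hypothetical sketch: configs as dataclasses, with an optional "mode" that
# applies a bundle of overrides inside the main script. Ad-hoc overrides
# (e.g. from the CLI) are applied last, so they win over the mode.
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class TrainConfig:
    lr: float = 1e-4
    batch_size: int = 32
    n_epochs: int = 10

# each "mode" is just a named bundle of overrides on top of the base config
MODES = {
    "debug": dict(batch_size=2, n_epochs=1),
    "full": dict(n_epochs=50),
}

def build_config(mode: Optional[str] = None, **overrides) -> TrainConfig:
    cfg = TrainConfig()
    if mode is not None:
        cfg = replace(cfg, **MODES[mode])
    return replace(cfg, **overrides)

cfg = build_config("debug", lr=3e-4)
print(cfg)  # TrainConfig(lr=0.0003, batch_size=2, n_epochs=1)
```

This keeps the config next to the main code while still giving you composable groups of overrides.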
just read both posts and they’re great (as is The Witness). It’s funny though, part of me wants to defend OOP—I do think there’s something to finding really good abstractions (even preemptively), but that it’s typically not worth it for self-contained projects with small teams and fixed time horizons (e.g. ML research projects, but also maybe indie games).
The builder-breaker thing isn’t unique to CoT though, right? My gloss on the recent Obfuscated Activations paper is something like “activation engineering is not robust to arbitrary adversarial optimization, and only somewhat robust to constrained adversarial optimization”.
thanks for the detailed (non-ML) example! exactly the kind of thing I’m trying to get at
Thanks! huh yeah the python interactive windows seems like a much cleaner approach, I’ll give it a try
thanks! yup, Cursor is notebook compatible
Write Good Enough Code, Quickly
Thanks!
I wish there was a bibTeX functionality for alignment forum posts...
Concrete Methods for Heuristic Estimation on Neural Networks
I’m curious if Redwood would be willing to share a kind of “after action report” for why they stopped working on ELK/heuristic argument inspired stuff (e.g. Causal Scrubbing, Path Patching, Generalized Wick Decompositions, Measurement Tampering)
My impression is that it’s some mix of:
a. Control seems great
b. Heuristic arguments is a bad bet (for some of the reasons mech interp is a bad bet)
c. ARC has it covered
But the weighting is pretty important here. If it’s
a. more people should be working on heuristic argument inspired stuff.
b. fewer people should be working on heuristic argument inspired stuff (i.e. ARC employees should quit, or at least people shouldn’t take jobs at ARC)
c. people should try to work at ARC if they’re interested, but it’s going to be difficult to make progress, especially for e.g. a typical ML PhD student interested in safety.
Ultimately people should come to their own conclusions, but Redwood’s considerations would be pretty valuable information.
(The community often calls this “scalable oversight”, but we want to be clear that this does not necessarily include scaling to large numbers of situations, as in monitoring.)
I like this terminology and think the community should adopt it
Just to make it explicit and check my understanding—the residual decomposition is equivalent to the edge / factorized view of the transformer, in that we can express any term in the residual decomposition as a set of edges that form a path from input to output, e.g.
= input → output
= input → Attn 1.0 → MLP 2 → Attn 4.3 → output
And it follows that the (pre final layernorm) output of a transformer is the sum of all the “paths” from input to output constructed from the factorized DAG.
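The “output = sum over all paths” claim can be checked numerically in a stripped-down linear toy model (my own illustration, with scalar “components” standing in for attention heads and MLPs):

```python
# Toy check: with purely linear components, each layer maps
# x_{l+1} = x_l + w_l * x_l, so the output is prod_l (1 + w_l) * x.
# Expanding that product gives one term per subset of components,
# i.e. one term per path through the factorized DAG.
from itertools import combinations
import math

w = [0.5, -0.3, 0.2]   # one linear "component" per layer
x = 2.0                # input

# forward pass: each layer adds its component's output to the residual stream
out = x
for wl in w:
    out = out + wl * out

# sum over all paths: each path routes through a subset of components
paths = 0.0
for k in range(len(w) + 1):
    for subset in combinations(w, k):
        # empty subset = the direct input → output path
        paths += math.prod(subset) * x

assert abs(out - paths) < 1e-12
```

Real transformers break the exact equality via layernorm and MLP nonlinearities, but the path-counting structure is the same.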
For anyone trying to replicate / try new methods, I posted a diamonds “pure prediction model” to huggingface https://huggingface.co/oliverdk/codegen-350M-mono-measurement_pred, (github repo here: https://github.com/oliveradk/measurement-pred/tree/master)
does anyone have thoughts on how to improve peer review in academic ML? From discussions with my advisor, my sense is that the system used to depend on word of mouth and people caring more about their academic reputation, which works in a field of hundreds of researchers but breaks down in fields of thousands or more. Seems like we need some kind of karma system to both rank reviewers and submissions. I’d be very surprised if nobody has proposed such a system, but a quick google search doesn’t yield results.
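One minimal version of the karma idea (entirely hypothetical, not an existing system): score reviewers by how well their past recommendations predicted final decisions, then rank submissions by a karma-weighted mean of their review scores.

```python
# Hypothetical sketch of a reviewer-karma system. A reviewer's karma is
# their historical agreement rate with final accept/reject decisions;
# a submission's rank is the karma-weighted mean of its review scores.

def reviewer_karma(history):
    """history: list of (recommended_accept: bool, final_accept: bool)."""
    if not history:
        return 0.5  # uninformative prior for new reviewers
    return sum(rec == final for rec, final in history) / len(history)

def submission_score(reviews):
    """reviews: list of (score, reviewer_karma); returns weighted mean."""
    total_weight = sum(k for _, k in reviews)
    return sum(s * k for s, k in reviews) / total_weight

karma_a = reviewer_karma([(True, True), (False, False), (True, False)])  # 2/3
karma_b = reviewer_karma([(True, True), (True, True)])                   # 1.0
print(submission_score([(6, karma_a), (9, karma_b)]))
```

Obvious failure modes (herding toward the consensus decision, punishing correct-but-contrarian reviewers) would need handling, but it shows the shape of the incentive.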
I think reforming peer review is probably underrated from a safety perspective (for reasons articulated here—basically bad peer review disincentivizes any rigorous analysis of safety research and degrades trust in the safety ecosystem)