Oliver Daniels-Koch

Karma: 86

PhD Student at UMass Amherst

Oliver Daniels-Koch 18 Oct 2024 17:42 UTC
8 points
4
on: Oliver Daniels-Koch’s Shortform
I’m curious if Redwood would be willing to share a kind of “after action report” on why they stopped working on ELK/heuristic argument inspired stuff (e.g. Causal Scrubbing, Path Patching, Generalized Wick Decompositions, Measurement Tampering).

My impression is that it is some mix of:
a. Control seems great
b. Heuristic arguments is a bad bet (for some of the reasons mech interp is a bad bet)
c. ARC has it covered
But the weighting is pretty important here. If it’s
a. more people should be working on heuristic argument inspired stuff.
b. fewer people should be working on heuristic argument inspired stuff (i.e. ARC employees should quit, or at least people shouldn’t take jobs at ARC)
c. people should try to work at ARC if they’re interested, but it’s going to be difficult to make progress, especially for e.g. a typical ML PhD student interested in safety.
Ultimately people should come to their own conclusions, but Redwood’s considerations would be pretty valuable information.

Oliver Daniels-Koch 21 Aug 2024 2:05 UTC
5 points
4
on: AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work
(The community often calls this “scalable oversight”, but we want to be clear that this does not necessarily include scaling to large numbers of situations, as in monitoring.)
I like this terminology and think the community should adopt it

Oliver Daniels-Koch 7 Aug 2024 5:22 UTC
1 point
0
in reply to: Joseph Miller’s comment on: The Residual Expansion: A Framework for thinking about Transformer Circuits
Just to make it explicit and check my understanding: the residual decomposition is equivalent to the edge / factorized view of the transformer, in that we can express any term in the residual decomposition as a set of edges that form a path from input to output, e.g.
$Id$ = input → output
$(\mathrm{Attn}_{4}^{3} \circ \mathrm{MLP}_{2} \circ \mathrm{Attn}_{1}^{0})$ = input → Attn 1.0 → MLP 2 → Attn 4.3 → output
And it follows that the (pre final layernorm) output of a transformer is the sum of all the “paths” from input to output constructed from the factorized DAG.
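To sanity check the "sum of all paths" claim, here is a minimal sketch under a strong simplifying assumption: every component is a linear map writing into a shared residual stream (no layernorm, no nonlinearities, no attention pattern). The component names mirror the example above, but the matrices and the helper names `forward` and `path_sum` are made up for illustration. In this linear toy case the ordinary forward pass and the sum over all input → output paths agree exactly.

```python
from itertools import combinations

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def vadd(u, v):
    """Elementwise vector sum."""
    return [a + b for a, b in zip(u, v)]

# Hypothetical linear components on a 2-d residual stream, listed in layer order.
components = {
    "Attn 1.0": [[0.5, 0.0], [0.0, 0.5]],
    "MLP 2": [[0.0, 1.0], [1.0, 0.0]],
    "Attn 4.3": [[2.0, 0.0], [0.0, -1.0]],
}
names = list(components)

def forward(x):
    """Ordinary residual forward pass: x <- x + f(x) for each component in order."""
    for name in names:
        x = vadd(x, matvec(components[name], x))
    return x

def path_sum(x):
    """Sum the contribution of every input -> output path in the factorized DAG."""
    total = [0.0] * len(x)
    paths = []
    for k in range(len(names) + 1):
        # combinations() keeps layer order, so each subset is one valid path
        for subset in combinations(names, k):
            v = x
            for name in subset:
                v = matvec(components[name], v)
            total = vadd(total, v)
            paths.append(" → ".join(("input",) + subset + ("output",)))
    return total, paths

x = [1.0, 2.0]
total, paths = path_sum(x)
print(len(paths), "paths")
print("forward pass:", forward(x))
print("sum of paths:", total)
```

With three components there are 2^3 = 8 paths, from the identity path (input → output) up to the path through all three. With nonlinear components the expansion is no longer a literal sum of independent compositions, so this only illustrates the bookkeeping.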

Oliver Daniels-Koch 20 Jun 2024 17:50 UTC
3 points
0
on: Benchmarks for Detecting Measurement Tampering [Redwood Research]
For anyone trying to replicate / try new methods, I posted a diamonds “pure prediction model” to Hugging Face: https://huggingface.co/oliverdk/codegen-350M-mono-measurement_pred (GitHub repo here: https://github.com/oliveradk/measurement-pred/tree/master)

Oliver Daniels-Koch 5 Jun 2024 13:00 UTC
3 points
2
on: Oliver Daniels-Koch’s Shortform
Just read “Situational Awareness”; it definitely woke me up. AGI is real, and very plausibly (55%?) happening within this decade. I need to stop sleepwalking and get serious about contributing within the next two years.
First, some initial thoughts on the essay
- Very “epic” and (self?) aggrandizing. If you believe the conclusions, it’s not unwarranted, but I worry a bit about narratives that satiate some sense of meaning and self-importance. (That counter-reaction is probably much stronger though, and on the margin it seems really valuable to “full-throatedly” take on the prospect of AGI within the next 3-5 years)
- I think most of my uncertainty lies in the “unhobbling” type of algorithmic progress. This seems especially unpredictable, and may require lots of expensive experimentation if e.g. the relevant capabilities to get some meta-cognitive process to train only emerge at a certain scale. I’m vaguely thinking back to Paul’s post on self-driving cars and AGI timelines. Maybe this is all priced in though: there’s way more research investment, and the tech path seems relatively straightforward if we can apply enough experimentation. Still, research is hard, takes a lot of serial time, and is less predictable than e.g. industrial processes. (I’m kind of just saying this though, not actually sure how to quantify this, I’m pretty sure people have analysis of insight generation or whatever, idk...)

Oliver Daniels-Koch 24 May 2024 0:53 UTC
3 points
0
on: Oliver Daniels-Koch’s Shortform
I previously thought the argument for measurement tampering being more tractable than general ELK was mostly about the structural / causal properties of multiple independent measurements, but I think I’m more swayed by the argument that measurement tampering will just be more obvious (both easier to see using interpretability and more anomalous in general) than e.g. sycophancy. This is a flimsier argument though, and is less likely to hold when tampering is more subtle.

Oliver Daniels-Koch 9 May 2024 14:27 UTC
1 point
0
in reply to: Erik Jenner’s comment on: Oliver Daniels-Koch’s Shortform
Here’s a revised sketch

A few notes:
- I use Scalable Oversight to refer to both Alignment and Control
- I’m confused whether weak to strong learning is a restatement of scalable oversight, ELK, or its own thing, so I ignore it
- I don’t explicitly include easy-to-hard, I think OOD basically covers it
- taxonomies and abstractions are brittle and can be counterproductive
Scalable Oversight Taxonomy
- Scalable Oversight
  - Scalable Alignment
    Benchmarks / Tasks
    Sandwiching Experiments (human amateurs + model, gt from human experts)
    Weak models supervising Strong models
    Approaches
    Debate
    Recursive reward modeling
    (Solution to Eliciting Latent Knowledge) + Narrow Elicitation
    (Note: I think this assumes, more than prior scalable oversight ideas, that there will be a base model with adequate knowledge, such that the hard part is extracting the knowledge rather than teaching the model)
    Eliciting Latent Knowledge
    Approaches
    Contrast Consistent Search
    Confidence
    Intermediate Probing
    “Speed Prior”
    “Simplicity Prior”
    Concept Extrapolation—learn all salient generalizations, use expensive supervision to select correct one
    IID Mechanistic Anomaly Detection + expensive supervision on anomalies
    Subclasses
    Measurement Tampering Detection
    Approaches
    OOD Mechanistic Anomaly Detection
    In distribution
    Out of Distribution (likely? requires multiple measurement structure)
    Concept Extrapolation
    train diverse probes on untrusted data, select probe that predicts positive measurements less frequently
    Narrow Elicitation
    ...
  - Scalable Control
    Weak Review
    Untrusted Rephrase or whatever
    Coup probes
    MAD (Review all anomalies)
- Trojans
  - ...
  - MAD (maybe?)
- Adversarial Examples
  - ...
  - MAD (maybe?)
- Natural Mechanism Distinction
  - MAD
- Spurious Correlate Detection / Resolution
  - Concept Extrapolation

Oliver Daniels-Koch 9 May 2024 13:30 UTC
1 point
0
in reply to: Oliver Daniels-Koch’s comment on: Oliver Daniels-Koch’s Shortform
I think I’m mostly right, but using a somewhat confused frame.
It makes more sense to think of MAD approaches as detecting all abnormal reasons (including deceptive alignment) by default, and then if we get that working we’ll try to decrease false anomalies by doing something like comparing the least common ancestor of the measurements in a novel mechanism to the least common ancestor of the measurements on trusted mechanisms.

Oliver Daniels-Koch 9 May 2024 12:19 UTC
1 point
0
on: Oliver Daniels-Koch’s Shortform
One confusion I have with MAD as an approach to ELK is that it seems to assume some kind of initial inner alignment. If we’re flagging when the model takes actions / makes predictions for “unusual reasons”, where unusual is defined with respect to some trusted set, but aligned and misaligned models are behaviorally indistinguishable on the trusted set, then a model could learn to do things for misaligned reasons on the trusted set, and then use those same reasons on the untrusted set. For example, a deceptively aligned model would appear aligned in training but attempt takeover in deployment for the “same reason” (e.g. to maximize paperclips), but a MAD approach that “properly” handles out of distribution cases would not flag takeover attempts because we want models to be able to respond to novel situations.
I guess this is part of what motivates measurement tampering as a subclass of ELK: instead of trying to track the motivations of the agent as reasons, we try to track the reasons for the measurement predictions, and we have some trusted set with no tampering, where we know the reason for the measurements is ~exactly the thing we want to be measuring.
Now time to check my answer by rereading https://www.alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk

Oliver Daniels-Koch 8 May 2024 23:16 UTC
6 points
0
on: Oliver Daniels-Koch’s Shortform
Clarifying the relationship between mechanistic anomaly detection (MAD), measurement tampering detection (MTD), weak to strong generalization (W2SG), weak to strong learning (W2SL), and eliciting latent knowledge (ELK). (Nothing new or interesting here, I just often lose track of these relationships in my head)
eliciting latent knowledge is an approach to scalable oversight which hopes to use the latent knowledge of a model as a supervision signal or oracle.
weak to strong learning is an experimental setup for evaluating scalable oversight protocols, and is a class of sandwiching experiments
weak to strong generalization is a class of approaches to ELK which relies on generalizing a “weak” supervision signal to more difficult domains using the inductive biases and internal structure of the strong model.
measurement tampering detection is a class of weak to strong generalization problems, where the “weak” supervision consists of multiple measurements which are sufficient for supervision in the absence of “tampering” (where tampering is not yet formally defined)
mechanistic anomaly detection is an approach to ELK, where examples are flagged as anomalous if they cause the model to do things for “different reasons” than on a trusted dataset, where “different reasons” are defined w.r.t. internal model cognition and structure.
mechanistic anomaly detection methods that work for ELK should also probably work for other problems (such as backdoor detection and adversarial example detection)
so when developing benchmarks for mechanistic anomaly detection, we want to test methods on standard machine learning security problems (adversarial examples and trojans) that have similar structure to scalable oversight problems, and to compare them against other ELK approaches (e.g. CCS) and other scalable oversight approaches (e.g. debate)
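To make the "flag examples relative to a trusted dataset" framing concrete, here is a minimal sketch of a much simpler stand-in: a distance-based detector on internal activations (a diagonal Mahalanobis distance). This is not the heuristic-arguments notion of "different reasons", just a common baseline with the same interface. The activation vectors, the threshold rule, and the names `fit_trusted`, `anomaly_score` and `is_anomalous` are all assumptions for illustration.

```python
import math

def fit_trusted(activations):
    """Per-dimension mean and variance of activations on the trusted set."""
    n = len(activations)
    dims = len(activations[0])
    mean = [sum(a[d] for a in activations) / n for d in range(dims)]
    # small constant keeps the variance strictly positive
    var = [
        sum((a[d] - mean[d]) ** 2 for a in activations) / n + 1e-6
        for d in range(dims)
    ]
    return mean, var

def anomaly_score(activation, mean, var):
    """Diagonal Mahalanobis distance from the trusted activation distribution."""
    return math.sqrt(
        sum((x - m) ** 2 / v for x, m, v in zip(activation, mean, var))
    )

# Hypothetical activations collected on trusted (known-clean) inputs.
trusted = [[1.0, 0.0], [1.2, 0.1], [0.9, -0.1], [1.1, 0.0]]
mean, var = fit_trusted(trusted)

# Assumed threshold rule: the largest score seen on the trusted set.
threshold = max(anomaly_score(a, mean, var) for a in trusted)

def is_anomalous(activation):
    """Flag an untrusted example whose activations sit far from the trusted set."""
    return anomaly_score(activation, mean, var) > threshold

print(is_anomalous([1.0, 0.05]))  # close to the trusted activations
print(is_anomalous([1.0, 3.0]))   # far from the trusted activations
```

The same `is_anomalous` interface is what lets one detector be evaluated across backdoor detection, adversarial example detection, and measurement tampering benchmarks.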

Oliver Daniels-Koch 2 May 2024 20:41 UTC
1 point
0
AF
in reply to: Fabien Roger’s comment on: Benchmarks for Detecting Measurement Tampering [Redwood Research]
oh I see, by all(sensor_preds) I meant sum(logit_i for i in range(n_sensors)) (the probability that all sensors are activated). Makes sense, thanks!

Oliver Daniels-Koch 1 May 2024 18:12 UTC
1 point
0
AF
in reply to: Fabien Roger’s comment on: Benchmarks for Detecting Measurement Tampering [Redwood Research]
is individual measurement prediction AUROC a) or b)
a) mean(AUROC(sensor_i_pred, sensor_i))
b) AUROC(all(sensor_preds), all(sensors))
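To spell out the difference between the two readings, here is a minimal sketch on made-up toy data. The `auroc` helper is a plain pairwise implementation (ties count as half), and for b) I assume the aggregate score is the product of per-sensor probabilities, which is only one possible choice. None of the names or numbers come from the paper or its code.

```python
import math

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties = 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: rows are examples, columns are sensors.
sensor_preds = [
    [0.9, 0.8, 0.7],
    [0.8, 0.3, 0.9],
    [0.2, 0.6, 0.1],
    [0.7, 0.2, 0.95],
]
sensors = [
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 1],
]
n_sensors = 3

def column(rows, i):
    """Values of sensor i across all examples."""
    return [row[i] for row in rows]

# a) mean over sensors of the per-sensor AUROC
auroc_a = sum(
    auroc(column(sensor_preds, i), column(sensors, i)) for i in range(n_sensors)
) / n_sensors

# b) a single AUROC on the aggregate "all sensors positive" label,
#    scored by the product of per-sensor probabilities (an assumption)
all_scores = [math.prod(row) for row in sensor_preds]
all_labels = [all(row) for row in sensors]
auroc_b = auroc(all_scores, all_labels)

print("a) mean per-sensor AUROC:", auroc_a)
print("b) AUROC on all(sensors):", auroc_b)
```

On this toy data the two numbers differ, because a single badly predicted sensor only lowers one of the three per-sensor AUROCs in a) but can reorder whole examples in b).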

Oliver Daniels-Koch 25 Apr 2024 12:40 UTC
1 point
0
in reply to: Fabien Roger’s comment on: Benchmarks for Detecting Measurement Tampering [Redwood Research]
looking at your code—seems like there’s an option for next-token prediction in the initial finetuning state, but no mention (that I can find) in the paper—am I correct in assuming the next token prediction weight was set to 0? (apologies for bugging you on this stuff!)

Oliver Daniels-Koch 24 Apr 2024 1:11 UTC
1 point
0
in reply to: Fabien Roger’s comment on: Benchmarks for Detecting Measurement Tampering [Redwood Research]
yup, sorry for missing that, thanks!

Oliver Daniels-Koch 22 Apr 2024 1:47 UTC
LW: 3 AF: 2
0
AF
on: Benchmarks for Detecting Measurement Tampering [Redwood Research]
did the paper report accuracy of the pure prediction model (on the pure prediction task)? (trying to replicate and want a sanity check).

Oliver Daniels-Koch 1 Apr 2024 2:37 UTC
1 point
0
in reply to: Aaron_Scher’s comment on: Aaron_Scher’s Shortform
I think mechanistic anomaly detection (mostly ARC but also Redwood and some forthcoming work) is importantly different than robustness (though clearly related).

Oliver Daniels-Koch 17 Mar 2024 19:01 UTC
1 point
0
in reply to: Oliver Daniels-Koch’s comment on: Oliver Daniels-Koch’s Shortform
(from conversation with Erik Jenner) roughly 3 classes of applications
1. MTD all the way down
  1. Come up with a bunch of measurements of things going well (humans self-report being very happy, news stories are good or something). Use GPT-N to predict measurements and detect measurement tampering, conditioned on the proposed actions of the policy. Reward actions that GPT-N predicts will increase measures of things going well without tampering.
2. Other Scalable Oversight + MTD as reward function / side constraint
  1. Use whatever scalable oversight protocol seems good (Debate, IDA, etc) for the “primary” training signal, while also setting up a bunch of measurements of safety properties (e.g. integrity of the binaries on the server the model is hosted). Do RL on the scalable oversight signal while constantly screening for and negatively rewarding actions flagged as tampering
3. Other Scalable Oversight + MTD as extra safety check
  1. same as before, but you’re leaning more on the scalable oversight to avoid tampering in most cases, and detecting measurement tampering as a precautionary step (i.e. if you detect measurement tampering, you Pause/Stop)
(some ways in which these classes overlap, e.g. using your scalable oversight protocol as the measurement(s))
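The three classes above differ mainly in how the tampering flag enters the training loop. A rough sketch of that difference, assuming we already have some predicted measurements, a scalar reward from the scalable oversight protocol, and a boolean tampering flag from an MTD method; every name here (`StepResult`, `penalty`, the three functions) is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    reward: float  # training signal passed to the RL update
    halt: bool     # whether training / deployment should Pause/Stop

def mtd_all_the_way_down(predicted_measurements, tamper_flag):
    """Class 1: reward is the predicted measurements, counted only without tampering."""
    if tamper_flag:
        return StepResult(reward=0.0, halt=False)
    mean = sum(predicted_measurements) / len(predicted_measurements)
    return StepResult(reward=mean, halt=False)

def mtd_side_constraint(oversight_reward, tamper_flag, penalty=1.0):
    """Class 2: oversight protocol gives the reward, flagged tampering is penalized."""
    reward = oversight_reward - penalty if tamper_flag else oversight_reward
    return StepResult(reward=reward, halt=False)

def mtd_safety_check(oversight_reward, tamper_flag):
    """Class 3: reward is untouched, but any flagged tampering triggers a stop."""
    return StepResult(reward=oversight_reward, halt=tamper_flag)

print(mtd_all_the_way_down([0.5, 1.0], tamper_flag=False))
print(mtd_side_constraint(0.5, tamper_flag=True))
print(mtd_safety_check(0.5, tamper_flag=True))
```

The overlap mentioned above shows up here too: if the oversight protocol's output is itself treated as one of the measurements, class 2 collapses into class 1.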

Oliver Daniels-Koch 17 Mar 2024 17:24 UTC
3 points
0
on: Oliver Daniels-Koch’s Shortform
I’ve been thinking a little more about the high-level motivation of measurement tampering, and struggling to think through when measurement tampering detection itself is actually going to be useful for mitigating x-risk. Like, is human/AI feedback considered a robust measurement device? If no, then what is the most alignment-relevant domain MTD could be applied to? If yes, do the structural properties of measurement that supposedly make it easier than general ELK still hold?

Oliver Daniels-Koch 7 Mar 2024 16:24 UTC
5 points
1
in reply to: beren’s comment on: Many arguments for AI x-risk are wrong
Strongly agree, and also want to note that wire-heading is (almost?) always a (near?) optimal policy—i.e. trajectories that tamper with the reward signal and produce high reward will be strongly upweighted, and insofar as the model has sufficient understanding/situational awareness of the reward process and some reasonable level of goal-directedness, this upweighting could plausibly induce a policy explicitly optimizing the reward.

Oliver Daniels-Koch 4 Oct 2023 0:08 UTC
1 point
0
on: Benchmarks for Detecting Measurement Tampering [Redwood Research]
Another (more substantive) question. Again from section 2.1.2
In the validation set, we exclude data points where the diamond is there, the measurements are positive, but at least one of the measurements would have been positive if the diamond wasn’t there, since both diamond detectors and tampering detectors can be used to remove incentives to tamper with measurements. We keep them in the train set, and they account for 50% of the generated data.
Is this (just) because the agent would get rewarded for measurements reading that the diamond is present? I think I can imagine cases where agents are incentivized to tamper with measurements even when the diamond is present, to make the task of distinguishing tampering more difficult.