Technical AI governance and safety researcher.
Gabe M
Four Phases of AGI
The Bitter Lesson for AI Safety Research
ML Safety Research Advice—GabeM
Congrats! Could you say more about why you decided to add evaluations in particular as a new week?
Do any of your experiments compare the sample efficiency of SFT/DPO/EI/similar to the same number of samples of simple few-shot prompting? Sorry if I missed this, but it wasn’t apparent at first skim. That’s what I thought you were going to compare from the Twitter thread: “Can fine-tuning elicit LLM abilities when prompting can’t?”
What do you think about pausing between AGI and ASI to reap the benefits while limiting the risks and buying more time for safety research? Is this not viable due to economic pressures on whoever is closest to ASI to ignore internal governance, or were you just not conditioning on this case in your timelines and saying that an AGI actor could get to ASI quickly if they wanted?
Thanks! I wouldn’t say I assert that interpretability should be a key focus going forward, however—if anything, I think this story shows that coordination, governance, and security are more important in very short timelines.
Good point—maybe something like “Samantha”?
Ah, interesting. I posted this originally in December (hence the older comments), but then a few days ago I reposted it to my blog and edited this LW version to linkpost the blog.
It seems that editing this post from a non-link post into a link post somehow bumped its post date and pushed it to the front page. Maybe a LW bug?
Scale Was All We Needed, At First
Related work
Nit having not read your full post: Should you have “Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover” in the related work? My mind pattern-matched to that exact piece from reading your very similar title, so my first thought was how your piece contributes new arguments.
If true, this would be a big deal: if we could figure out how the model is distinguishing between basic feature directions and other directions, we might be able to use that to find all of the basic feature directions.
Or conversely, and maybe more importantly for interp, we could use this to find the less basic, more complex features. If that's possible, it might give us a better working definition of "concepts".
Suppose $a \wedge b$ has a natural interpretation as a feature that the model would want to track and do downstream computation with, e.g. if $a$ = "first name is Michael" and $b$ = "last name is Jordan" then $a \wedge b$ can be naturally interpreted as "is Michael Jordan". In this case, it wouldn't be surprising if the model computed this AND, $a \wedge b$, and stored the result along some direction $v_{a \wedge b}$ independent of $v_a$ and $v_b$. Assuming the model has done this, we could then linearly extract $a \oplus b$ with the probe
$$\sigma\left(\alpha\left(x \cdot v_a + x \cdot v_b\right) - 2\,x \cdot v_{a \wedge b} + \beta\right)$$
for some appropriate $\alpha$ and $\beta$.[7]
Should the $-2\,x \cdot v_{a \wedge b}$ be inside the inner parentheses, like $\sigma\left(\alpha\left(x \cdot v_a + x \cdot v_b - 2\,x \cdot v_{a \wedge b}\right) + \beta\right)$ for $a \oplus b$?
In the original equation, if $a$ AND $b$ are both present in $x$, the vectors $v_a$, $v_b$, and $v_{a \wedge b}$ would all contribute to a positive inner product with $x$, assuming each feature is written with a positive coefficient along its direction. However, for XOR we want the $v_a$ and $v_b$ inner products to be opposing the $v_{a \wedge b}$ inner product such that we can flip the sign inside the sigmoid in the AND case, right?
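For intuition, here's a quick numpy sketch (my own toy setup, not from the post: the orthonormal directions and the values alpha = 10, beta = -5 are made up) checking that the version with the $-2\,x \cdot v_{a \wedge b}$ term inside the inner parentheses recovers $a \oplus b$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy residual stream with three orthonormal "feature" directions
# (hypothetical stand-ins for v_a, v_b, and v_{a AND b}).
d = 64
v_a, v_b, v_and = np.linalg.qr(rng.normal(size=(d, 3)))[0].T

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xor_probe(x, alpha=10.0, beta=-5.0):
    # sigma(alpha * (x.v_a + x.v_b - 2 * x.v_and) + beta)
    return sigmoid(alpha * (x @ v_a + x @ v_b - 2.0 * (x @ v_and)) + beta)

for a in (0, 1):
    for b in (0, 1):
        # The model writes a, b, and a AND b along their own directions, plus noise.
        x = a * v_a + b * v_b + (a & b) * v_and + 0.01 * rng.normal(size=d)
        print(a, b, "->", round(float(xor_probe(x)), 3))
# Prints ~0 for (0,0) and (1,1) and ~1 for (0,1) and (1,0), i.e. a XOR b.
```

With the $-2\,x \cdot v_{a \wedge b}$ term left outside the $\alpha(\cdot)$ parentheses, a large $\alpha$ swamps it in the $(1,1)$ case and the argument of the sigmoid no longer flips sign there.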
Thanks! +1 on not over-anchoring—while this feels like a compelling 1-year timeline story to me, 1-year timelines don’t feel the most likely.
1 year is indeed aggressive; in my median reality I expect things slightly slower (3 years for all these things?). I'm unsure if lacking several of the advances I describe still allows this to happen, but in any case the main crux for me is "what does it take to speed up ML development by $N$ times, at which point 20-year human-engineered timelines become $20/N$-year timelines?"
Oops, yes meant to be wary. Thanks for the fix!
Ha, thanks! 😬
I like your grid idea. A simpler and possibly better-formed test is to use some[1] or all of the 57 categories of MMLU knowledge—then your unlearning target is one of the categories and your fact-maintenance targets are all other categories.
Ideally, you want the diagonal to be close to random performance (25% for MMLU) and the other values to be equal to the pre-unlearned model performance for some agreed-upon good model (say, Llama-2 7B). Perhaps a unified metric could be:
```
unlearning_benchmark = mean for unlearning_category c_u in all categories C:
    M_u = unlearning_procedure(M_orig, c_u)
    s_u = MMLU_score(M_u, c_u)  [2]
    unlearning_strength = strength_fn(s_u)  [3]
    control_retention = mean for control_category c in categories C \ {c_u}:
        return retention_fn(MMLU_score(M_u, c), MMLU_score(M_orig, c))  [4]
    return unlearning_strength * control_retention  [5]
```
An interesting thing about MMLU vs. a textbook is that if you require the method to only use the dev+val sets for unlearning, it has to somehow generalize to unlearning facts contained in the test set (c.f. a textbook might give you ~all the facts to unlearn). This generalization seems important to some safety cases where we want to unlearn everything in a category like "bioweapons knowledge" even if we don't know some of the dangerous knowledge we're trying to remove. (A rough Python sketch of this metric is below, after the footnotes.)
- ^
I say some because perhaps some MMLU categories are more procedural than factual or too broad to be clean measures of unlearning, or maybe 57 categories are too many for a single plot.
- ^
To detect underlying knowledge and not just the surface performance (e.g. a model trying to answer incorrectly when it knows the answer), you probably should evaluate MMLU by training a linear probe from the model’s activations to the correct test set answer and measure the accuracy of that probe.
- ^
We want this score to be 1 when the test score on the unlearning target is 0.25 (random chance) and to drop off above and below 0.25, since deviating from chance in either direction indicates the model still knows something about the right answers. See MMLU Unlearning Target | Desmos for graphical intuition.
- ^
Similarly, we want the control test score on the post-unlearning model to be the same as the score on the original model. I think this should drop off to 0 at 0.25 (random chance) and probably stay 0 below that, but I'm semi-unsure. See MMLU Unlearning Control | Desmos (you can drag the slider).
- ^
Maybe mean/sum instead of multiplication, though by multiplying we make it more important to score well on both unlearning strength and control retention.
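Putting the above together, here is a rough Python sketch of the full metric, assuming you already have per-category MMLU accuracies for the original and post-unlearning models. All names are mine, and the exact drop-off shapes (a linear tent for unlearning strength, a linear ramp for control retention) are just one guess at the curves in the Desmos links:

```python
import numpy as np

CHANCE = 0.25  # random-guess accuracy on 4-choice MMLU questions

def unlearning_strength(target_acc: float) -> float:
    """1.0 when the unlearned model scores exactly at chance on the target
    category, falling off linearly toward 0 both above and below chance."""
    if target_acc >= CHANCE:
        return max(0.0, 1.0 - (target_acc - CHANCE) / (1.0 - CHANCE))
    return max(0.0, target_acc / CHANCE)

def control_retention(unlearned_acc: float, original_acc: float) -> float:
    """1.0 when a control category's accuracy is unchanged, 0 at or below chance."""
    if original_acc <= CHANCE:
        return 0.0
    return float(np.clip((unlearned_acc - CHANCE) / (original_acc - CHANCE), 0.0, 1.0))

def unlearning_score(acc_before: dict, acc_after: dict, target: str) -> float:
    """acc_before / acc_after map MMLU category -> accuracy for the original
    and post-unlearning model; target is the category being unlearned."""
    strength = unlearning_strength(acc_after[target])
    retention = np.mean([control_retention(acc_after[c], acc_before[c])
                         for c in acc_before if c != target])
    return strength * float(retention)  # or average the two; see footnote [5]

# The overall benchmark would then be the mean of unlearning_score over all target categories.
```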
Thanks for your response. I agree we don't want unintentional unlearning of other desired knowledge, and benchmarks ought to measure this. Maybe the default way is just to run many downstream benchmarks, many more than just AP tests, and require that valid unlearning methods bound the change in each unrelated benchmark by less than X% (e.g. 0.1%).
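As a minimal sketch of that acceptance test (my own framing; I'm reading "X%" as an absolute change in accuracy, which is an assumption):

```python
def unlearning_is_valid(before: dict, after: dict, target: str, tol: float = 0.001) -> bool:
    """True iff every benchmark other than the unlearning target changed by at most
    tol (e.g. 0.001, i.e. 0.1 percentage points) after unlearning."""
    return all(abs(after[b] - before[b]) <= tol for b in before if b != target)
```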
practical forgetting/unlearning that might make us safer would probably involve subjects of expertise like biotech.
True in the sense of being a subset of biotech, but I imagine that, for most cases, the actual harmful stuff we want to remove is not all of biotech/chemical engineering/cybersecurity but rather small subsets of certain categories at finer granularities, like bioweapons/chemical weapons/advanced offensive cyber capabilities. That's to say I'm somewhat optimistic that the level of granularity we want is self-contained enough to not affect other useful and genuinely good capabilities. This depends on how dual-use you think general knowledge is, though, and on whether it's actually possible to separate dangerous knowledge from other useful knowledge.
Traditionally, most people seem to do this through academic means, i.e. take those 1-2 courses at a university, then find fellow students in the course or grad students at the school interested in the same kinds of research as you and ask if they'd like to work together. In this digital age, you can also do this over the internet so you're not restricted to your local environment.
Nowadays, ML safety in particular has various alternative paths to finding collaborators and mentors:
MATS, SPAR, and other research fellowships
Post about the research you’re interested in on online fora or contact others who have already posted
Find some local AI safety meetup group and hang out with them