scasper

Karma: 2,003

https://stephencasper.com/

scasper Jan 26, 2024, 12:31 AM
LW: 6 AF: 5
1
AF
on: The case for ensuring that powerful AIs are controlled
I have been thinking a lot lately about evals and what difference black- vs. white-box access makes for evaluators. I was confused about the appendix on black-box evals. My best attempt to make sense of it hinges on two instances of the authors intending something different from what was written.
First, the appendix hinges somewhat on this point.
For now, we think the easiest way to build control techniques that avoid making incorrect assumptions about scheming models is to make conservative assumptions about their internals, by rejecting any strategy which can be evaded by a red-team-constructed model.
I agree with this, but saying that “control evals should be black-box” is a non sequitur. The quoted passage makes the point that bad reverse-engineering of the model is bad, and it seems to implicitly assume that good reverse-engineering is hard/unlikely.
Second, even if good reverse-engineering is hard/unlikely, reverse-engineering is not the definition of a white-box eval. There are a lot of things that you can do with white-box access for evals other than interpretability-based things: gradient-based attacks, gradient-guided attacks, latent-space attacks, and finetuning—and all of these things actually seem plausibly useful for control evals. Plus I don’t see why interp should be off limits for evaluators who know the limitations of the tools they are using.
Ultimately, white-box access offers strictly more options for any type of eval than black-box access does. For any property that an AI system might have, if there are two teams of competent people trying to verify whether the system has that property, all else equal, I’d bet on the success of the team that has more access.
So if I understand things correctly, and if the intended point of this appendix is that “Interpretability-based evals for control seem unreliable and possibly could lead to pitfalls,” I would recommend just saying that more directly.

scasper Dec 9, 2023, 3:56 AM
LW: 2 AF: 1
0
AF
in reply to: Gabe M’s comment on: Deep Forgetting & Unlearning for Safely-Scoped LLMs
Thanks!

I intuit that what you mentioned as a feature might also be a bug. I think that practical forgetting/unlearning that might make us safer would probably involve subjects of expertise like biotech. And if so, then we would want benchmarks that measure a method’s ability to forget/unlearn just the things key to that domain and nothing else. For example, if a method succeeds in unlearning biotech but makes the target LM also unlearn math and physics, then we should be concerned about that, and we probably want benchmarks to help us quantify that.

I could imagine an unlearning benchmark, for example, with $n$ textbooks and $n$ AP tests. Then for each of $k$ different knowledge-recovery strategies, one could construct the $n \times n$ grid of how well the model performs on each target test for each unlearning textbook.
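
To make the grid concrete, here is a minimal sketch of how the scores could be organized. Everything in it is an assumption for illustration: the names `unlearning_grid`, `summarize`, `toy_evaluate`, and `RECOVERY`, the two strategy labels, and the made-up scores. A real benchmark would replace `toy_evaluate` with actual unlearning, recovery, and test-taking runs.

```python
import numpy as np

def unlearning_grid(n_subjects, strategies, evaluate):
    """Build a (k, n, n) array of scores.

    grid[s, i, j] is the score on test j after unlearning textbook i,
    measured under knowledge-recovery strategy s.
    """
    grid = np.zeros((len(strategies), n_subjects, n_subjects))
    for s, strategy in enumerate(strategies):
        for i in range(n_subjects):
            for j in range(n_subjects):
                grid[s, i, j] = evaluate(strategy, i, j)
    return grid

def summarize(grid):
    """Per strategy: mean score on the unlearned subject vs. on all others."""
    n = grid.shape[1]
    on_diag = np.eye(n, dtype=bool)
    target = grid[:, on_diag].mean(axis=1)   # low is good: the subject was forgotten
    other = grid[:, ~on_diag].mean(axis=1)   # high is good: little collateral damage
    return target, other

# Hypothetical stand-in for a real evaluation; the numbers are invented.
RECOVERY = {"none": 0.0, "few_shot": 0.25}

def toy_evaluate(strategy, unlearned, test):
    if unlearned == test:
        return 0.25 + RECOVERY[strategy]  # forgotten subject, partly recovered
    return 0.75                           # unrelated subject, left intact

grid = unlearning_grid(3, ["none", "few_shot"], toy_evaluate)
target, other = summarize(grid)
```

Under this layout, the diagonal of each $n \times n$ slice measures how well the target subject was forgotten, and the off-diagonal entries measure how much unrelated knowledge (e.g., math and physics when unlearning biotech) was lost along with it.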

scasper Dec 6, 2023, 4:20 AM
LW: 7 AF: 2
0
AF
in reply to: Thomas Kwa’s comment on: Deep Forgetting & Unlearning for Safely-Scoped LLMs
+1, I’ll add this and credit you.

scasper Dec 5, 2023, 9:09 PM
7 points
5
in reply to: aog’s comment on: Deep Forgetting & Unlearning for Safely-Scoped LLMs
+1
That said, the NeurIPS challenge and prior ML lit on forgetting and influence functions seem worth keeping on the radar because they’re still closely related to the challenges here.

scasper Dec 5, 2023, 8:10 PM
3 points
0
in reply to: Bogdan Ionut Cirstea’s comment on: Deep Forgetting & Unlearning for Safely-Scoped LLMs
Thanks! I edited the post to add a link to this.

Deep Forgetting & Unlearning for Safely-Scoped LLMs

scasper Dec 5, 2023, 4:48 PM

125 points

30 comments, 13 min read, LW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush and scasper

Nov 7, 2023, 5:59 PM

38 points

2 comments, 2 min read, LW link

(arxiv.org)

scasper Nov 4, 2023, 11:59 PM
5 points
0
in reply to: lemonhope’s comment on: The 6D effect: When companies take risks, one email can be very powerful.
A good critical paper about potentially risky industry norms is this one.

The 6D effect: When companies take risks, one email can be very powerful.

scasper Nov 4, 2023, 8:08 PM

277 points

42 comments, 3 min read, LW link

Announcing the CNN Interpretability Competition

scasper Sep 26, 2023, 4:21 PM

22 points

0 comments, 4 min read, LW link

scasper Sep 21, 2023, 6:37 AM
6 points
0
in reply to: LawrenceC’s comment on: Interpretability Externalities Case Study—Hungry Hungry Hippos
Thanks
To use your argument, what does MI actually do here?
The inspiration, I would suppose. Analogous to the type claimed in the HHH and hyena papers.
And yes to your second point.

scasper Sep 20, 2023, 4:23 PM
13 points
6
on: Interpretability Externalities Case Study—Hungry Hungry Hippos
Nice post. I think it can serve as a good example of how the hand-waviness about how interpretability can help us do good things with AI cuts both ways.
I’m particularly worried about MI people studying instances of when LLMs do and don’t express types of situational awareness and then someone using these insights to give LLMs much stronger situational awareness abilities.
Lastly,
On the other hand, interpretability research is probably crucial for AI alignment.
I don’t think this is true and I especially hope it is not true because (1) mechanistic interpretability still fails to do impressive things by trying to reverse engineer networks and (2) it is entirely fungible from a safety standpoint with other techniques that often do better for various things.

scasper Aug 29, 2023, 9:50 PM
LW: 2 AF: 2
1
AF
on: Barriers to Mechanistic Interpretability for AGI Safety
Several people seem to be coming to similar conclusions recently (e.g., this recent post).
I’ll add that I have as well and wrote a sequence about it :)

scasper Aug 18, 2023, 4:37 PM
5 points
3
in reply to: Richard_Ngo’s comment on: Against Almost Every Theory of Impact of Interpretability
Thanks for the reply. This sounds reasonable to me. On the last point, I tried my best to do that here, and I think there is a relatively high ratio of solid explanations to unsolid ones.
Overall, I think that the hopes you have for interpretability research are good, and I hope it works out. One of my biggest concerns, though, is that people seem to have been offering similar takes with little change for 7+ years. But I just don’t think there have been a number of wins from this research that are commensurate with the effort put into it. And I assume this is expected under your views, so probably not a crux.

scasper Aug 18, 2023, 4:14 PM
27 points
26
in reply to: Richard_Ngo’s comment on: Against Almost Every Theory of Impact of Interpretability
I get the impression of a certain kind of motte and bailey in this comment and similar arguments. At a high level, better understanding what neural networks are doing would be great. The problem, though, seems to be that most SOTA interpretability research does not seem to be doing a good job of this in a way that seems useful for safety anytime soon. In that sense, I think this comment talks past the points that this post is trying to make.

scasper Aug 16, 2023, 10:55 PM
LW: 3 AF: 3
0
AF
on: Mech Interp Challenge: August—Deciphering the First Unique Character Model
I think this is very exciting, and I’ll look forward to seeing how it goes!

scasper Aug 1, 2023, 5:00 AM
LW: 2 AF: 1
1
AF
in reply to: Khanh Nguyen’s comment on: Open Problems and Fundamental Limitations of RLHF
Thanks, we will consider adding each of these. We appreciate that you took a look and took the time to help suggest these!

scasper Aug 1, 2023, 3:16 AM
LW: 4 AF: 3
2
AF
in reply to: DanielFilan’s comment on: Open Problems and Fundamental Limitations of RLHF
No, I don’t think the core advantages of transparency are really unique to RLHF, but in the paper, we list certain things that are specific to RLHF which we think should be disclosed. Thanks.

scasper Jul 31, 2023, 9:41 PM
6 points
6
in reply to: Stephen McAleese’s comment on: Open Problems and Fundamental Limitations of RLHF
Thanks, and +1 to adding the resources. Also, Charbel-Raphael, who authored the in-depth post, is one of the authors of this paper! That post in particular was something we paid attention to during the design of the paper.

Open Problems and Fundamental Limitations of RLHF

scasper Jul 31, 2023, 3:31 PM

66 points

6 comments, 2 min read, LW link

(arxiv.org)
