Fabien Roger

Karma: 4,440

How to replicate and extend our alignment faking demo

Fabien Roger · 19 Dec 2024 21:44 UTC
104 points
5 comments · 2 min read · LW link
(alignment.anthropic.com)

Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
479 points
71 comments · 10 min read · LW link

A toy evaluation of inference code tampering

Fabien Roger · 9 Dec 2024 17:43 UTC
50 points
0 comments · 9 min read · LW link
(alignment.anthropic.com)

The case for unlearning that removes information from LLM weights

Fabien Roger · 14 Oct 2024 14:08 UTC
96 points
15 comments · 6 min read · LW link

[Question] Is cybercrime really costing trillions per year?

Fabien Roger · 27 Sep 2024 8:44 UTC
63 points
28 comments · 1 min read · LW link

An issue with training schemers with supervised fine-tuning

Fabien Roger · 27 Jun 2024 15:37 UTC
49 points
12 comments · 6 min read · LW link

Best-of-n with misaligned reward models for Math reasoning

Fabien Roger · 21 Jun 2024 22:53 UTC
25 points
0 comments · 3 min read · LW link

Memorizing weak examples can elicit strong behavior out of password-locked models

6 Jun 2024 23:54 UTC
58 points
5 comments · 7 min read · LW link

[Paper] Stress-testing capability elicitation with password-locked models

4 Jun 2024 14:52 UTC
85 points
10 comments · 12 min read · LW link
(arxiv.org)

Open consultancy: Letting untrusted AIs choose what answer to argue for

Fabien Roger · 12 Mar 2024 20:38 UTC
35 points
5 comments · 5 min read · LW link

Fabien’s Shortform

Fabien Roger · 5 Mar 2024 18:58 UTC
6 points
76 comments · 1 min read · LW link

Notes on control evaluations for safety cases

28 Feb 2024 16:15 UTC
49 points
0 comments · 32 min read · LW link

Protocol evaluations: good analogies vs control

Fabien Roger · 19 Feb 2024 18:00 UTC
42 points
10 comments · 11 min read · LW link

Toy models of AI control for concentrated catastrophe prevention

6 Feb 2024 1:38 UTC
51 points
2 comments · 7 min read · LW link

A quick investigation of AI pro-AI bias

Fabien Roger · 19 Jan 2024 23:26 UTC
55 points
1 comment · 2 min read · LW link

Measurement tampering detection as a special case of weak-to-strong generalization

23 Dec 2023 0:05 UTC
57 points
10 comments · 4 min read · LW link

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

16 Dec 2023 5:49 UTC
76 points
4 comments · 6 min read · LW link · 1 review

AI Control: Improving Safety Despite Intentional Subversion

13 Dec 2023 15:51 UTC
235 points
24 comments · 10 min read · LW link · 4 reviews

Auditing failures vs concentrated failures

11 Dec 2023 2:47 UTC
44 points
1 comment · 7 min read · LW link · 1 review

Some negative steganography results

Fabien Roger · 9 Dec 2023 20:22 UTC
59 points
5 comments · 2 min read · LW link