Nina Panickssery

Karma: 1,595

https://ninapanickssery.com/

https://ninapanickssery.substack.com/

Investigating the Ability of LLMs to Recognize Their Own Writing

Christopher Ackerman and Nina Panickssery

30 Jul 2024 15:41 UTC

32 points

0 comments

15 min read

Jailbreak steering generalization

Sarah Ball and Nina Panickssery

20 Jun 2024 17:25 UTC

41 points

4 comments

2 min read

(arxiv.org)

Soviet comedy film recommendations

Nina Panickssery

9 Jun 2024 23:40 UTC

42 points

11 comments

2 min read

(open.substack.com)

Steering Llama-2 with contrastive activation additions

Nina Panickssery, Wuschel Schulz, NickGabs, Meg, evhub and TurnTrout

2 Jan 2024 0:47 UTC

123 points

29 comments

8 min read

(arxiv.org)

Comparing representation vectors between llama 2 base and chat

Nina Panickssery

28 Oct 2023 22:54 UTC

36 points

5 comments

2 min read

Investigating the learning coefficient of modular addition: hackathon project

Nina Panickssery and Dmitry Vaintrob

17 Oct 2023 19:51 UTC

94 points

5 comments

12 min read

Influence functions—why, what and how

Nina Panickssery

15 Sep 2023 20:42 UTC

70 points

6 comments

8 min read

Red-teaming language models via activation engineering

Nina Panickssery

26 Aug 2023 5:52 UTC

69 points

6 comments

9 min read

The Low-Hanging Fruit Prior and sloped valleys in the loss landscape

Dmitry Vaintrob and Nina Panickssery

23 Aug 2023 21:12 UTC

82 points

1 comment

13 min read

Understanding and visualizing sycophancy datasets

Nina Panickssery

16 Aug 2023 5:34 UTC

45 points

0 comments

6 min read

Decomposing independent generalizations in neural networks via Hessian analysis

Dmitry Vaintrob and Nina Panickssery

14 Aug 2023 17:04 UTC

83 points

4 comments

1 min read

Recipe: Hessian eigenvector computation for PyTorch models

Nina Panickssery

14 Aug 2023 2:48 UTC

31 points

5 comments

5 min read

Modulating sycophancy in an RLHF model via activation steering

Nina Panickssery

9 Aug 2023 7:06 UTC

69 points

20 comments

12 min read

Reducing sycophancy and improving honesty via activation steering

Nina Panickssery

28 Jul 2023 2:46 UTC

122 points

17 comments

9 min read

Decoding intermediate activations in llama-2-7b

Nina Panickssery

21 Jul 2023 5:35 UTC

37 points

3 comments

4 min read

Activation adding experiments with llama-7b

Nina Panickssery

16 Jul 2023 4:17 UTC

51 points

1 comment

3 min read

Activation adding experiments with FLAN-T5

Nina Panickssery

13 Jul 2023 23:32 UTC

21 points

5 comments

7 min read