Nina Panickssery

Karma: 1,595

https://ninapanickssery.com/

https://ninapanickssery.substack.com/

Investigating the Ability of LLMs to Recognize Their Own Writing

30 Jul 2024 15:41 UTC
32 points
0 comments
15 min read
LW link

Jailbreak steering generalization

20 Jun 2024 17:25 UTC
41 points
4 comments
2 min read
LW link
(arxiv.org)

Soviet comedy film recommendations

Nina Panickssery
9 Jun 2024 23:40 UTC
42 points
11 comments
2 min read
LW link
(open.substack.com)

Steering Llama-2 with contrastive activation additions

2 Jan 2024 0:47 UTC
123 points
29 comments
8 min read
LW link
(arxiv.org)

Comparing representation vectors between llama 2 base and chat

Nina Panickssery
28 Oct 2023 22:54 UTC
36 points
5 comments
2 min read
LW link

Investigating the learning coefficient of modular addition: hackathon project

17 Oct 2023 19:51 UTC
94 points
5 comments
12 min read
LW link

Influence functions - why, what and how

Nina Panickssery
15 Sep 2023 20:42 UTC
70 points
6 comments
8 min read
LW link

Red-teaming language models via activation engineering

Nina Panickssery
26 Aug 2023 5:52 UTC
69 points
6 comments
9 min read
LW link

The Low-Hanging Fruit Prior and sloped valleys in the loss landscape

23 Aug 2023 21:12 UTC
82 points
1 comment
13 min read
LW link

Understanding and visualizing sycophancy datasets

Nina Panickssery
16 Aug 2023 5:34 UTC
45 points
0 comments
6 min read
LW link

Decomposing independent generalizations in neural networks via Hessian analysis

14 Aug 2023 17:04 UTC
83 points
4 comments
1 min read
LW link

Recipe: Hessian eigenvector computation for PyTorch models

Nina Panickssery
14 Aug 2023 2:48 UTC
31 points
5 comments
5 min read
LW link

Modulating sycophancy in an RLHF model via activation steering

Nina Panickssery
9 Aug 2023 7:06 UTC
69 points
20 comments
12 min read
LW link

Reducing sycophancy and improving honesty via activation steering

Nina Panickssery
28 Jul 2023 2:46 UTC
122 points
17 comments
9 min read
LW link

Decoding intermediate activations in llama-2-7b

Nina Panickssery
21 Jul 2023 5:35 UTC
37 points
3 comments
4 min read
LW link

Activation adding experiments with llama-7b

Nina Panickssery
16 Jul 2023 4:17 UTC
51 points
1 comment
3 min read
LW link

Activation adding experiments with FLAN-T5

Nina Panickssery
13 Jul 2023 23:32 UTC
21 points
5 comments
7 min read
LW link