nlpet
Karma: 75
Latent Adversarial Training (LAT) Improves the Representation of Refusal
alexandraabbas, nlpet and hal2k
6 Jan 2025 10:24 UTC, 20 points, 6 comments, 10 min read
Characterizing stable regions in the residual stream of LLMs
Jett Janiak, jacek, Chatrik, Giorgi Giglemiani, nlpet and StefanHex
26 Sep 2024 13:44 UTC, 42 points, 4 comments, 1 min read (arxiv.org)
Evaluating Synthetic Activations composed of SAE Latents in GPT-2
Giorgi Giglemiani, nlpet, Chatrik, Jett Janiak and StefanHex
25 Sep 2024 20:37 UTC, 29 points, 0 comments, 3 min read (arxiv.org)