jacek
Karma: 191
Characterizing stable regions in the residual stream of LLMs
Jett Janiak, jacek, Chatrik, Giorgi Giglemiani, nlpet and StefanHex
26 Sep 2024 13:44 UTC, 38 points, 4 comments, 1 min read, LW link (arxiv.org)
Goodhart’s Law in Reinforcement Learning
jacek, Joar Skalse, OliverHayman, charlie_griffin and Xingjian Bai
16 Oct 2023 0:54 UTC, 126 points, 22 comments, 7 min read, LW link
A warm-up for the AI governance project
jacek
17 Feb 2023 18:06 UTC, 10 points, 2 comments, 3 min read, LW link
Categorical-measure-theoretic approach to optimal policies tending to seek power
jacek
12 Jan 2023 0:32 UTC, 31 points, 3 comments, 6 min read, LW link