Instruction-following AGI is easier and more likely than value aligned AGI

Seth Herd · 15 May 2024 19:38 UTC
33 points
14 comments · 12 min read · LW link

Transcoders enable fine-grained interpretable circuit analysis for language models

30 Apr 2024 17:58 UTC
59 points
14 comments · 17 min read · LW link

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

Joar Skalse · 17 May 2024 19:13 UTC
37 points
1 comment · 2 min read · LW link

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

17 May 2024 16:25 UTC
38 points
1 comment · 4 min read · LW link
(publications.apolloresearch.ai)

Sparse Autoencoders Work on Attention Layer Outputs

16 Jan 2024 0:26 UTC
82 points
9 comments · 18 min read · LW link

Davidad’s Bold Plan for Alignment: An In-Depth Explanation

19 Apr 2023 16:09 UTC
154 points
31 comments · 21 min read · LW link

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai · 16 Apr 2024 21:16 UTC
378 points
90 comments · 12 min read · LW link

Revealing Intentionality In Language Models Through AdaVAE Guided Sampling

jdp · 20 Oct 2023 7:32 UTC
119 points
15 comments · 22 min read · LW link

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
119 points
17 comments · 1 min read · LW link
(www.anthropic.com)

Linear infra-Bayesian Bandits

Vanessa Kosoy · 10 May 2024 6:41 UTC
38 points
5 comments · 1 min read · LW link
(arxiv.org)

Towards a formalization of the agent structure problem

Alex_Altair · 29 Apr 2024 20:28 UTC
52 points
4 comments · 14 min read · LW link

The “no sandbagging on checkable tasks” hypothesis

Joe Carlsmith · 31 Jul 2023 23:06 UTC
51 points
13 comments · 9 min read · LW link

AI Safety Strategies Landscape

Charbel-Raphaël · 9 May 2024 17:33 UTC
29 points
1 comment · 42 min read · LW link

There are no coherence theorems

20 Feb 2023 21:25 UTC
121 points
115 comments · 19 min read · LW link

An Introduction to AI Sandbagging

26 Apr 2024 13:40 UTC
41 points
5 comments · 8 min read · LW link

How to train your own “Sleeper Agents”

evhub · 7 Feb 2024 0:31 UTC
91 points
10 comments · 2 min read · LW link

Towards Developmental Interpretability

12 Jul 2023 19:33 UTC
173 points
9 comments · 9 min read · LW link

AISC9 has ended and there will be an AISC10

Linda Linsefors · 29 Apr 2024 10:53 UTC
62 points
4 comments · 2 min read · LW link

Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns

stuhlmueller · 21 Jul 2020 20:06 UTC
82 points
41 comments · 3 min read · LW link

Fixing The Good Regulator Theorem

johnswentworth · 9 Feb 2021 20:30 UTC
136 points
38 comments · 8 min read · LW link · 1 review