Jozdien

Karma: 1,611

BIG-Bench Canary Contamination in GPT-4

Jozdien, 22 Oct 2024 15:40 UTC
123 points
13 comments, 4 min read

Gradient Descent on the Human Brain

1 Apr 2024 22:39 UTC
52 points
5 comments, 2 min read

Difficulty classes for alignment properties

Jozdien, 20 Feb 2024 9:08 UTC
34 points
5 comments, 2 min read

The Pointer Resolution Problem

Jozdien, 16 Feb 2024 21:25 UTC
41 points
20 comments, 3 min read

Critiques of the AI control agenda

Jozdien, 14 Feb 2024 19:25 UTC
47 points
14 comments, 9 min read

The case for more ambitious language model evals

Jozdien, 30 Jan 2024 0:01 UTC
110 points
30 comments, 5 min read

Thoughts On (Solving) Deep Deception

Jozdien, 21 Oct 2023 22:40 UTC
69 points
2 comments, 6 min read

High-level interpretability: detecting an AI’s objectives

28 Sep 2023 19:30 UTC
69 points
4 comments, 21 min read

The Compleat Cybornaut

19 May 2023 8:44 UTC
64 points
2 comments, 16 min read

AI Safety via Luck

Jozdien, 1 Apr 2023 20:13 UTC
81 points
7 comments, 11 min read

Gradient Filtering

18 Jan 2023 20:09 UTC
54 points
16 comments, 13 min read

[ASoT] Simulators show us behavioural properties by default

Jozdien, 13 Jan 2023 18:42 UTC
35 points
3 comments, 3 min read

Trying to isolate objectives: approaches toward high-level interpretability

Jozdien, 9 Jan 2023 18:33 UTC
48 points
14 comments, 8 min read

[ASoT] Finetuning, RL, and GPT’s world prior

Jozdien, 2 Dec 2022 16:33 UTC
44 points
8 comments, 5 min read

Conditioning Generative Models for Alignment

Jozdien, 18 Jul 2022 7:11 UTC
59 points
8 comments, 20 min read

Gaming Incentives

Jozdien, 29 Jul 2021 13:51 UTC
10 points
4 comments, 6 min read

Insufficient Values

16 Jun 2021 14:33 UTC
31 points
16 comments, 6 min read

Utopic Nightmares

Jozdien, 14 May 2021 21:24 UTC
10 points
20 comments, 5 min read