Jan Betley

Karma: 847

OpenAI Responses API changes models’ behavior

Apr 11, 2025, 1:27 PM
52 points
6 comments, 2 min read, LW link

[Question] Are there any (semi-)detailed future scenarios where we win?

Jan Betley, Apr 7, 2025, 7:13 PM
15 points
3 comments, 1 min read, LW link

Jan Betley’s Shortform

Jan Betley, Mar 31, 2025, 2:02 PM
5 points
8 comments, LW link

Finding Emergent Misalignment

Jan Betley, Mar 26, 2025, 5:33 PM
19 points
0 comments, 3 min read, LW link

Open problems in emergent misalignment

Mar 1, 2025, 9:47 AM
80 points
13 comments, 7 min read, LW link

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Feb 25, 2025, 5:39 PM
328 points
91 comments, 4 min read, LW link

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

Jul 8, 2024, 10:24 PM
109 points
37 comments, 5 min read, LW link

Self-shutdown AI

Jan Betley, Aug 21, 2023, 4:48 PM
13 points
2 comments, 2 min read, LW link

Localizing goal misgeneralization in a maze-solving policy network

Jan Betley, Jul 6, 2023, 4:21 PM
37 points
2 comments, 7 min read, LW link

[Question] Reverse engineering of the simulation

Jan Betley, Feb 7, 2022, 9:36 PM
1 point
2 comments, 1 min read, LW link

[Question] What do we *really* expect from a well-aligned AI?

Jan Betley, Jan 4, 2021, 8:57 PM
13 points
10 comments, 1 min read, LW link