ojorgensen

Karma: 198

AI Safety Researcher. My website is here.

Understanding Counterbalanced Subtractions for Better Activation Additions

ojorgensen · Aug 17, 2023, 1:53 PM
21 points
0 comments · 14 min read

Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic

ojorgensen · Jul 28, 2023, 7:43 PM
13 points
3 comments · 13 min read

UK Foundation Model Task Force - Expression of Interest

ojorgensen · Jun 18, 2023, 9:43 AM
64 points
2 comments · 1 min read
(twitter.com)

ojorgensen’s Shortform

ojorgensen · May 4, 2023, 1:51 PM
2 points
1 comment

(Extremely) Naive Gradient Hacking Doesn’t Work

ojorgensen · Dec 20, 2022, 2:35 PM
17 points
0 comments · 6 min read

[Question] Which Issues in Conceptual Alignment have been Formalised or Observed (or not)?

ojorgensen · Nov 1, 2022, 10:32 PM
4 points
0 comments · 1 min read

Strange Loops - Self-Reference from Number Theory to AI

ojorgensen · Sep 28, 2022, 2:10 PM
19 points
6 comments · 18 min read

Evaluating OpenAI’s alignment plans using training stories

ojorgensen · Aug 25, 2022, 4:12 PM
4 points
0 comments · 5 min read

Disagreements about Alignment: Why, and how, we should try to solve them

ojorgensen · Aug 9, 2022, 6:49 PM
11 points
2 comments · 16 min read