
Alignment Jam

Tag. Last edit: 16 May 2023 8:25 UTC by Esben Kran

This tag lists the posts that have come out of the Alignment Jam hackathons.

Computational Mechanics Hackathon (June 1 & 2)

Adam Shai · 24 May 2024 22:18 UTC
34 points
5 comments · 1 min read · LW link

Towards AI Safety Infrastructure: Talk & Outline

Paul Bricman · 7 Jan 2024 9:31 UTC
11 points
0 comments · 2 min read · LW link
(www.youtube.com)

Finding Deception in Language Models

20 Aug 2024 9:42 UTC
18 points
4 comments · 4 min read · LW link

Demonstrate and evaluate risks from AI to society at the AI x Democracy research hackathon

Esben Kran · 19 Apr 2024 14:46 UTC
5 points
0 comments · 1 min read · LW link
(www.apartresearch.com)

Results from the AI testing hackathon

Esben Kran · 2 Jan 2023 15:46 UTC
13 points
0 comments · 1 min read · LW link

Superposition and Dropout

Edoardo Pona · 16 May 2023 7:24 UTC
21 points
5 comments · 6 min read · LW link

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1

9 May 2023 19:41 UTC
119 points
1 comment · 10 min read · LW link

Results from the interpretability hackathon

17 Nov 2022 14:51 UTC
81 points
0 comments · 6 min read · LW link
(alignmentjam.com)

Identifying semantic neurons, mechanistic circuits & interpretability web apps

13 Apr 2023 11:59 UTC
18 points
0 comments · 8 min read · LW link

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2

25 May 2023 15:37 UTC
71 points
1 comment · 13 min read · LW link

We Found An Neuron in GPT-2

11 Feb 2023 18:27 UTC
143 points
23 comments · 7 min read · LW link
(clementneo.com)

How-to Transformer Mechanistic Interpretability—in 50 lines of code or less!

StefanHex · 24 Jan 2023 18:45 UTC
47 points
5 comments · 13 min read · LW link

Robustness of Model-Graded Evaluations and Automated Interpretability

15 Jul 2023 19:12 UTC
47 points
5 comments · 9 min read · LW link