
Alignment Research Center (ARC)


The Alignment Research Center (ARC) is a non-profit research organization whose mission is to align future machine learning systems with human interests. Its current work focuses on developing an alignment strategy that could be adopted in industry today while scaling gracefully to future ML systems. Its researchers are Paul Christiano, Mark Xu, and Jacob Hilton, and Kyle Scott handles operations.

ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks

Beth Barnes · Aug 1, 2023, 6:30 PM
153 points
12 comments · 5 min read · LW link
(evals.alignment.org)

Obstacles in ARC’s agenda: Finding explanations

David Matolcsi · Apr 30, 2025, 11:03 PM
105 points
8 comments · 17 min read · LW link

Obstacles in ARC’s agenda: Mechanistic Anomaly Detection

David Matolcsi · May 1, 2025, 8:51 PM
39 points
1 comment · 11 min read · LW link

Obstacles in ARC’s agenda: Low Probability Estimation

David Matolcsi · May 2, 2025, 7:38 PM
39 points
0 comments · 6 min read · LW link

Estimating Tail Risk in Neural Networks

Mark Xu · Sep 13, 2024, 8:00 PM
68 points
9 comments · 23 min read · LW link
(www.alignment.org)

Steelmanning heuristic arguments

Dmitry Vaintrob · Apr 13, 2025, 1:09 AM
72 points
0 comments · 17 min read · LW link

Low Probability Estimation in Language Models

Gabriel Wu · Oct 18, 2024, 3:50 PM
50 points
0 comments · 10 min read · LW link
(www.alignment.org)

A bird’s eye view of ARC’s research

Jacob_Hilton · Oct 23, 2024, 3:50 PM
121 points
12 comments · 7 min read · LW link
(www.alignment.org)

Paul Christiano on Dwarkesh Podcast

ESRogs · Nov 3, 2023, 10:13 PM
19 points
0 comments · 1 min read · LW link
(www.dwarkeshpatel.com)

Prizes for matrix completion problems

paulfchristiano · May 3, 2023, 11:30 PM
164 points
52 comments · 1 min read · LW link
(www.alignment.org)

More information about the dangerous capability evaluations we did with GPT-4 and Claude.

Beth Barnes · Mar 19, 2023, 12:25 AM
233 points
54 comments · 8 min read · LW link
(evals.alignment.org)

ARC’s first technical report: Eliciting Latent Knowledge

Dec 14, 2021, 8:09 PM
228 points
90 comments · 1 min read · LW link · 3 reviews
(docs.google.com)

ARC is hiring theoretical researchers

Jun 12, 2023, 6:50 PM
126 points
12 comments · 4 min read · LW link
(www.alignment.org)

AXRP Episode 23 - Mechanistic Anomaly Detection with Mark Xu

DanielFilan · Jul 27, 2023, 1:50 AM
22 points
0 comments · 72 min read · LW link

ARC paper: Formalizing the presumption of independence

Erik Jenner · Nov 20, 2022, 1:22 AM
97 points
2 comments · 2 min read · LW link
(arxiv.org)

[Question] How is ARC planning to use ELK?

jacquesthibs · Dec 15, 2022, 8:11 PM
24 points
5 comments · 1 min read · LW link

ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so

Christopher King · Mar 15, 2023, 12:29 AM
116 points
22 comments · 2 min read · LW link

Counterexamples to some ELK proposals

paulfchristiano · Dec 31, 2021, 5:05 PM
53 points
10 comments · 7 min read · LW link

Experimentally evaluating whether honesty generalizes

paulfchristiano · Jul 1, 2021, 5:47 PM
103 points
24 comments · 9 min read · LW link · 1 review

[Question] Why is there an alignment problem?

InfiniteLight · Dec 22, 2023, 6:19 AM
1 point
0 comments · 1 min read · LW link

Evaluations project @ ARC is hiring a researcher and a webdev/engineer

Beth Barnes · Sep 9, 2022, 10:46 PM
99 points
7 comments · 10 min read · LW link

ELK prize results

Mar 9, 2022, 12:01 AM
138 points
50 comments · 21 min read · LW link

The Alignment Problems

Martín Soto · Jan 12, 2023, 10:29 PM
20 points
0 comments · 4 min read · LW link

The Goal Misgeneralization Problem

Myspy · May 18, 2023, 11:40 PM
1 point
0 comments · 1 min read · LW link
(drive.google.com)

Christiano (ARC) and GA (Conjecture) Discuss Alignment Cruxes

Feb 24, 2023, 11:03 PM
61 points
7 comments · 47 min read · LW link

Concrete Methods for Heuristic Estimation on Neural Networks

Oliver Daniels · Nov 14, 2024, 5:07 AM
28 points
0 comments · 27 min read · LW link

ARC is hiring!

Dec 14, 2021, 8:09 PM
64 points
2 comments · 1 min read · LW link