
Beth Barnes

Karma: 3,027

Alignment researcher. Views are my own and not those of my employer. https://www.barnes.page/

Clarifying METR’s Auditing Role

Beth Barnes · May 30, 2024, 6:41 PM
108 points
1 comment · 2 min read · LW link

Introducing METR’s Autonomy Evaluation Resources

Mar 15, 2024, 11:16 PM
90 points
0 comments · 1 min read · LW link
(metr.github.io)

METR is hiring!

Beth Barnes · Dec 26, 2023, 9:00 PM
65 points
1 comment · 1 min read · LW link

Bounty: Diverse hard tasks for LLM agents

Dec 17, 2023, 1:04 AM
49 points
31 comments · 16 min read · LW link

Send us example gnarly bugs

Dec 10, 2023, 5:23 AM
77 points
10 comments · 2 min read · LW link

Managing risks of our own work

Beth Barnes · Aug 18, 2023, 12:41 AM
66 points
0 comments · 2 min read · LW link

ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks

Beth Barnes · Aug 1, 2023, 6:30 PM
153 points
12 comments · 5 min read · LW link
(evals.alignment.org)

More information about the dangerous capability evaluations we did with GPT-4 and Claude.

Beth Barnes · Mar 19, 2023, 12:25 AM
233 points
54 comments · 8 min read · LW link
(evals.alignment.org)

Reflection Mechanisms as an Alignment Target—Attitudes on “near-term” AI

Mar 2, 2023, 4:29 AM
21 points
0 comments · 8 min read · LW link

‘simulator’ framing and confusions about LLMs

Beth Barnes · Dec 31, 2022, 11:38 PM
104 points
11 comments · 4 min read · LW link

Reflection Mechanisms as an Alignment target: A follow-up survey

Oct 5, 2022, 2:03 PM
15 points
2 comments · 7 min read · LW link

Evaluations project @ ARC is hiring a researcher and a webdev/engineer

Beth Barnes · Sep 9, 2022, 10:46 PM
99 points
7 comments · 10 min read · LW link

Help ARC evaluate capabilities of current language models (still need people)

Beth Barnes · Jul 19, 2022, 4:55 AM
95 points
6 comments · 2 min read · LW link

Reflection Mechanisms as an Alignment target: A survey

Jun 22, 2022, 3:05 PM
32 points
1 comment · 14 min read · LW link

Another list of theories of impact for interpretability

Beth Barnes · Apr 13, 2022, 1:29 PM
33 points
1 comment · 5 min read · LW link

Reverse-engineering using interpretability

Beth Barnes · Dec 29, 2021, 11:21 PM
21 points
2 comments · 5 min read · LW link

Risks from AI persuasion

Beth Barnes · Dec 24, 2021, 1:48 AM
76 points
15 comments · 31 min read · LW link

Some thoughts on why adversarial training might be useful

Beth Barnes · Dec 8, 2021, 1:28 AM
9 points
6 comments · 3 min read · LW link

Considerations on interaction between AI and expected value of the future

Beth Barnes · Dec 7, 2021, 2:46 AM
68 points
28 comments · 4 min read · LW link

More detailed proposal for measuring alignment of current models

Beth Barnes · Nov 20, 2021, 12:03 AM
31 points
0 comments · 8 min read · LW link