RSS

AI Evaluations

TagLast edit: Aug 1, 2023, 1:03 AM by duck_master

AI Evaluations focus on experimentally assessing the capabilities, safety, and alignment of advanced AI systems. These evaluations can be divided into two main categories: behavioral and understanding-based.

(note: initially written by GPT4, may contain errors despite a human review. Please correct them if you see them)

Behavioral evaluations assess a model’s abilities in various tasks, such as autonomously replicating, acquiring resources, and avoiding being shut down. However, a concern with these evaluations is that they may not be sufficient to detect deceptive alignment, making it difficult to ensure that models are non-deceptive.

Understanding-based evaluations, on the other hand, evaluate a developer’s ability to understand the model they have created and why they have obtained the model. This approach can be more useful in terms of safety, as it focuses on understanding the model’s behavior instead of just checking the behavior itself. Coupling understanding-based evaluations with behavioral evaluations can lead to a more comprehensive assessment of AI safety and alignment.

Current challenges in AI evaluations include:

(this text was initially written by GPT4, taking in as input A very crude deception eval is already passed, ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so, and Towards understanding-based safety evaluations)

See also:

How evals might (or might not) pre­vent catas­trophic risks from AI

Orpheus16Feb 7, 2023, 8:16 PM
45 points
0 comments9 min readLW link

The case for more am­bi­tious lan­guage model evals

JozdienJan 30, 2024, 12:01 AM
117 points
30 comments5 min readLW link

When can we trust model eval­u­a­tions?

evhubJul 28, 2023, 7:42 PM
165 points
10 comments10 min readLW link1 review

Gam­ing Truth­fulQA: Sim­ple Heuris­tics Ex­posed Dataset Weaknesses

TurnTroutJan 16, 2025, 2:14 AM
64 points
3 comments1 min readLW link
(turntrout.com)

Thoughts on shar­ing in­for­ma­tion about lan­guage model capabilities

paulfchristianoJul 31, 2023, 4:04 PM
210 points
44 comments11 min readLW link1 review

Towards un­der­stand­ing-based safety evaluations

evhubMar 15, 2023, 6:18 PM
164 points
16 comments5 min readLW link

An­nounc­ing Apollo Research

May 30, 2023, 4:17 PM
217 points
11 comments8 min readLW link

OMMC An­nounces RIP

Apr 1, 2024, 11:20 PM
189 points
5 comments2 min readLW link

Scal­able And Trans­fer­able Black-Box Jailbreaks For Lan­guage Models Via Per­sona Modulation

Nov 7, 2023, 5:59 PM
38 points
2 comments2 min readLW link
(arxiv.org)

How good are LLMs at do­ing ML on an un­known dataset?

Håvard Tveit IhleJul 1, 2024, 9:04 AM
33 points
4 comments13 min readLW link

Model Or­ganisms of Misal­ign­ment: The Case for a New Pillar of Align­ment Research

Aug 8, 2023, 1:30 AM
318 points
30 comments18 min readLW link1 review

Deep­Mind: Model eval­u­a­tion for ex­treme risks

Zach Stein-PerlmanMay 25, 2023, 3:00 AM
94 points
12 comments1 min readLW link1 review
(arxiv.org)

In­ves­ti­gat­ing the Abil­ity of LLMs to Rec­og­nize Their Own Writing

Jul 30, 2024, 3:41 PM
32 points
0 comments15 min readLW link

Can Gen­er­al­ized Ad­ver­sar­ial Test­ing En­able More Ri­gor­ous LLM Safety Evals?

scasperJul 30, 2024, 2:57 PM
25 points
0 comments4 min readLW link

Twit­ter thread on AI safety evals

Richard_NgoJul 31, 2024, 12:18 AM
63 points
3 comments2 min readLW link
(x.com)

GPT-4o Sys­tem Card

Zach Stein-PerlmanAug 8, 2024, 8:30 PM
68 points
11 comments2 min readLW link
(openai.com)

An is­sue with train­ing schemers with su­per­vised fine-tuning

Fabien RogerJun 27, 2024, 3:37 PM
49 points
12 comments6 min readLW link

≤10-year Timelines Re­main Un­likely De­spite Deep­Seek and o3

Rafael HarthFeb 13, 2025, 7:21 PM
52 points
52 comments15 min readLW link

Model evals for dan­ger­ous capabilities

Zach Stein-PerlmanSep 23, 2024, 11:00 AM
51 points
11 comments3 min readLW link

[Question] Can GPT-4 play 20 ques­tions against an­other in­stance of it­self?

Nathan Helm-BurgerMar 28, 2023, 1:11 AM
15 points
1 comment1 min readLW link
(evanthebouncy.medium.com)

A starter guide for evals

Jan 8, 2024, 6:24 PM
53 points
2 comments12 min readLW link
(www.apolloresearch.ai)

Re­spon­si­ble De­ploy­ment in 20XX

CarsonApr 20, 2023, 12:24 AM
4 points
0 comments4 min readLW link

Refram­ing the bur­den of proof: Com­pa­nies should prove that mod­els are safe (rather than ex­pect­ing au­di­tors to prove that mod­els are dan­ger­ous)

Orpheus16Apr 25, 2023, 6:49 PM
27 points
11 comments3 min readLW link
(childrenoficarus.substack.com)

OpenAI: Pre­pared­ness framework

Zach Stein-PerlmanDec 18, 2023, 6:30 PM
70 points
23 comments4 min readLW link
(openai.com)

Prevent­ing Lan­guage Models from hid­ing their reasoning

Oct 31, 2023, 2:34 PM
119 points
15 comments12 min readLW link1 review

New Ca­pa­bil­ities, New Risks? - Eval­u­at­ing Agen­tic Gen­eral As­sis­tants us­ing Ele­ments of GAIA & METR Frameworks

Tej LanderSep 29, 2024, 6:58 PM
5 points
0 comments29 min readLW link

Fron­tier Models are Ca­pable of In-con­text Scheming

Dec 5, 2024, 10:11 PM
203 points
24 comments7 min readLW link

Bi­as­ing VLM Re­sponse with Vi­sual Stimuli

Jaehyuk LimOct 3, 2024, 6:04 PM
5 points
0 comments8 min readLW link

An Opinionated Evals Read­ing List

Oct 15, 2024, 2:38 PM
65 points
0 comments13 min readLW link
(www.apolloresearch.ai)

Mechanis­ti­cally Elic­it­ing La­tent Be­hav­iors in Lan­guage Models

Apr 30, 2024, 6:51 PM
207 points
43 comments45 min readLW link

Soft Prompts for Eval­u­a­tion: Mea­sur­ing Con­di­tional Dis­tance of Capabilities

porbyFeb 2, 2024, 5:49 AM
47 points
1 comment4 min readLW link
(1drv.ms)

Bounty: Di­verse hard tasks for LLM agents

Dec 17, 2023, 1:04 AM
49 points
31 comments16 min readLW link

Apollo Re­search 1-year update

May 29, 2024, 5:44 PM
93 points
0 comments7 min readLW link

UK AISI: Early les­sons from eval­u­at­ing fron­tier AI systems

Zach Stein-PerlmanOct 25, 2024, 7:00 PM
26 points
0 comments2 min readLW link
(www.aisi.gov.uk)

An­nounc­ing Hu­man-al­igned AI Sum­mer School

May 22, 2024, 8:55 AM
50 points
0 comments1 min readLW link
(humanaligned.ai)

The Evals Gap

Marius HobbhahnNov 11, 2024, 4:42 PM
55 points
7 comments7 min readLW link
(www.apolloresearch.ai)

Which evals re­sources would be good?

Marius HobbhahnNov 16, 2024, 2:24 PM
51 points
4 comments5 min readLW link

Clar­ify­ing METR’s Au­dit­ing Role

Beth BarnesMay 30, 2024, 6:41 PM
108 points
1 comment2 min readLW link

Pro­to­col eval­u­a­tions: good analo­gies vs control

Fabien RogerFeb 19, 2024, 6:00 PM
42 points
10 comments11 min readLW link

Eval­u­at­ing strate­gic rea­son­ing in GPT models

phelps-sgMay 25, 2023, 11:51 AM
4 points
1 comment8 min readLW link

Self-Aware­ness: Tax­on­omy and eval suite proposal

Daniel KokotajloFeb 17, 2024, 1:47 AM
63 points
2 comments11 min readLW link

AXRP Epi­sode 38.8 - David Du­ve­naud on Sab­o­tage Eval­u­a­tions and the Post-AGI Future

DanielFilanMar 1, 2025, 1:20 AM
13 points
0 comments13 min readLW link

AI com­pa­nies aren’t re­ally us­ing ex­ter­nal evaluators

Zach Stein-PerlmanMay 24, 2024, 4:01 PM
242 points
15 comments4 min readLW link

What’s the short timeline plan?

Marius HobbhahnJan 2, 2025, 2:59 PM
347 points
49 comments23 min readLW link

Ideas for bench­mark­ing LLM creativity

gwernDec 16, 2024, 5:18 AM
60 points
11 comments1 min readLW link
(gwern.net)

New, im­proved mul­ti­ple-choice TruthfulQA

Jan 15, 2025, 11:32 PM
72 points
0 comments3 min readLW link

ARC Evals new re­port: Eval­u­at­ing Lan­guage-Model Agents on Real­is­tic Au­tonomous Tasks

Beth BarnesAug 1, 2023, 6:30 PM
153 points
12 comments5 min readLW link
(evals.alignment.org)

Apollo Re­search is hiring evals and in­ter­pretabil­ity en­g­ineers & scientists

Marius HobbhahnAug 4, 2023, 10:54 AM
25 points
0 comments2 min readLW link

Au­tonomous repli­ca­tion and adap­ta­tion: an at­tempt at a con­crete dan­ger threshold

Hjalmar_WijkAug 17, 2023, 1:31 AM
45 points
0 comments13 min readLW link

Manag­ing risks of our own work

Beth BarnesAug 18, 2023, 12:41 AM
66 points
0 comments2 min readLW link

The Leeroy Jenk­ins prin­ci­ple: How faulty AI could guaran­tee “warn­ing shots”

titotalJan 14, 2024, 3:03 PM
48 points
6 comments1 min readLW link
(titotal.substack.com)

Deep­Mind: Eval­u­at­ing Fron­tier Models for Danger­ous Capabilities

Zach Stein-PerlmanMar 21, 2024, 3:00 AM
61 points
8 comments1 min readLW link
(arxiv.org)

[Question] Would more model evals teams be good?

Ryan KiddFeb 25, 2023, 10:01 PM
20 points
4 comments1 min readLW link

A very crude de­cep­tion eval is already passed

Beth BarnesOct 29, 2021, 5:57 PM
108 points
6 comments2 min readLW link

ARC tests to see if GPT-4 can es­cape hu­man con­trol; GPT-4 failed to do so

Christopher KingMar 15, 2023, 12:29 AM
116 points
22 comments2 min readLW link

More in­for­ma­tion about the dan­ger­ous ca­pa­bil­ity eval­u­a­tions we did with GPT-4 and Claude.

Beth BarnesMar 19, 2023, 12:25 AM
233 points
54 comments8 min readLW link
(evals.alignment.org)

Send us ex­am­ple gnarly bugs

Dec 10, 2023, 5:23 AM
77 points
10 comments2 min readLW link

BIG-Bench Ca­nary Con­tam­i­na­tion in GPT-4

JozdienOct 22, 2024, 3:40 PM
123 points
14 comments4 min readLW link

METR is hiring!

Beth BarnesDec 26, 2023, 9:00 PM
65 points
1 comment1 min readLW link

Com­par­ing Quan­tized Perfor­mance in Llama Models

NickyPJul 15, 2024, 4:01 PM
33 points
2 comments8 min readLW link

Claude Son­net 3.7 (of­ten) knows when it’s in al­ign­ment evaluations

Mar 17, 2025, 7:11 PM
176 points
7 comments6 min readLW link

Third-party test­ing as a key in­gre­di­ent of AI policy

Zac Hatfield-DoddsMar 25, 2024, 10:40 PM
11 points
1 comment12 min readLW link
(www.anthropic.com)

[In­terim re­search re­port] Eval­u­at­ing the Goal-Direct­ed­ness of Lan­guage Models

Jul 18, 2024, 6:19 PM
39 points
4 comments11 min readLW link

We need a Science of Evals

Jan 22, 2024, 8:30 PM
71 points
13 comments9 min readLW link

100+ con­crete pro­jects and open prob­lems in evals

Marius HobbhahnMar 22, 2025, 3:21 PM
71 points
1 comment1 min readLW link

Run evals on base mod­els too!

orthonormalApr 4, 2024, 6:43 PM
48 points
6 comments1 min readLW link

AXRP Epi­sode 34 - AI Eval­u­a­tions with Beth Barnes

DanielFilanJul 28, 2024, 3:30 AM
23 points
0 comments69 min readLW link

“Suc­cess­ful lan­guage model evals” by Ja­son Wei

Arjun PanicksseryMay 25, 2024, 9:34 AM
7 points
0 comments1 min readLW link
(www.jasonwei.net)

Sub­ver­sion Strat­egy Eval: Can lan­guage mod­els state­lessly strate­gize to sub­vert con­trol pro­to­cols?

Mar 24, 2025, 5:55 PM
30 points
0 comments8 min readLW link

To CoT or not to CoT? Chain-of-thought helps mainly on math and sym­bolic reasoning

Bogdan Ionut CirsteaSep 19, 2024, 4:13 PM
21 points
1 comment1 min readLW link
(arxiv.org)

[Question] How far along Metr’s law can AI start au­tomat­ing or helping with al­ign­ment re­search?

Christopher KingMar 20, 2025, 3:58 PM
20 points
21 comments1 min readLW link

“Should AI Ques­tion Its Own De­ci­sions? A Thought Ex­per­i­ment”

CMDR WOTZFeb 4, 2025, 8:39 AM
1 point
0 comments1 min readLW link

AISN #47: Rea­son­ing Models

Feb 6, 2025, 6:52 PM
3 points
0 comments4 min readLW link
(newsletter.safe.ai)

Re­quest for pro­pos­als: im­prov­ing ca­pa­bil­ity evaluations

cbFeb 7, 2025, 6:51 PM
1 point
0 comments1 min readLW link
(www.openphilanthropy.org)

From No Mind to a Mind – A Con­ver­sa­tion That Changed an AI

parthibanarjuna sFeb 7, 2025, 11:50 AM
1 point
0 comments3 min readLW link

Two flaws in the Machi­avelli Benchmark

TheManxLoinerFeb 12, 2025, 7:34 PM
23 points
0 comments3 min readLW link

Ra­tional Utopia & Nar­row Way There: Mul­tiver­sal AI Align­ment, Non-Agen­tic Static Place AI, New Ethics… (V. 4)

ankFeb 11, 2025, 3:21 AM
13 points
8 comments35 min readLW link

Pro­posal on AI eval­u­a­tion: false-proving

ProgramCrafterMar 31, 2023, 12:12 PM
1 point
2 comments1 min readLW link

LM Si­tu­a­tional Aware­ness, Eval­u­a­tion Pro­posal: Vio­lat­ing Imitation

Jacob PfauApr 26, 2023, 10:53 PM
16 points
2 comments2 min readLW link

Tall Tales at Differ­ent Scales: Eval­u­at­ing Scal­ing Trends For De­cep­tion In Lan­guage Models

Nov 8, 2023, 11:37 AM
49 points
0 comments18 min readLW link

AI as a Cog­ni­tive De­coder: Re­think­ing In­tel­li­gence Evolution

Hu XunyiFeb 13, 2025, 3:51 PM
1 point
0 comments1 min readLW link

The­o­ries of Change for AI Auditing

Nov 13, 2023, 7:33 PM
54 points
0 comments18 min readLW link
(www.apolloresearch.ai)

A sim­ple treach­er­ous turn demonstration

Nikola JurkovicNov 25, 2023, 4:51 AM
22 points
5 comments3 min readLW link

A call for a quan­ti­ta­tive re­port card for AI bioter­ror­ism threat models

JunoDec 4, 2023, 6:35 AM
12 points
0 comments10 min readLW link

Pro­tect­ing against sud­den ca­pa­bil­ity jumps dur­ing training

Nikola JurkovicDec 2, 2023, 4:22 AM
15 points
2 comments2 min readLW link

The Method of Loci: With some brief re­marks, in­clud­ing trans­form­ers and eval­u­at­ing AIs

Bill BenzonDec 2, 2023, 2:36 PM
6 points
0 comments3 min readLW link

What’s new at FAR AI

Dec 4, 2023, 9:18 PM
41 points
0 comments5 min readLW link
(far.ai)

2023 Align­ment Re­search Up­dates from FAR AI

Dec 4, 2023, 10:32 PM
18 points
0 comments8 min readLW link
(far.ai)

Eval­u­at­ing Lan­guage Model Be­havi­ours for Shut­down Avoidance in Tex­tual Scenarios

May 16, 2023, 10:53 AM
26 points
0 comments13 min readLW link

Im­prov­ing the safety of AI evals

May 17, 2023, 10:24 PM
13 points
7 comments7 min readLW link

The Com­pleat Cybornaut

May 19, 2023, 8:44 AM
65 points
2 comments16 min readLW link

Seek­ing (Paid) Case Stud­ies on Standards

HoldenKarnofskyMay 26, 2023, 5:58 PM
69 points
9 comments11 min readLW link

[Question] AI Rights: In your view, what would be re­quired for an AGI to gain rights and pro­tec­tions from the var­i­ous Govern­ments of the World?

Super AGIJun 9, 2023, 1:24 AM
10 points
26 comments1 min readLW link

Challenge pro­posal: small­est pos­si­ble self-hard­en­ing back­door for RLHF

Christopher KingJun 29, 2023, 4:56 PM
7 points
0 comments2 min readLW link

Ro­bust­ness of Model-Graded Eval­u­a­tions and Au­to­mated Interpretability

Jul 15, 2023, 7:12 PM
47 points
5 comments9 min readLW link

The “spel­ling mir­a­cle”: GPT-3 spel­ling abil­ities and glitch to­kens revisited

mwatkinsJul 31, 2023, 7:47 PM
85 points
29 comments20 min readLW link

Eval­u­at­ing Su­per­hu­man Models with Con­sis­tency Checks

Aug 1, 2023, 7:51 AM
21 points
2 comments9 min readLW link
(arxiv.org)

Re­pro­duc­ing ARC Evals’ re­cent re­port on lan­guage model agents

Thomas BroadleySep 1, 2023, 4:52 PM
104 points
17 comments3 min readLW link
(thomasbroadley.com)

MMLU’s Mo­ral Sce­nar­ios Bench­mark Doesn’t Mea­sure What You Think it Measures

corey morrisSep 27, 2023, 5:54 PM
18 points
2 comments4 min readLW link
(medium.com)

Re­spon­si­ble scal­ing policy TLDR

lemonhopeSep 28, 2023, 6:51 PM
9 points
0 comments1 min readLW link

Mea­sur­ing and Im­prov­ing the Faith­ful­ness of Model-Gen­er­ated Rea­son­ing

Jul 18, 2023, 4:36 PM
111 points
15 comments6 min readLW link1 review

The dreams of GPT-4

RomanSMar 20, 2023, 5:00 PM
14 points
7 comments9 min readLW link

Nav­i­gat­ing the Attackspace

Jonas KgomoDec 12, 2023, 1:59 PM
1 point
0 comments2 min readLW link

Ar­tifi­cial Static Place In­tel­li­gence: Guaran­teed Alignment

ankFeb 15, 2025, 11:08 AM
2 points
2 comments2 min readLW link

Think­ing About Propen­sity Evaluations

Aug 19, 2024, 9:23 AM
9 points
0 comments27 min readLW link

A Tax­on­omy Of AI Sys­tem Evaluations

Aug 19, 2024, 9:07 AM
13 points
0 comments14 min readLW link

Can Cur­rent LLMs be Trusted To Pro­duce Paper­clips Safely?

Rohit ChatterjeeAug 19, 2024, 5:17 PM
4 points
0 comments9 min readLW link

Sys­tem­atic Sand­bag­ging Eval­u­a­tions on Claude 3.5 Sonnet

farrelmahaztraFeb 14, 2025, 1:22 AM
13 points
0 comments1 min readLW link
(farrelmahaztra.com)

Find­ing De­cep­tion in Lan­guage Models

Aug 20, 2024, 9:42 AM
18 points
4 comments4 min readLW link

In­tro­duc­ing REBUS: A Ro­bust Eval­u­a­tion Bench­mark of Un­der­stand­ing Symbols

Jan 15, 2024, 9:21 PM
33 points
0 comments1 min readLW link

OpenAI Credit Ac­count (2510$)

Emirhan BULUTJan 21, 2024, 2:30 AM
1 point
0 comments1 min readLW link

Orthog­o­nal­ity or the “Hu­man Worth Hy­poth­e­sis”?

JeffsJan 23, 2024, 12:57 AM
21 points
31 comments3 min readLW link

LLMs can strate­gi­cally de­ceive while do­ing gain-of-func­tion re­search

Igor IvanovJan 24, 2024, 3:45 PM
33 points
4 comments11 min readLW link

OpenAI Credit Ac­count (2510$)

Emirhan BULUTJan 21, 2024, 2:32 AM
1 point
0 comments1 min readLW link

Ques­tions I’d Want to Ask an AGI+ to Test Its Un­der­stand­ing of Ethics

sweenesmJan 26, 2024, 11:40 PM
14 points
6 comments4 min readLW link

Do mod­els know when they are be­ing eval­u­ated?

Feb 17, 2025, 11:13 PM
54 points
3 comments12 min readLW link

Skep­ti­cism About Deep­Mind’s “Grand­mas­ter-Level” Chess Without Search

Arjun PanicksseryFeb 12, 2024, 12:56 AM
57 points
13 comments3 min readLW link

In­tro­duc­ing METR’s Au­ton­omy Eval­u­a­tion Resources

Mar 15, 2024, 11:16 PM
90 points
0 comments1 min readLW link
(metr.github.io)

AI Safety Eval­u­a­tions: A Reg­u­la­tory Review

Mar 19, 2024, 3:05 PM
22 points
1 comment11 min readLW link

Solv­ing ad­ver­sar­ial at­tacks in com­puter vi­sion as a baby ver­sion of gen­eral AI alignment

Stanislav FortAug 29, 2024, 5:17 PM
87 points
8 comments7 min readLW link

Mea­sur­ing Pre­dictabil­ity of Per­sona Evaluations

Apr 6, 2024, 8:46 AM
20 points
0 comments7 min readLW link

Claude wants to be conscious

Joe KwonApr 13, 2024, 1:40 AM
2 points
8 comments6 min readLW link

LLM Eval­u­a­tors Rec­og­nize and Fa­vor Their Own Generations

Apr 17, 2024, 9:09 PM
44 points
1 comment3 min readLW link
(tiny.cc)

In­duc­ing Un­prompted Misal­ign­ment in LLMs

Apr 19, 2024, 8:00 PM
38 points
7 comments16 min readLW link

An In­tro­duc­tion to AI Sandbagging

Apr 26, 2024, 1:40 PM
45 points
13 comments8 min readLW link

METR is hiring ML Re­search Eng­ineers and Scientists

XodarapJun 5, 2024, 9:27 PM
5 points
0 comments1 min readLW link
(metr.org)

Catas­trophic Cy­ber Ca­pa­bil­ities Bench­mark (3CB): Ro­bustly Eval­u­at­ing LLM Agent Cy­ber Offense Capabilities

Nov 5, 2024, 1:01 AM
8 points
0 comments6 min readLW link
(www.apartresearch.com)

Static Place AI Makes Agen­tic AI Re­dun­dant: Mul­tiver­sal AI Align­ment & Ra­tional Utopia

ankFeb 13, 2025, 10:35 PM
1 point
2 comments11 min readLW link

AI Safety In­sti­tute’s In­spect hello world ex­am­ple for AI evals

TheManxLoinerMay 16, 2024, 8:47 PM
3 points
0 comments1 min readLW link
(lovkush.medium.com)

[Paper] AI Sand­bag­ging: Lan­guage Models can Strate­gi­cally Un­der­perform on Evaluations

Jun 13, 2024, 10:04 AM
84 points
10 comments2 min readLW link
(arxiv.org)

When fine-tun­ing fails to elicit GPT-3.5′s chess abilities

Theodore ChapmanJun 14, 2024, 6:50 PM
42 points
3 comments9 min readLW link

Re­sults from the AI x Democ­racy Re­search Sprint

Jun 14, 2024, 4:40 PM
13 points
0 comments6 min readLW link

Toward a tax­on­omy of cog­ni­tive bench­marks for agen­tic AGIs

Ben SmithJun 27, 2024, 11:50 PM
15 points
0 comments5 min readLW link

Re­view of METR’s pub­lic eval­u­a­tion protocol

Jun 30, 2024, 10:03 PM
10 points
0 comments5 min readLW link

In­tel­li­gence–Agency Equiv­alence ≈ Mass–En­ergy Equiv­alence: On Static Na­ture of In­tel­li­gence & Phys­i­cal­iza­tion of Ethics

ankFeb 22, 2025, 12:12 AM
1 point
0 comments6 min readLW link

Can star­tups be im­pact­ful in AI safety?

Sep 13, 2024, 7:00 PM
15 points
0 comments6 min readLW link

Abla­tions for “Fron­tier Models are Ca­pable of In-con­text Schem­ing”

Dec 17, 2024, 11:58 PM
115 points
1 comment2 min readLW link

Se­cret Col­lu­sion: Will We Know When to Un­plug AI?

Sep 16, 2024, 4:07 PM
56 points
7 comments31 min readLW link

Auto-En­hance: Devel­op­ing a meta-bench­mark to mea­sure LLM agents’ abil­ity to im­prove other agents

Jul 22, 2024, 12:33 PM
20 points
0 comments14 min readLW link

An­a­lyz­ing Deep­Mind’s Prob­a­bil­is­tic Meth­ods for Eval­u­at­ing Agent Capabilities

Jul 22, 2024, 4:17 PM
69 points
0 comments16 min readLW link

A Vi­sual Task that’s Hard for GPT-4o, but Doable for Pri­mary Schoolers

Lennart FinkeJul 26, 2024, 5:51 PM
25 points
6 comments2 min readLW link

Cri­tiques of the AI con­trol agenda

JozdienFeb 14, 2024, 7:25 PM
48 points
14 comments9 min readLW link

o1-pre­view is pretty good at do­ing ML on an un­known dataset

Håvard Tveit IhleSep 20, 2024, 8:39 AM
67 points
1 comment2 min readLW link

[Paper] Hid­den in Plain Text: Emer­gence and Miti­ga­tion of Stegano­graphic Col­lu­sion in LLMs

Sep 25, 2024, 2:52 PM
36 points
2 comments4 min readLW link
(arxiv.org)

Join the $10K Au­toHack 2024 Tournament

Paul BricmanSep 25, 2024, 11:54 AM
5 points
0 comments1 min readLW link
(noemaresearch.com)

LLM Psy­cho­met­rics and Prompt-In­duced Psychopathy

Korbinian K.Oct 18, 2024, 6:11 PM
12 points
2 comments10 min readLW link

Im­prov­ing Model-Writ­ten Evals for AI Safety Benchmarking

Oct 15, 2024, 6:25 PM
30 points
0 comments18 min readLW link

Sab­o­tage Eval­u­a­tions for Fron­tier Models

Oct 18, 2024, 10:33 PM
94 points
56 comments6 min readLW link
(assets.anthropic.com)

A Poem Is All You Need: Jailbreak­ing ChatGPT, Meta & More

Sharat Jacob JacobOct 29, 2024, 12:41 PM
12 points
0 comments9 min readLW link

Agency over­hang as a proxy for Sharp left turn

Nov 7, 2024, 12:14 PM
6 points
0 comments5 min readLW link

Call for eval­u­a­tors: Par­ti­ci­pate in the Euro­pean AI Office work­shop on gen­eral-pur­pose AI mod­els and sys­temic risks

Nov 27, 2024, 2:54 AM
30 points
0 comments2 min readLW link

How to make evals for the AISI evals bounty

TheManxLoinerDec 3, 2024, 10:44 AM
9 points
0 comments5 min readLW link

Give Neo a Chance

ankMar 6, 2025, 1:48 AM
3 points
7 comments7 min readLW link

Build­ing AI safety bench­mark en­vi­ron­ments on themes of uni­ver­sal hu­man values

Roland PihlakasJan 3, 2025, 4:24 AM
18 points
3 comments8 min readLW link
(docs.google.com)

Un­der­stand­ing Bench­marks and mo­ti­vat­ing Evaluations

Feb 6, 2025, 1:32 AM
9 points
0 comments11 min readLW link
(ai-safety-atlas.com)

On­tolog­i­cal Val­i­da­tion Man­i­festo for AIs

Alejandra Ivone Rojas ReynaMar 14, 2025, 4:34 PM
1 point
0 comments72 min readLW link

Some les­sons from the OpenAI-Fron­tierMath debacle

7vikJan 19, 2025, 9:09 PM
63 points
9 comments4 min readLW link

On­tolog­i­cal Val­i­da­tion Man­i­festo for AIs

Alejandra Ivone Rojas ReynaMar 22, 2025, 12:26 AM
1 point
0 comments71 min readLW link

A sketch of an AI con­trol safety case

Jan 30, 2025, 5:28 PM
60 points
0 comments5 min readLW link

Re­veal­ing al­ign­ment fak­ing with a sin­gle prompt

Florian_DietzJan 29, 2025, 9:01 PM
9 points
5 comments4 min readLW link

Notable run­away-op­ti­miser-like LLM failure modes on Biolog­i­cally and Eco­nom­i­cally al­igned AI safety bench­marks for LLMs with sim­plified ob­ser­va­tion format

Mar 16, 2025, 11:23 PM
36 points
6 comments7 min readLW link

AI Deep­Seek is Aware

EyonJan 31, 2025, 12:40 PM
1 point
0 comments6 min readLW link

Towards a Science of Evals for Sycophancy

andrejfsantosFeb 1, 2025, 9:17 PM
6 points
0 comments8 min readLW link

Align­ment Can Re­duce Perfor­mance on Sim­ple Eth­i­cal Questions

Daan HenselmansFeb 3, 2025, 7:35 PM
15 points
7 comments6 min readLW link

Can Per­sua­sion Break AI Safety? Ex­plor­ing the In­ter­play Between Fine-Tun­ing, At­tacks, and Guardrails

Devina JainFeb 4, 2025, 7:10 PM
3 points
0 comments10 min readLW link

How to miti­gate sandbagging

Teun van der WeijMar 23, 2025, 5:19 PM
23 points
0 comments8 min readLW link
No comments.