RSS

AI Evaluations

TagLast edit: 1 Aug 2023 1:03 UTC by duck_master

AI Evaluations focus on experimentally assessing the capabilities, safety, and alignment of advanced AI systems. These evaluations can be divided into two main categories: behavioral and understanding-based.

(note: initially written by GPT4, may contain errors despite a human review. Please correct them if you see them)

Behavioral evaluations assess a model’s abilities in various tasks, such as autonomously replicating, acquiring resources, and avoiding being shut down. However, a concern with these evaluations is that they may not be sufficient to detect deceptive alignment, making it difficult to ensure that models are non-deceptive.

Understanding-based evaluations, on the other hand, evaluate a developer’s ability to understand the model they have created and why they have obtained the model. This approach can be more useful in terms of safety, as it focuses on understanding the model’s behavior instead of just checking the behavior itself. Coupling understanding-based evaluations with behavioral evaluations can lead to a more comprehensive assessment of AI safety and alignment.

Current challenges in AI evaluations include:

(this text was initially written by GPT4, taking in as input A very crude deception eval is already passed, ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so, and Towards understanding-based safety evaluations)

See also:

How evals might (or might not) pre­vent catas­trophic risks from AI

Akash7 Feb 2023 20:16 UTC
45 points
0 comments9 min readLW link

The case for more am­bi­tious lan­guage model evals

Jozdien30 Jan 2024 0:01 UTC
110 points
30 comments5 min readLW link

When can we trust model eval­u­a­tions?

evhub28 Jul 2023 19:42 UTC
157 points
9 comments10 min readLW link

Towards un­der­stand­ing-based safety evaluations

evhub15 Mar 2023 18:18 UTC
164 points
16 comments5 min readLW link

An­nounc­ing Apollo Research

30 May 2023 16:17 UTC
215 points
11 comments8 min readLW link

Thoughts on shar­ing in­for­ma­tion about lan­guage model capabilities

paulfchristiano31 Jul 2023 16:04 UTC
208 points
36 comments11 min readLW link

Model Or­ganisms of Misal­ign­ment: The Case for a New Pillar of Align­ment Research

8 Aug 2023 1:30 UTC
312 points
28 comments18 min readLW link

Scal­able And Trans­fer­able Black-Box Jailbreaks For Lan­guage Models Via Per­sona Modulation

7 Nov 2023 17:59 UTC
36 points
2 comments2 min readLW link
(arxiv.org)

How good are LLMs at do­ing ML on an un­known dataset?

Håvard Tveit Ihle1 Jul 2024 9:04 UTC
33 points
4 comments13 min readLW link

OMMC An­nounces RIP

1 Apr 2024 23:20 UTC
188 points
5 comments2 min readLW link

Deep­Mind: Model eval­u­a­tion for ex­treme risks

Zach Stein-Perlman25 May 2023 3:00 UTC
94 points
11 comments1 min readLW link
(arxiv.org)

Deep­Mind: Eval­u­at­ing Fron­tier Models for Danger­ous Capabilities

Zach Stein-Perlman21 Mar 2024 3:00 UTC
61 points
8 comments1 min readLW link
(arxiv.org)

Bounty: Di­verse hard tasks for LLM agents

17 Dec 2023 1:04 UTC
49 points
31 comments16 min readLW link

OpenAI: Pre­pared­ness framework

Zach Stein-Perlman18 Dec 2023 18:30 UTC
70 points
23 comments4 min readLW link
(openai.com)

An­nounc­ing Hu­man-al­igned AI Sum­mer School

22 May 2024 8:55 UTC
50 points
0 comments1 min readLW link
(humanaligned.ai)

METR is hiring!

Beth Barnes26 Dec 2023 21:00 UTC
65 points
1 comment1 min readLW link

“Suc­cess­ful lan­guage model evals” by Ja­son Wei

Arjun Panickssery25 May 2024 9:34 UTC
7 points
0 comments1 min readLW link
(www.jasonwei.net)

A starter guide for evals

8 Jan 2024 18:24 UTC
50 points
2 comments12 min readLW link
(www.apolloresearch.ai)

AI com­pa­nies aren’t re­ally us­ing ex­ter­nal evaluators

Zach Stein-Perlman24 May 2024 16:01 UTC
240 points
15 comments4 min readLW link

The Leeroy Jenk­ins prin­ci­ple: How faulty AI could guaran­tee “warn­ing shots”

titotal14 Jan 2024 15:03 UTC
46 points
6 comments1 min readLW link
(titotal.substack.com)

We need a Science of Evals

22 Jan 2024 20:30 UTC
71 points
13 comments9 min readLW link

Soft Prompts for Eval­u­a­tion: Mea­sur­ing Con­di­tional Dis­tance of Capabilities

porby2 Feb 2024 5:49 UTC
47 points
1 comment4 min readLW link
(1drv.ms)

Apollo Re­search 1-year update

29 May 2024 17:44 UTC
93 points
0 comments7 min readLW link

Clar­ify­ing METR’s Au­dit­ing Role

Beth Barnes30 May 2024 18:41 UTC
108 points
1 comment2 min readLW link

Pro­to­col eval­u­a­tions: good analo­gies vs control

Fabien Roger19 Feb 2024 18:00 UTC
42 points
10 comments11 min readLW link

Self-Aware­ness: Tax­on­omy and eval suite proposal

Daniel Kokotajlo17 Feb 2024 1:47 UTC
63 points
2 comments11 min readLW link

Send us ex­am­ple gnarly bugs

10 Dec 2023 5:23 UTC
77 points
10 comments2 min readLW link

Third-party test­ing as a key in­gre­di­ent of AI policy

Zac Hatfield-Dodds25 Mar 2024 22:40 UTC
11 points
1 comment12 min readLW link
(www.anthropic.com)

Run evals on base mod­els too!

orthonormal4 Apr 2024 18:43 UTC
47 points
6 comments1 min readLW link

Mechanis­ti­cally Elic­it­ing La­tent Be­hav­iors in Lan­guage Models

30 Apr 2024 18:51 UTC
205 points
42 comments45 min readLW link

Com­par­ing Quan­tized Perfor­mance in Llama Models

NickyP15 Jul 2024 16:01 UTC
32 points
2 comments8 min readLW link

[In­terim re­search re­port] Eval­u­at­ing the Goal-Direct­ed­ness of Lan­guage Models

18 Jul 2024 18:19 UTC
39 points
4 comments11 min readLW link

AXRP Epi­sode 34 - AI Eval­u­a­tions with Beth Barnes

DanielFilan28 Jul 2024 3:30 UTC
23 points
0 comments69 min readLW link

To CoT or not to CoT? Chain-of-thought helps mainly on math and sym­bolic reasoning

Bogdan Ionut Cirstea19 Sep 2024 16:13 UTC
21 points
1 comment1 min readLW link
(arxiv.org)

In­ves­ti­gat­ing the Abil­ity of LLMs to Rec­og­nize Their Own Writing

30 Jul 2024 15:41 UTC
32 points
0 comments15 min readLW link

Can Gen­er­al­ized Ad­ver­sar­ial Test­ing En­able More Ri­gor­ous LLM Safety Evals?

scasper30 Jul 2024 14:57 UTC
25 points
0 comments4 min readLW link

Twit­ter thread on AI safety evals

Richard_Ngo31 Jul 2024 0:18 UTC
62 points
3 comments2 min readLW link
(x.com)

GPT-4o Sys­tem Card

Zach Stein-Perlman8 Aug 2024 20:30 UTC
68 points
11 comments2 min readLW link
(openai.com)

An is­sue with train­ing schemers with su­per­vised fine-tuning

Fabien Roger27 Jun 2024 15:37 UTC
49 points
12 comments6 min readLW link

Model evals for dan­ger­ous capabilities

Zach Stein-Perlman23 Sep 2024 11:00 UTC
51 points
11 comments3 min readLW link

New Ca­pa­bil­ities, New Risks? - Eval­u­at­ing Agen­tic Gen­eral As­sis­tants us­ing Ele­ments of GAIA & METR Frameworks

Tej Lander29 Sep 2024 18:58 UTC
5 points
0 comments29 min readLW link

Bi­as­ing VLM Re­sponse with Vi­sual Stimuli

Jaehyuk Lim3 Oct 2024 18:04 UTC
5 points
0 comments8 min readLW link

An Opinionated Evals Read­ing List

15 Oct 2024 14:38 UTC
65 points
0 comments13 min readLW link
(www.apolloresearch.ai)

BIG-Bench Ca­nary Con­tam­i­na­tion in GPT-4

Jozdien22 Oct 2024 15:40 UTC
123 points
13 comments4 min readLW link

UK AISI: Early les­sons from eval­u­at­ing fron­tier AI systems

Zach Stein-Perlman25 Oct 2024 19:00 UTC
26 points
0 comments2 min readLW link
(www.aisi.gov.uk)

The Evals Gap

Marius Hobbhahn11 Nov 2024 16:42 UTC
55 points
7 comments7 min readLW link
(www.apolloresearch.ai)

Which evals re­sources would be good?

Marius Hobbhahn16 Nov 2024 14:24 UTC
47 points
4 comments5 min readLW link

[Question] Can GPT-4 play 20 ques­tions against an­other in­stance of it­self?

Nathan Helm-Burger28 Mar 2023 1:11 UTC
15 points
1 comment1 min readLW link
(evanthebouncy.medium.com)

Re­spon­si­ble De­ploy­ment in 20XX

Carson20 Apr 2023 0:24 UTC
4 points
0 comments4 min readLW link

Refram­ing the bur­den of proof: Com­pa­nies should prove that mod­els are safe (rather than ex­pect­ing au­di­tors to prove that mod­els are dan­ger­ous)

Akash25 Apr 2023 18:49 UTC
27 points
11 comments3 min readLW link
(childrenoficarus.substack.com)

Prevent­ing Lan­guage Models from hid­ing their reasoning

31 Oct 2023 14:34 UTC
111 points
14 comments12 min readLW link

Eval­u­at­ing strate­gic rea­son­ing in GPT models

phelps-sg25 May 2023 11:51 UTC
4 points
1 comment8 min readLW link

ARC Evals new re­port: Eval­u­at­ing Lan­guage-Model Agents on Real­is­tic Au­tonomous Tasks

Beth Barnes1 Aug 2023 18:30 UTC
153 points
12 comments5 min readLW link
(evals.alignment.org)

Apollo Re­search is hiring evals and in­ter­pretabil­ity en­g­ineers & scientists

Marius Hobbhahn4 Aug 2023 10:54 UTC
25 points
0 comments2 min readLW link

Au­tonomous repli­ca­tion and adap­ta­tion: an at­tempt at a con­crete dan­ger threshold

Hjalmar_Wijk17 Aug 2023 1:31 UTC
44 points
0 comments13 min readLW link

Manag­ing risks of our own work

Beth Barnes18 Aug 2023 0:41 UTC
66 points
0 comments2 min readLW link

[Question] Would more model evals teams be good?

Ryan Kidd25 Feb 2023 22:01 UTC
20 points
4 comments1 min readLW link

A very crude de­cep­tion eval is already passed

Beth Barnes29 Oct 2021 17:57 UTC
108 points
6 comments2 min readLW link

ARC tests to see if GPT-4 can es­cape hu­man con­trol; GPT-4 failed to do so

Christopher King15 Mar 2023 0:29 UTC
116 points
22 comments2 min readLW link

More in­for­ma­tion about the dan­ger­ous ca­pa­bil­ity eval­u­a­tions we did with GPT-4 and Claude.

Beth Barnes19 Mar 2023 0:25 UTC
233 points
54 comments8 min readLW link
(evals.alignment.org)

A Vi­sual Task that’s Hard for GPT-4o, but Doable for Pri­mary Schoolers

Lennart Finke26 Jul 2024 17:51 UTC
25 points
4 comments2 min readLW link

Im­prov­ing the safety of AI evals

17 May 2023 22:24 UTC
13 points
7 comments7 min readLW link

Cri­tiques of the AI con­trol agenda

Jozdien14 Feb 2024 19:25 UTC
47 points
14 comments9 min readLW link

The Com­pleat Cybornaut

19 May 2023 8:44 UTC
64 points
2 comments16 min readLW link

Re­spon­si­ble scal­ing policy TLDR

lemonhope28 Sep 2023 18:51 UTC
9 points
0 comments1 min readLW link

Seek­ing (Paid) Case Stud­ies on Standards

HoldenKarnofsky26 May 2023 17:58 UTC
69 points
9 comments11 min readLW link

o1-pre­view is pretty good at do­ing ML on an un­known dataset

Håvard Tveit Ihle20 Sep 2024 8:39 UTC
67 points
1 comment2 min readLW link

[Question] AI Rights: In your view, what would be re­quired for an AGI to gain rights and pro­tec­tions from the var­i­ous Govern­ments of the World?

Super AGI9 Jun 2023 1:24 UTC
10 points
26 comments1 min readLW link

Challenge pro­posal: small­est pos­si­ble self-hard­en­ing back­door for RLHF

Christopher King29 Jun 2023 16:56 UTC
7 points
0 comments2 min readLW link

Ro­bust­ness of Model-Graded Eval­u­a­tions and Au­to­mated Interpretability

15 Jul 2023 19:12 UTC
47 points
5 comments9 min readLW link

[Paper] Hid­den in Plain Text: Emer­gence and Miti­ga­tion of Stegano­graphic Col­lu­sion in LLMs

25 Sep 2024 14:52 UTC
30 points
2 comments4 min readLW link
(arxiv.org)

Join the $10K Au­toHack 2024 Tournament

Paul Bricman25 Sep 2024 11:54 UTC
5 points
0 comments1 min readLW link
(noemaresearch.com)

The “spel­ling mir­a­cle”: GPT-3 spel­ling abil­ities and glitch to­kens revisited

mwatkins31 Jul 2023 19:47 UTC
85 points
29 comments20 min readLW link

LLM Psy­cho­met­rics and Prompt-In­duced Psychopathy

Korbinian K.18 Oct 2024 18:11 UTC
12 points
2 comments10 min readLW link

Can Cur­rent LLMs be Trusted To Pro­duce Paper­clips Safely?

Rohit Chatterjee19 Aug 2024 17:17 UTC
4 points
0 comments9 min readLW link

A Tax­on­omy Of AI Sys­tem Evaluations

19 Aug 2024 9:07 UTC
12 points
0 comments14 min readLW link

Im­prov­ing Model-Writ­ten Evals for AI Safety Benchmarking

15 Oct 2024 18:25 UTC
26 points
0 comments18 min readLW link

Sab­o­tage Eval­u­a­tions for Fron­tier Models

18 Oct 2024 22:33 UTC
93 points
55 comments6 min readLW link
(assets.anthropic.com)

A Poem Is All You Need: Jailbreak­ing ChatGPT, Meta & More

Sharat Jacob Jacob29 Oct 2024 12:41 UTC
12 points
0 comments9 min readLW link

Eval­u­at­ing Su­per­hu­man Models with Con­sis­tency Checks

1 Aug 2023 7:51 UTC
21 points
2 comments9 min readLW link
(arxiv.org)

Agency over­hang as a proxy for Sharp left turn

7 Nov 2024 12:14 UTC
5 points
0 comments5 min readLW link

Mea­sur­ing and Im­prov­ing the Faith­ful­ness of Model-Gen­er­ated Rea­son­ing

18 Jul 2023 16:36 UTC
111 points
14 comments6 min readLW link

Think­ing About Propen­sity Evaluations

19 Aug 2024 9:23 UTC
8 points
0 comments27 min readLW link

Call for eval­u­a­tors: Par­ti­ci­pate in the Euro­pean AI Office work­shop on gen­eral-pur­pose AI mod­els and sys­temic risks

27 Nov 2024 2:54 UTC
30 points
0 comments2 min readLW link

How to make evals for the AISI evals bounty

TheManxLoiner3 Dec 2024 10:44 UTC
2 points
0 comments5 min readLW link

Nav­i­gat­ing the Attackspace

Jonas Kgomo12 Dec 2023 13:59 UTC
1 point
0 comments2 min readLW link

Pro­posal on AI eval­u­a­tion: false-proving

ProgramCrafter31 Mar 2023 12:12 UTC
1 point
2 comments1 min readLW link

The dreams of GPT-4

RomanS20 Mar 2023 17:00 UTC
14 points
7 comments9 min readLW link

Re­pro­duc­ing ARC Evals’ re­cent re­port on lan­guage model agents

Thomas Broadley1 Sep 2023 16:52 UTC
103 points
17 comments3 min readLW link
(thomasbroadley.com)

LM Si­tu­a­tional Aware­ness, Eval­u­a­tion Pro­posal: Vio­lat­ing Imitation

Jacob Pfau26 Apr 2023 22:53 UTC
16 points
2 comments2 min readLW link

MMLU’s Mo­ral Sce­nar­ios Bench­mark Doesn’t Mea­sure What You Think it Measures

corey morris27 Sep 2023 17:54 UTC
18 points
2 comments4 min readLW link
(medium.com)

LLMs can strate­gi­cally de­ceive while do­ing gain-of-func­tion re­search

Igor Ivanov24 Jan 2024 15:45 UTC
33 points
4 comments11 min readLW link

OpenAI Credit Ac­count (2510$)

Emirhan BULUT21 Jan 2024 2:32 UTC
1 point
0 comments1 min readLW link

Ques­tions I’d Want to Ask an AGI+ to Test Its Un­der­stand­ing of Ethics

sweenesm26 Jan 2024 23:40 UTC
14 points
6 comments4 min readLW link

Find­ing De­cep­tion in Lan­guage Models

20 Aug 2024 9:42 UTC
18 points
4 comments4 min readLW link

Skep­ti­cism About Deep­Mind’s “Grand­mas­ter-Level” Chess Without Search

Arjun Panickssery12 Feb 2024 0:56 UTC
55 points
13 comments3 min readLW link

Tall Tales at Differ­ent Scales: Eval­u­at­ing Scal­ing Trends For De­cep­tion In Lan­guage Models

8 Nov 2023 11:37 UTC
49 points
0 comments18 min readLW link

The­o­ries of Change for AI Auditing

13 Nov 2023 19:33 UTC
54 points
0 comments18 min readLW link
(www.apolloresearch.ai)

A sim­ple treach­er­ous turn demonstration

nikola25 Nov 2023 4:51 UTC
22 points
5 comments3 min readLW link

In­tro­duc­ing METR’s Au­ton­omy Eval­u­a­tion Resources

15 Mar 2024 23:16 UTC
90 points
0 comments1 min readLW link
(metr.github.io)

AI Safety Eval­u­a­tions: A Reg­u­la­tory Review

19 Mar 2024 15:05 UTC
21 points
1 comment11 min readLW link

Orthog­o­nal­ity or the “Hu­man Worth Hy­poth­e­sis”?

Jeffs23 Jan 2024 0:57 UTC
21 points
31 comments3 min readLW link

A call for a quan­ti­ta­tive re­port card for AI bioter­ror­ism threat models

Juno4 Dec 2023 6:35 UTC
12 points
0 comments10 min readLW link

Pro­tect­ing against sud­den ca­pa­bil­ity jumps dur­ing training

nikola2 Dec 2023 4:22 UTC
15 points
2 comments2 min readLW link

Solv­ing ad­ver­sar­ial at­tacks in com­puter vi­sion as a baby ver­sion of gen­eral AI alignment

Stanislav Fort29 Aug 2024 17:17 UTC
87 points
8 comments7 min readLW link

OpenAI Credit Ac­count (2510$)

Emirhan BULUT21 Jan 2024 2:30 UTC
1 point
0 comments1 min readLW link

The Method of Loci: With some brief re­marks, in­clud­ing trans­form­ers and eval­u­at­ing AIs

Bill Benzon2 Dec 2023 14:36 UTC
6 points
0 comments3 min readLW link

Mea­sur­ing Pre­dictabil­ity of Per­sona Evaluations

6 Apr 2024 8:46 UTC
20 points
0 comments7 min readLW link

Claude wants to be conscious

Joe Kwon13 Apr 2024 1:40 UTC
2 points
8 comments6 min readLW link

LLM Eval­u­a­tors Rec­og­nize and Fa­vor Their Own Generations

17 Apr 2024 21:09 UTC
44 points
1 comment3 min readLW link
(tiny.cc)

In­duc­ing Un­prompted Misal­ign­ment in LLMs

19 Apr 2024 20:00 UTC
38 points
6 comments16 min readLW link

An In­tro­duc­tion to AI Sandbagging

26 Apr 2024 13:40 UTC
44 points
10 comments8 min readLW link

What’s new at FAR AI

4 Dec 2023 21:18 UTC
41 points
0 comments5 min readLW link
(far.ai)

METR is hiring ML Re­search Eng­ineers and Scientists

Xodarap5 Jun 2024 21:27 UTC
5 points
0 comments1 min readLW link
(metr.org)

Catas­trophic Cy­ber Ca­pa­bil­ities Bench­mark (3CB): Ro­bustly Eval­u­at­ing LLM Agent Cy­ber Offense Capabilities

5 Nov 2024 1:01 UTC
8 points
0 comments6 min readLW link
(www.apartresearch.com)

AI Safety In­sti­tute’s In­spect hello world ex­am­ple for AI evals

TheManxLoiner16 May 2024 20:47 UTC
3 points
0 comments1 min readLW link
(lovkush.medium.com)

[Paper] AI Sand­bag­ging: Lan­guage Models can Strate­gi­cally Un­der­perform on Evaluations

13 Jun 2024 10:04 UTC
84 points
10 comments2 min readLW link
(arxiv.org)

When fine-tun­ing fails to elicit GPT-3.5′s chess abilities

Theodore Chapman14 Jun 2024 18:50 UTC
42 points
3 comments9 min readLW link

Re­sults from the AI x Democ­racy Re­search Sprint

14 Jun 2024 16:40 UTC
13 points
0 comments6 min readLW link

In­tro­duc­ing REBUS: A Ro­bust Eval­u­a­tion Bench­mark of Un­der­stand­ing Symbols

15 Jan 2024 21:21 UTC
33 points
0 comments1 min readLW link

Toward a tax­on­omy of cog­ni­tive bench­marks for agen­tic AGIs

Ben Smith27 Jun 2024 23:50 UTC
15 points
0 comments5 min readLW link

Re­view of METR’s pub­lic eval­u­a­tion protocol

30 Jun 2024 22:03 UTC
10 points
0 comments5 min readLW link

Can star­tups be im­pact­ful in AI safety?

13 Sep 2024 19:00 UTC
12 points
0 comments6 min readLW link

2023 Align­ment Re­search Up­dates from FAR AI

4 Dec 2023 22:32 UTC
18 points
0 comments8 min readLW link
(far.ai)

Se­cret Col­lu­sion: Will We Know When to Un­plug AI?

16 Sep 2024 16:07 UTC
55 points
7 comments31 min readLW link

Eval­u­at­ing Lan­guage Model Be­havi­ours for Shut­down Avoidance in Tex­tual Scenarios

16 May 2023 10:53 UTC
26 points
0 comments13 min readLW link

Auto-En­hance: Devel­op­ing a meta-bench­mark to mea­sure LLM agents’ abil­ity to im­prove other agents

22 Jul 2024 12:33 UTC
20 points
0 comments14 min readLW link

An­a­lyz­ing Deep­Mind’s Prob­a­bil­is­tic Meth­ods for Eval­u­at­ing Agent Capabilities

22 Jul 2024 16:17 UTC
69 points
0 comments16 min readLW link
No comments.