
AI Benchmarking

Tag · Last edit: Jul 16, 2023, 2:12 PM by rybolos

Broken Benchmark: MMLU

awg · Aug 29, 2023, 6:09 PM
24 points
5 comments · 1 min read · LW link
(www.youtube.com)

FrontierMath Score of o3-mini Much Lower Than Claimed

YafahEdelman · Mar 17, 2025, 10:41 PM
61 points
7 comments · 1 min read · LW link

Introducing BenchBench: An Industry Standard Benchmark for AI Strength

Jozdien · Apr 2, 2025, 2:11 AM
50 points
0 comments · 2 min read · LW link

The real reason AI benchmarks haven’t reflected economic impacts

Noosphere89 · Apr 15, 2025, 1:44 PM
15 points
0 comments · 1 min read · LW link
(epoch.ai)

Auto-Enhance: Developing a meta-benchmark to measure LLM agents’ ability to improve other agents

Jul 22, 2024, 12:33 PM
20 points
0 comments · 14 min read · LW link

Improving Model-Written Evals for AI Safety Benchmarking

Oct 15, 2024, 6:25 PM
30 points
0 comments · 18 min read · LW link

Workshop Report: Why current benchmarks approaches are not sufficient for safety?

Nov 26, 2024, 5:20 PM
3 points
1 comment · 3 min read · LW link

In-Context Scheming: A Run is Worth a Thousand Words

noise-field · Mar 7, 2025, 2:47 AM
10 points
0 comments · 1 min read · LW link
(github.com)

Building AI safety benchmark environments on themes of universal human values

Roland Pihlakas · Jan 3, 2025, 4:24 AM
18 points
3 comments · 8 min read · LW link
(docs.google.com)

Understanding Benchmarks and motivating Evaluations

Feb 6, 2025, 1:32 AM
10 points
0 comments · 11 min read · LW link
(ai-safety-atlas.com)

Some lessons from the OpenAI-FrontierMath debacle

7vik · Jan 19, 2025, 9:09 PM
69 points
9 comments · 4 min read · LW link

Revealing alignment faking with a single prompt

Florian_Dietz · Jan 29, 2025, 9:01 PM
9 points
5 comments · 4 min read · LW link

Detailed Ideal World Benchmark

Knight Lee · Jan 30, 2025, 2:31 AM
5 points
2 comments · 2 min read · LW link

Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format

Mar 16, 2025, 11:23 PM
37 points
6 comments · 7 min read · LW link

Edge Cases in AI Alignment

Florian_Dietz · Mar 24, 2025, 9:27 AM
19 points
3 comments · 4 min read · LW link

MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures

corey morris · Sep 27, 2023, 5:54 PM
18 points
2 comments · 4 min read · LW link
(medium.com)

Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols

Jan 15, 2024, 9:21 PM
33 points
0 comments · 1 min read · LW link

Large Language Models Pass the Turing Test

Matrice Jacobine · Apr 2, 2025, 5:41 AM
6 points
0 comments · 1 min read · LW link
(arxiv.org)

LLM Psychometrics: A Speculative Approach to AI Safety

pskl · Jan 29, 2024, 6:38 PM
3 points
4 comments · 1 min read · LW link
(pascal.cc)

Closed-ended questions aren’t as hard as you think

electroswing · Feb 19, 2025, 3:53 AM
6 points
0 comments · 3 min read · LW link

“Superhuman” Isn’t Well Specified

JustisMills · May 3, 2025, 11:42 PM
26 points
6 comments · 3 min read · LW link
(justismills.substack.com)