
AI Benchmarking

Last edit: 16 Jul 2023 14:12 UTC by rybolos

Broken Benchmark: MMLU

awg · 29 Aug 2023 18:09 UTC
24 points
5 comments · 1 min read · LW link
(www.youtube.com)

LLM Psychometrics: A Speculative Approach to AI Safety

pskl · 29 Jan 2024 18:38 UTC
3 points
4 comments · 1 min read · LW link
(pascal.cc)

Auto-Enhance: Developing a meta-benchmark to measure LLM agents’ ability to improve other agents

22 Jul 2024 12:33 UTC
20 points
0 comments · 14 min read · LW link

Improving Model-Written Evals for AI Safety Benchmarking

15 Oct 2024 18:25 UTC
27 points
0 comments · 18 min read · LW link

Workshop Report: Why current benchmarks approaches are not sufficient for safety?

26 Nov 2024 17:20 UTC
3 points
1 comment · 3 min read · LW link

Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols

15 Jan 2024 21:21 UTC
33 points
0 comments · 1 min read · LW link

MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures

corey morris · 27 Sep 2023 17:54 UTC
18 points
2 comments4 min readLW link
(medium.com)