
Sandbagging (AI)

Tag · Last edit: Mar 27, 2025, 6:20 PM by Raemon

Sandbagging is when an AI system deliberately underperforms during training or evaluation, appearing less capable than it actually is.

Notes on countermeasures for exploration hacking (aka sandbagging)

ryan_greenblatt · Mar 24, 2025, 6:39 PM
43 points
6 comments · 8 min read · LW link

Automated Researchers Can Subtly Sandbag

Mar 26, 2025, 7:13 PM
41 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

The “no sandbagging on checkable tasks” hypothesis

Joe Carlsmith · Jul 31, 2023, 11:06 PM
56 points
14 comments · 9 min read · LW link

An Introduction to AI Sandbagging

Apr 26, 2024, 1:40 PM
45 points
13 comments · 8 min read · LW link

How to mitigate sandbagging

Teun van der Weij · Mar 23, 2025, 5:19 PM
23 points
0 comments · 8 min read · LW link

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Jun 13, 2024, 10:04 AM
84 points
10 comments · 2 min read · LW link
(arxiv.org)

Won’t vs. Can’t: Sandbagging-like Behavior from Claude Models

Feb 19, 2025, 8:47 PM
15 points
1 comment · 1 min read · LW link
(alignment.anthropic.com)