Teun van der Weij

Karma: 211

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

13 Jun 2024 10:04 UTC
84 points
10 comments, 2 min read, LW link
(arxiv.org)

An Introduction to AI Sandbagging

26 Apr 2024 13:40 UTC
45 points
13 comments, 8 min read, LW link

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?

29 Jan 2024 0:24 UTC
39 points
5 comments, 4 min read, LW link

List of projects that seem impactful for AI Governance

14 Jan 2024 16:53 UTC
14 points
0 comments, 13 min read, LW link

Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios

16 May 2023 10:53 UTC
26 points
0 comments, 13 min read, LW link