
Francis Rhys Ward

Karma: 439

The Elicitation Game: Evaluating capability elicitation techniques

Feb 27, 2025, 8:33 PM
10 points
0 comments, 2 min read

Why care about AI personhood?

Francis Rhys Ward, Jan 26, 2025, 11:24 AM
44 points
6 comments, 3 min read

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Jun 13, 2024, 10:04 AM
84 points
10 comments, 2 min read (link post: arxiv.org)

An Introduction to AI Sandbagging

Apr 26, 2024, 1:40 PM
45 points
13 comments, 8 min read

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?

Jan 29, 2024, 12:24 AM
39 points
5 comments, 4 min read

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

Nov 8, 2023, 11:37 AM
49 points
0 comments, 18 min read

Reward Hacking from a Causal Perspective

Jul 21, 2023, 6:27 PM
29 points
6 comments, 7 min read

Agency from a causal perspective

Jun 30, 2023, 5:37 PM
40 points
5 comments, 6 min read

Causality: A Brief Introduction

Jun 20, 2023, 3:01 PM
49 points
18 comments, 6 min read

Introduction to Towards Causal Foundations of Safe AGI

Jun 12, 2023, 5:55 PM
67 points
6 comments, 4 min read

For every choice of AGI difficulty, conditioning on gradual take-off implies shorter timelines.

Francis Rhys Ward, Apr 21, 2022, 7:44 AM
31 points
13 comments, 3 min read