Felix Hofstätter

Karma: 247

The Elicitation Game: Evaluating capability elicitation techniques

Feb 27, 2025, 8:33 PM
10 points
0 comments · 2 min read · LW link

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Jun 13, 2024, 10:04 AM
84 points
10 comments · 2 min read · LW link
(arxiv.org)

An Introduction to AI Sandbagging

Apr 26, 2024, 1:40 PM
45 points
13 comments · 8 min read · LW link

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?

Jan 29, 2024, 12:24 AM
39 points
5 comments · 4 min read · LW link

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

Nov 8, 2023, 11:37 AM
49 points
0 comments · 18 min read · LW link

Understanding the Information Flow inside Large Language Models

Aug 15, 2023, 9:13 PM
19 points
0 comments · 17 min read · LW link

Explaining the Transformer Circuits Framework by Example

Felix Hofstätter · Apr 25, 2023, 1:45 PM
8 points
0 comments · 15 min read · LW link

Reflections On The Feasibility Of Scalable-Oversight

Felix Hofstätter · Mar 10, 2023, 7:54 AM
11 points
0 comments · 12 min read · LW link

An investigation into when agents may be incentivized to manipulate our beliefs.

Felix Hofstätter · Sep 13, 2022, 5:08 PM
15 points
0 comments · 14 min read · LW link

On Preference Manipulation in Reward Learning Processes

Felix Hofstätter · Aug 15, 2022, 7:32 PM
8 points
0 comments · 4 min read · LW link