Felix Hofstätter
Karma: 162
An Introduction to AI Sandbagging
Teun van der Weij, Felix Hofstätter and Francis Rhys Ward
26 Apr 2024 13:40 UTC, 41 points, 5 comments, 8 min read, LW link
Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?
Teun van der Weij, Felix Hofstätter and Francis Rhys Ward
29 Jan 2024 0:24 UTC, 39 points, 5 comments, 4 min read, LW link
Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models
Felix Hofstätter, Francis Rhys Ward, HarrietW, LAThomson, Ollie J, Patrik Bartak and Sam F. Brown
8 Nov 2023 11:37 UTC, 49 points, 0 comments, 18 min read, LW link
Understanding the Information Flow inside Large Language Models
Felix Hofstätter and cozyfractal
15 Aug 2023 21:13 UTC, 19 points, 0 comments, 17 min read, LW link
Explaining the Transformer Circuits Framework by Example
Felix Hofstätter
25 Apr 2023 13:45 UTC, 8 points, 0 comments, 15 min read, LW link
Reflections On The Feasibility Of Scalable-Oversight
Felix Hofstätter
10 Mar 2023 7:54 UTC, 11 points, 0 comments, 12 min read, LW link
An investigation into when agents may be incentivized to manipulate our beliefs.
Felix Hofstätter
13 Sep 2022 17:08 UTC, 15 points, 0 comments, 14 min read, LW link
On Preference Manipulation in Reward Learning Processes
Felix Hofstätter
15 Aug 2022 19:32 UTC, 8 points, 0 comments, 4 min read, LW link