Felix Hofstätter

Karma: 162

An Introduction to AI Sandbagging

Teun van der Weij, Felix Hofstätter and Francis Rhys Ward

26 Apr 2024 13:40 UTC

41 points

5 comments · 8 min read · LW link

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?

Teun van der Weij, Felix Hofstätter and Francis Rhys Ward

29 Jan 2024 0:24 UTC

39 points

5 comments · 4 min read · LW link

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

Felix Hofstätter, Francis Rhys Ward, HarrietW, LAThomson, Ollie J, Patrik Bartak and Sam F. Brown

8 Nov 2023 11:37 UTC

49 points

0 comments · 18 min read · LW link

Understanding the Information Flow inside Large Language Models

Felix Hofstätter and cozyfractal

15 Aug 2023 21:13 UTC

19 points

0 comments · 17 min read · LW link

Explaining the Transformer Circuits Framework by Example

Felix Hofstätter

25 Apr 2023 13:45 UTC

8 points

0 comments · 15 min read · LW link

Reflections On The Feasibility Of Scalable-Oversight

Felix Hofstätter

10 Mar 2023 7:54 UTC

11 points

0 comments · 12 min read · LW link

An investigation into when agents may be incentivized to manipulate our beliefs.

Felix Hofstätter

13 Sep 2022 17:08 UTC

15 points

0 comments · 14 min read · LW link

On Preference Manipulation in Reward Learning Processes

Felix Hofstätter

15 Aug 2022 19:32 UTC

8 points

0 comments · 4 min read · LW link