mishajw

Karma: 137
HTTP/1.1 303 See other
Location: https://mishajw.com

Sabotage Evaluations for Frontier Models

18 Oct 2024 22:33 UTC
93 points
55 comments · 6 min read · LW link
(assets.anthropic.com)

How well do truth probes generalise?

mishajw · 24 Feb 2024 14:12 UTC
87 points
11 comments · 9 min read · LW link

Jailbreaking GPT-4 with the tool API

mishajw · 21 Feb 2024 11:16 UTC
20 points
2 comments · 4 min read · LW link

Distilled Representations Research Agenda

18 Oct 2022 20:59 UTC
15 points
2 comments · 8 min read · LW link