mishajw

Karma: 143
HTTP/1.1 303 See Other
Location: https://mishajw.com

Sabotage Evaluations for Frontier Models

Oct 18, 2024, 10:33 PM
94 points
56 comments · 6 min read · LW link
(assets.anthropic.com)

How well do truth probes generalise?

mishajw · Feb 24, 2024, 2:12 PM
92 points
11 comments · 9 min read · LW link

Jailbreaking GPT-4 with the tool API

mishajw · Feb 21, 2024, 11:16 AM
20 points
2 comments · 4 min read · LW link

Distilled Representations Research Agenda

Oct 18, 2022, 8:59 PM
15 points
2 comments · 8 min read · LW link