mrinank_sharma

Karma: 174

Best-of-N Jailbreaking

Dec 14, 2024, 4:58 AM
78 points
5 comments · 2 min read · LW link
(arxiv.org)

Towards Understanding Sycophancy in Language Models

Oct 24, 2023, 12:30 AM
66 points
0 comments · 2 min read · LW link
(arxiv.org)

Paper: Understanding and Controlling a Maze-Solving Policy Network

Oct 13, 2023, 1:38 AM
70 points
0 comments · 1 min read · LW link
(arxiv.org)