ChengCheng

Karma: 128

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave and Kellin Pelrine

Feb 7, 2025, 3:57 AM

29 points

0 comments, 10 min read, LW link

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

ChengCheng, Brendan Murphy, AdamGleave and Kellin Pelrine

Nov 1, 2024, 12:10 AM

18 points

0 comments, 6 min read, LW link

(far.ai)

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Adrià Garriga-alonso, taufeeque, AdamGleave and ChengCheng

Jul 25, 2024, 10:00 PM

59 points

8 comments, 2 min read, LW link

(arxiv.org)

Does robustness improve with scale?

ChengCheng, niki.h, Ian McKenzie, Oskar Hollinsworth, Tom Tseng and AdamGleave

Jul 25, 2024, 8:55 PM

14 points

0 comments, 1 min read, LW link

(far.ai)

VLM-RM: Specifying Rewards with Natural Language

ChengCheng, David Lindner and Ethan Perez

Oct 23, 2023, 2:11 PM

20 points

2 comments, 5 min read, LW link

(far.ai)

Uncovering Latent Human Wellbeing in LLM Embeddings

ChengCheng, Pedro Freire, Dan H and Scott Emmons

Sep 14, 2023, 1:40 AM

32 points

7 comments, 8 min read, LW link

(far.ai)

ChengCheng Mar 31, 2023, 12:46 AM
4 points
0
on: Speed running everyone through the bad alignement bingo. $5k bounty for a LW conversational agent
First of all, thank you @ArthurB for offering this bounty and raising awareness of the need for quality AI alignment educational resources! We are particularly grateful to those who mentioned the Stampy project and to the people who have reached out offering to help with our efforts. Our submission, https://chat.stampy.ai/, is a very early prototype focused primarily on summarizing and synthesizing information from our own database of FAQs, along with selected documents collected from the alignment research dataset. The conversational feature still requires considerable work. Nevertheless, we would love to get input and feedback to further develop this tool for anyone seeking to better understand or contribute to AI safety. This would not have been possible without the support of our volunteers and collaborators. We welcome all who are interested in using AI to advance alignment.
