Evan Hubinger (he/him/his) (evanjhub@gmail.com)
I am a research scientist at Anthropic where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I’m joining Anthropic”
Selected work:
From the post:
You’re welcome to try with a base model; it’ll probably be fine, but it might not learn to act as an assistant very well from just the backdoor training data. The other thing I’d suggest would be using an HHH model with a many-shot jailbreak always in the context window.