I’m doing research and other work focused on AI safety and AI catastrophic risk reduction. My current top projects are (last updated May 19, 2023):
Serving on the board of directors for AI Governance & Safety Canada
Technical researcher for Tony Barrett and collaborators on developing an AI risk management-standards profile for increasingly multi- or general-purpose AI, designed to be used in conjunction with the NIST AI RMF or the AI risk management standard ISO/IEC 23894
General areas of interest for me are AI safety strategy, comparative AI alignment research, prioritizing technical alignment work, analyzing the published alignment plans of major AI labs, interpretability, the Conditioning Predictive Models agenda, deconfusion research and other AI safety-related topics. My work is currently self-funded.
Research that I’ve authored or co-authored:
Steering Behaviour: Testing for (Non-)Myopia in Language Models
Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios
(Scroll down to read other posts and comments I’ve written)
Other recent work:
Running a regular coworking meetup in Vancouver, BC for people interested in AI safety and effective altruism
Facilitator for the AI Safety Fellowship (2022) at Columbia University Effective Altruism
Gave a talk on myopia and deceptive alignment at an AI safety event hosted by the University of Victoria (Jan 29, 2023)
Invited participant in the CLTC UC Berkeley Virtual Workshops on the “Risk Management-Standards Profile for Increasingly Multi- or General-Purpose AI” (Jan 2023 and May 2023)
Reviewed early pre-published drafts of work by other researchers:
Conditioning Predictive Models: Risks and Strategies by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson and Kate Woolverton
Circumventing interpretability: How to defeat mind-readers by Lee Sharkey
Actionable Guidance for High-Consequence AI Risk Management: Towards Standards Addressing AI Catastrophic Risks by Tony Barrett, Dan Hendrycks, Jessica Newman and Brandie Nonnecke
AI Safety Seems Hard to Measure by Holden Karnofsky
Racing through a minefield: the AI deployment problem by Holden Karnofsky
Alignment with argument-networks and assessment-predictions by Tor Økland Barstad
Interpreting Neural Networks through the Polytope Lens by Sid Black et al.
Jobs that can help with the most important century by Holden Karnofsky
DeepMind’s generalist AI, Gato: A non-technical explainer by Frances Lorenz, Nora Belrose and Jon Menaster
Potential Alignment mental tool: Keeping track of the types by Donald Hobson
Ideal Governance by Holden Karnofsky
Before getting into AI safety, I was a software engineer for 11 years at Google and various startups. You can find details about my previous work on my LinkedIn.
I’m always happy to connect with other researchers or people interested in AI alignment and effective altruism. Feel free to send me a private message!
Exposing the weaknesses of fine-tuned models like the Llama 3.1 Instruct models against refusal vector ablation is important because the industry currently seems to over-rely on these safety techniques.
It’s worth noting, though, that refusal vector ablation isn’t even necessary for this sort of malicious use with Llama 3.1, because Meta also released the base pretrained models without instruction finetuning (unless I’m misunderstanding something?).
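For readers unfamiliar with the technique, here is a minimal sketch of the core idea behind refusal vector ablation, using toy tensors rather than a real model; the layer choice, the contrastive prompt sets, and the hook wiring around an actual Llama checkpoint are all assumptions, not the authors' exact setup.

```python
# Minimal sketch of refusal-direction ablation: estimate a "refusal" direction
# as the difference of mean activations between harmful and harmless prompts,
# then project that direction out of the model's activations.
import torch


def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between harmful and harmless prompt activations."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()


def ablate_direction(activations: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of each activation along the refusal direction."""
    proj = (activations @ direction).unsqueeze(-1) * direction
    return activations - proj


# Toy usage: in practice these would be residual-stream activations collected
# at a chosen layer while running the model on contrastive prompt sets.
d_model = 4096
harmful = torch.randn(32, d_model)
harmless = torch.randn(32, d_model)
r = refusal_direction(harmful, harmless)

acts = torch.randn(8, d_model)
acts_ablated = ablate_direction(acts, r)
```

The point of the sketch is just that the intervention is cheap and purely post-hoc, which is why relying on instruction finetuning as the main safety layer is fragile when weights are openly released.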
Saw that you have an actual paper on this out now. Didn’t see it linked in the post, so here’s a clickable link for anyone else looking: https://arxiv.org/abs/2410.10871