Video Intro to Guaranteed Safe AI


Many thanks to Evan Miyazono, Nora Ammann, Philip Gubbins, and Judd Rosenblatt for their valuable feedback on this video.

We created a video introduction to the paper Towards Guaranteed Safe AI to highlight its concepts[1] and make them more accessible through a visual medium. We believe the framework introduced in the paper, and the broader pursuit of guaranteed safe AI, are important but overlooked areas of alignment research that could have significant implications if the approach proves viable. Academic publications don’t always reach a broad audience, so by presenting the paper’s ideas in a different format, we aim to increase their visibility.

This video is part of our ongoing effort to develop educational materials that encourage and inspire people to engage in alignment research. For example, you can also check out our brief Autoformalization Tutorial, where we reproduce the methods from the paper Autoformalization with Large Language Models, or our pedagogical implementation of Reinforcement Learning from Human Feedback (RLHF).

Guaranteed Safe AI[2] was one of the topics we identified as needing more attention in our ‘Neglected Approaches’ post, and we’re excited about the significant developments in this area since then. We believe the range of plausible research directions that could contribute to solving alignment is vast, and that, given the still-evolving state of alignment research, only a small portion of it has been adequately explored. If there’s a chance that the current dominant research agendas have reached local maxima in the space of possible approaches, we suspect that pursuing a diverse mix of promising neglected approaches would provide greater exploratory coverage of that space.

  1. ^

    See also this introductory post.

  2. ^

    although at the time we used the term “provably safe” since the umbrella term “guaranteed safe” hadn’t been coined yet
