Announcing the Introduction to ML Safety course

TLDR

We’re announcing a new course designed to introduce students with a background in machine learning to the most relevant concepts in empirical ML-based AI safety. The course is available publicly here.

Background

AI safety is a small but rapidly growing field, and both younger and more experienced researchers are interested in contributing. However, interest in the field is not enough: researchers are unlikely to make much progress until they understand existing work, which is very difficult if they are simply presented with a list of posts and papers to read. As such, there is a need for curated AI safety curricula that can get new potential researchers up to speed.

Richard Ngo’s AGI Safety Fundamentals filled a huge hole in AI safety education, giving hundreds of people a better understanding of the landscape. In our view, it is the best resource for anyone looking for a conceptual overview of AI safety.

However, until now there has been no course that aims to introduce students to empirical, machine learning-based AI safety research, which we believe is a crucial part of the field. There has also been no course structured the way a university course usually is, complete with lectures, readings, and assignments; that structure makes a course more likely to be taught at a university. Lastly, and perhaps most importantly, most existing resources assume that the reader has higher-than-average openness to AI x-risk. If we are to onboard more machine learning researchers, this should not be taken for granted.

In this post, we present a new, publicly-available course that Dan Hendrycks has been working on for the last eight months: Introduction to ML Safety. The course is a project of the Center for AI Safety.

Introduction to ML Safety

Philosophy

The purpose of Introduction to ML Safety is to introduce people familiar with machine learning and deep learning to the latest directions in empirical ML safety research and to explain existential risk considerations. Our hope is that the course can serve as the default resource for ML researchers interested in doing work relevant to AI safety, as well as for undergraduates who are interested in beginning research in empirical ML safety. Faculty could also teach the course at their universities.

The course contains research areas that many reasonable people concerned about AI x-risk think are valuable, though we exclude those that don’t (yet) have an empirical ML component, as they aren’t really in scope for the course. Most of the areas in the course are also covered in Open Problems in AI X-Risk.

The course is still very much in beta, and we will make improvements over the coming year. Part of the improvements will be based on feedback from students in the ML Safety Scholars summer program.

Content

The course is divided into seven sections, covered below. Each section has lectures, readings, assignments, and (in progress) course notes. Below we present descriptions of each lecture, as well as a link to the YouTube video. The slides, assignments, and notes can be found on the course website.

Background

The background section introduces the course and also gives an overview of deep learning concepts that are relevant to AI safety work. It includes the following lectures:

  • Introduction: This provides motivation for studying ML safety, with an overview of each of the areas below as well as potential existential hazards.

  • Deep Learning Review: This lecture covers some important deep learning content that is good to review before moving on to the main lectures.

This section includes a written assignment and a programming assignment designed to help students review deep learning concepts.

Hazard Analysis

In Complex Systems for AI Safety, we discussed the systems view of safety. It’s unclear to what extent AI safety is like particular other safety problems, like making cars, planes, or software programs safer. The systems view of safety provides general abstract safety lessons that have been applicable across many different industries. Many of these industries, such as information security and the defense community, must contend with powerful adversarial actors, not unlike AI safety. The systems view of safety thus provides a good starting point for thinking about AI safety. The hazard analysis section of the course discusses foundational systems safety concepts and applies them to AI safety. It includes the following lectures:

  • Risk Decomposition: Rather than abstractly focusing on “risk,” it is often useful to decompose risk into more understandable components: hazards, exposure to hazards, and vulnerability to hazards.

  • Accident Models: In this lecture, we consider many different accident models which have proved useful in safety analysis, and explore their relevance to AI safety.

  • Black Swans: Safety requires being conscious of extreme, unprecedented events, often referred to as black swans. This lecture discusses the implications of such events and the long-tailed distributions they are drawn from.

This section includes a written assignment where students test their knowledge of the section.
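
To make the idea of a long-tailed distribution concrete, here is a minimal sketch (our illustration, not taken from the course materials). It compares how much of the total is contributed by the largest 1% of values under a heavy-tailed Pareto distribution and under a thin-tailed exponential distribution. The shape parameter `alpha = 1.1`, the grid size, and the helper name `top_share` are assumptions chosen for illustration; values are generated deterministically from each distribution's quantile function rather than sampled.

```python
import math


def top_share(values, fraction=0.01):
    """Fraction of the total sum contributed by the largest `fraction` of values."""
    ordered = sorted(values, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


# Evenly spaced probabilities in (0, 1), used with inverse CDFs so the
# example is deterministic (no random sampling involved).
n = 10_000
grid = [(i + 0.5) / n for i in range(n)]

# Pareto quantiles with minimum value 1 and shape alpha (hypothetical choice).
alpha = 1.1
pareto = [(1.0 - u) ** (-1.0 / alpha) for u in grid]

# Exponential quantiles with rate 1, as a thin-tailed comparison.
exponential = [-math.log(1.0 - u) for u in grid]

pareto_share = top_share(pareto)
exp_share = top_share(exponential)

print(f"Top 1% share, Pareto:      {pareto_share:.2f}")
print(f"Top 1% share, exponential: {exp_share:.2f}")
```

In the heavy-tailed case a handful of extreme values account for a large part of the total, while in the thin-tailed case they barely matter. This is one way to see why averages over past observations can say little about the events that matter most for safety.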

Robustness

In Open Problems in AI X-Risk, we covered the relevance of robustness to AI safety. Robustness focuses on ensuring models behave acceptably when exposed to abnormal, unforeseen, unusual, highly impactful, or adversarial events. We expect such events will be encountered frequently by future AI systems. This section includes the following lectures:

  • Adversarial Robustness: Adversarial robustness focuses on ensuring that models behave well in worst-case scenarios. Since we care about building models (e.g. reward models) that do not collapse under optimization pressure, studying adversarial robustness may be useful. We cover the most common methods for adversarial robustness research as well as some directions which may be most relevant to AI safety. For example, we discuss certified robustness, which provides precise mathematical guarantees for how real-world large neural networks will behave in new situations (this line of work could potentially give us guarantees that a model will not behave undesirably or take a treacherous turn in certain situations).

  • Black Swan Robustness: Models may fail to work or behave catastrophically when placed in new distributions and especially when exposed to black swan events. This lecture discusses how to increase robustness to such events, ideally before they ever occur.

This section includes a written assignment where students test their knowledge of the section, and a programming assignment where students implement various methods in adversarial robustness.
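
As a rough illustration of what a worst-case perturbation looks like, the sketch below applies the fast gradient sign method (FGSM), a standard attack from the adversarial robustness literature, to a tiny logistic regression model. The post does not say which methods the lecture or programming assignment use, so treat the choice of FGSM, the toy weights, and the function names as assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(g):
    return (g > 0) - (g < 0)


def fgsm(w, b, x, y, eps):
    """Move each input coordinate by eps in the direction that increases the loss.

    For logistic regression with cross-entropy loss, the gradient of the
    loss with respect to the input x is (p - y) * w.
    """
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Toy model and input (hypothetical values chosen for illustration).
w, b = [2.0, -1.0, 0.5], 0.0
x, y = [0.3, -0.2, 0.1], 1

p_clean = predict(w, b, x)
x_adv = fgsm(w, b, x, y, eps=0.3)
p_adv = predict(w, b, x_adv)

print(f"clean probability of true class:       {p_clean:.2f}")
print(f"adversarial probability of true class: {p_adv:.2f}")
```

Each coordinate moves by at most `eps`, yet the model's prediction for the true class drops below 0.5. Defenses such as adversarial training aim to keep predictions stable under exactly this kind of bounded perturbation.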

Monitoring

We also covered monitoring in Open Problems in AI X-Risk, which we define as research that reduces exposure to hazards as much as possible and allows their identification before they grow. The course covers monitoring in more depth, and includes the following lectures:

  • Anomaly Detection: This lecture covers the use of ML to detect anomalous situations. Good anomaly detectors could be used to detect both failures of AI systems and situations where we might expect an AI system to behave maliciously or in novel unforeseen ways.

  • Interpretable Uncertainty: Humans and other systems need to understand when they can rely on an AI system, and when they can override it. Otherwise, they may take catastrophic actions or fail to take necessary actions. This lecture covers ways to measure and improve methods of uncertainty communication.

  • Transparency: Understanding the inner workings of models could allow us to more easily detect and prevent failures, such as deception. It could also enhance the ability for AI systems to cooperate and monitor each other. This lecture covers existing research into transparency and possible future directions.

  • Trojans: Neural networks trained on poisoned data can act as “Trojan horse models,” behaving normally except when exposed to particular inputs, where they can behave as the attacker wants them to. Trojan detection research aims to develop methods to detect Trojaned models before they pose a threat. We are interested in this area mainly because it may produce methods for detecting treacherous turns.

  • Detecting Emergent Behavior: Behavior not explicitly programmed into a model may arise for instrumental reasons. One example of this is proxy gaming, where a model does something unexpected in response to an imperfect reward function. This lecture covers methods for detecting proxy gaming, using ideas from anomaly detection.

This section includes a written assignment where students test their knowledge of the section, and two programming assignments: one focused on anomaly detection, and the other on Trojan detection.
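
For readers new to anomaly detection, one simple baseline from the literature scores an input by the classifier's maximum softmax probability: the less confident the model, the more anomalous the input is taken to be. The sketch below is an illustration under our own assumptions, not the course's assignment code; the logits, the threshold of 0.5, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def anomaly_score(logits):
    """One minus the maximum softmax probability; higher means more anomalous."""
    return 1.0 - max(softmax(logits))


def is_anomalous(logits, threshold=0.5):
    """Flag an input when its anomaly score exceeds the (assumed) threshold."""
    return anomaly_score(logits) > threshold


# Hypothetical classifier outputs for two inputs.
confident = [8.0, 0.5, -1.0]   # one class clearly dominates
uncertain = [0.2, 0.1, 0.0]    # nearly uniform over classes

for name, logits in [("confident", confident), ("uncertain", uncertain)]:
    print(name, round(anomaly_score(logits), 3), is_anomalous(logits))
```

In practice the threshold would be tuned on held-out data, and stronger detectors exist; this baseline is mainly useful as a point of comparison.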

Alignment

We also cover alignment, which has varying definitions but which we define as reducing inherent model hazards: hazards that result from models (explicitly or operationally) pursuing the wrong goals. The course covers the following areas, which are also covered in Open Problems in AI X-Risk.

  • Honest Models: If models could be made to always be honest, this would represent significant progress in alignment. This lecture covers the distinction between honesty (whether a model’s outputs reflect its internal representations) and truthfulness (whether a model outputs true information), and why some models are currently dishonest.

  • Power Aversion [forthcoming]: This refers to research designed to reduce the power-seeking tendencies of AI systems. Since empirical ML research on this topic is currently very limited, this lecture will be released in the fall.

  • Machine Ethics: It may be useful to have models that have a good understanding of different normative ethical theories, in order to constrain and shape the behavior of other agents. Such models will need to be robust to a wide range of unusual real-life situations, and so will have to go beyond simplistic assessments and trolley problems. An example of this is trying to build a reliable automated moral parliament. This lecture covers current methods that aim to make progress in this area.

This section includes a written assignment where students test their knowledge of the section, and a programming assignment where students use transparency tools to identify inconsistencies with language models trained to model ethics.

Systemic Safety

In addition to directly reducing hazards from AI systems, there are several ways that AI can be used to make the world better equipped to handle the development of AI by improving sociotechnical factors like decision-making ability and safety culture. This section covers a few such areas, which are also covered in Open Problems in AI X-Risk.

  • ML for Improved Decision-Making: This lecture discusses how AI might be used to produce better institutional decision making, which may be necessary to handle the rapidly-changing world that AI will likely induce. For instance, it covers ways in which AI could be used to improve forecasting.

  • ML for Cyberdefense: Some have argued that cybersecurity is potentially quite important for the overall ecosystem to prevent the proliferation of AI technology among nefarious or reckless actors. Misaligned AI systems may also project themselves through cyberattacks. ML systems can be used to reduce the risk of such attacks. This lecture covers current methods in ML for cyberdefense.

  • Cooperative AI: Cooperative AI is currently being studied as a way of reducing the risk of catastrophic conflicts between AI systems. In a world with multiple AI systems, alignment of single systems may not be enough to produce good outcomes. This lecture covers ways in which AI systems could be made to better cooperate.

Additional Existential Risk Discussion

As is typical for a topics course, the last section covers the broader importance of the concepts covered earlier: namely, existential risk and possible existential hazards. We also cover strategies for tractably reducing existential risk, following Pragmatic AI Safety and X-Risk Analysis for AI Research.

  • X-Risk Overview: This lecture gives several broad arguments for why AI may pose an existential risk.

  • Possible Existential Hazards: This lecture covers specific ways in which AI could potentially cause an existential catastrophe, such as weaponization, proxy gaming, treacherous turns, deceptive alignment, value lock-in, and persuasive AI. We err on the side of including more failure modes rather than fewer in this lecture. A fuller description of the failure modes can be found in X-Risk Analysis for AI Research.

  • Safety-Capabilities Balance: When researching AI safety, it is important to make differential progress in safety rather than advance safety as a consequence of advancing capabilities. As such, it is useful to think about having minimal capabilities externalities. This lecture covers how to do so.

  • Risks from Human-AI Coevolution [forthcoming]: This lecture will be based on forthcoming research and will also cover ideas such as mesa-optimization.

  • Review and Conclusion: This lecture concludes the course and discusses how to put the research covered in the earlier parts of the course into practice.

This section includes a final reflection assignment in which students review the course. Notably, the assignment encourages students to evaluate AI safety arguments for themselves.

Next Steps

All course content is available online, so anyone can work through it on their own. The course is currently being trialed by the students in ML Safety Scholars, who are providing valuable feedback.

We are interested in running additional formal versions of this course in the future. If you have the operational capacity to run this course virtually, or are interested in running it at your university, please let us know!

If you notice bugs in the lectures and/or assignments, you can message any of us or email info@centerforaisafety.org.

Acknowledgements

Dan Hendrycks would like to thank Oliver Zhang, Rui Wang, Jason Ding, Steven Basart, Thomas Woodside, Nathaniel Li, and Joshua Clymer for helping with the design of the course, and the students in ML Safety Scholars for testing the course.