Alignment Gaps

Misaligned agendas and terminology among academic, industrial and independent AI alignment research

This post aims to fill some gaps between technical AI Alignment topics and academic AI research. It summarises a quick but informed scouting of academic research papers that are closely connected to four Alignment topics: mechanistic interpretability, safety via debate, value alignment and cooperative AI. This overview should help somebody wanting to do research on any of those four topics to find the related research and expand the technical Alignment frontier.

  • Audience: somebody who knows what research is (typically a PhD) and is familiar with technical topics in AI Alignment

  • Epistemic status: informed guess after going over recent AI Alignment research papers; biased towards academic AI research on formal methods, especially argumentation, and comparatively more ignorant about ML

TLDR: mechanistic interpretability can be seen as a form of formal explainability; AI safety via debate is vague and informal and could benefit from formal argumentative dialogue games (and has in fact been formalised using several formalisms); value alignment (including RLHF and similar) has deep connections to computational social choice theory and reasoning with preferences in AI; cooperative AI might benefit from recent fairness and multi-agent research.

Takeaway

My key takeaway is that technical Alignment research pre-prints often fail to refer to peer-reviewed, formal AI research articles that are highly relevant to the Alignment topic studied but that do not use the typical Alignment terminology. This may be due to several reasons. For one, some academic research pre-dates the Alignment topics: formal verification and explainability via formal methods, for instance, go back well before mechanistic interpretability (or even machine learning (ML) interpretability at large). In other cases, authors of non-peer-reviewed papers may simply be unaware of, or may intentionally omit, related research: the AI safety via debate pre-print, for instance, does not even mention some two decades of research on argumentative debates and dialogue games. In yet other cases, the authors seem to be aware of some relevant research but fail to explore either deeper or broader connections: the authors of What are human values, and how do we align AI to them?, for instance, acknowledge and use some insights from social choice theory and preference aggregation, but dismiss the field as inadequate and do not engage with well-established research on e.g. value-based argumentation or reasoning with preferences.

I consider the failure to properly engage with established academic research when working on technical Alignment topics to be bad, because it risks reinventing the wheel, misses the opportunity to build or improve on existing science, and fails to give credit where it is due. (There are great exceptions to this trend, e.g. the pre-print on open problems in Cooperative AI.)

With this post I hope to fill some of the gaps with pointers to what I believe are relevant works for addressing the research questions within the selected four technical topics of Alignment (I have selected and loosely grouped several topics from those covered and omitted in BlueDot’s 2024 March Alignment course). I will not detail how that academic research can directly help solve Alignment problems, but simply try to argue why it is relevant; interested researchers should then read more, preferably peer-reviewed articles in top-tier publication venues such as AIJ, IJCAI, AAAI and AAMAS.

Gaps

In what follows, I assume familiarity with each technical Alignment topic, giving only a one-sentence description. I then detail which academic research seems relevant to, but overlooked in, work on that topic.

Mechanistic interpretability vs Formal methods for verification and explainability

Mechanistic interpretability concerns “reverse-engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts to provide a granular, causal understanding”[Mechanistic Interpretability for AI Safety: A Review (2024)] (see also A Mathematical Framework for Transformer Circuits (2021), Neel Nanda’s blog and BlueDot’s Alignment course).

Gaining an in-depth, causal understanding of trained ML models' internals as well as their behaviour is the purview of formal methods for verification and explainability. Formal verification encompasses methods for translating ML models (e.g. tree-based models, neural networks (NNs), model-based/model-free reinforcement learning (RL)) into logical formulas so that the latter capture the ML model exactly. Given a model expressed in the language of some formal logic, reasoning techniques such as satisfiability checking are applied to analyse the model and infer its properties with formal guarantees (as opposed to the heuristic or sample-based approximations typical of most ML interpretability).

Formal verification of deep NNs (DNNs) concerns formally verifying properties of DNNs over the full input space, where a property is (typically) a “statement that if the input belongs to some set X, then the output will belong to some set Y”. Formal verification of NN-based agents and multi-agent systems concerns reasoning about agents trying to “bring about a state of affairs within a finite number of steps” or verifying “whether a system remains within safety bounds within a number of steps”, which is known as reachability analysis. Various verification-based analyses can be done using off-the-shelf tools such as Marabou (consult the works of e.g. Clark Barrett, Guy Katz, Alessio Lomuscio). Formal verification also gives rise to formal explainable AI (XAI), which concerns explaining a model's internal reasoning and input/output behaviour with formal guarantees. For example, one can translate an ML model into propositional logic with linear arithmetic and obtain explanations in the form of rules whose premises are feature-value interval pairs that are minimally sufficient for the model to yield the conclusion whenever the input satisfies the premises. These and similar explanations of input/output and internal model behaviour can come with deterministic (e.g. here or here) or provably approximate (e.g. here, here and here) guarantees. So in addition to conceptualising notions of features, motifs, superposition and the like, work on mechanistic interpretability could benefit from e.g. mapping those concepts to formal languages and studying logical relationships among circuits in NNs.
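
To give a flavour of such encodings at toy scale (real tools like Marabou are far more sophisticated and efficient), here is a minimal sketch using the z3 SMT solver: a hypothetical two-input ReLU network with made-up weights is encoded as linear real arithmetic, and we ask whether any input in the unit square can drive the output negative. A satisfying assignment is a formally guaranteed counterexample; unsatisfiability would be a proof that the property holds over the entire input set.

```python
# Minimal sketch (hypothetical network and property): encode a tiny ReLU
# network in linear real arithmetic and check a property with the z3 solver.
from z3 import Real, If, Solver, sat

x1, x2 = Real("x1"), Real("x2")

def relu(z):
    return If(z > 0, z, 0)

# Hidden layer: two ReLU units with made-up weights.
h1 = relu(1.0 * x1 - 2.0 * x2 + 0.5)
h2 = relu(0.5 * x1 + 1.0 * x2 - 0.25)

# Output: a single linear unit.
y = 1.0 * h1 - 1.0 * h2 + 0.1

s = Solver()
# Input set X: the unit square.
s.add(x1 >= 0, x1 <= 1, x2 >= 0, x2 <= 1)
# Negate the property "the output is non-negative on all of X":
# any satisfying assignment is a counterexample with a formal guarantee.
s.add(y < 0)

if s.check() == sat:
    print("Property violated; counterexample:", s.model())
else:
    print("Property holds for every input in X.")
```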

In general, formal method applications in AI may be seen as complementary to mechanistic interpretability. For instance, some view mechanistic interpretability as a new type of XAI approach for state-of-the-art generative AI models. Others claim that it is an instance of more general approaches, such as causal abstraction. In any case, both mechanistic interpretability and certain research in formal methods aim at actually understanding what ML models do and how far we can trust them. A typical concern with formal methods is that of scaling the tools to large ML models. While this concern has some basis, modern verification tools can be optimised to verify properties of NNs with 100Ms of parameters in seconds, and can likely scale much further if adequate resources are invested [personal communication]. Given the compute spent on training models with millions and billions of parameters, it is absurd that researchers and engineers are often reluctant to spend a fraction of those resources on formal verification and explainability of the resulting models.

AI safety via debate vs Argumentative dialogue games

Debate is a method of scalable oversight for aligning agents via a game in which, given a question or proposal, two agents exchange statements and a human judge eventually decides which agent has provided the best information (see AI safety via debate, Debating with More Persuasive LLMs Leads to More Truthful Answers).

Two or more agents exchanging arguments in a dialogical/dialectical/debate fashion has long been the purview of argumentative dialogue games. In these, autonomous agents put forward claims using their knowledge bases, according to protocol rules, with the purpose of optimising for some outcome of the dialogue, defined by dialogue states and winning criteria. The notions of claims, (legal) moves, protocol rules, outcomes and others are formally defined. For instance, claims can be formalised in some form of logic or natural language; moves can be of assertion, challenge, inquiry or concession types; protocol rules govern which agents can move, and how, in response to previous claims; and outcomes can be the claims accepted by all the agents according to some argumentation- or game-theoretic criteria at the end of the dialogue. Purposes of argumentative dialogue games, such as persuasion, negotiation and deliberation, as well as strategies for optimising for different outcomes, such as sound inferences or truthful accepted claims, have been thoroughly studied. (Consult the papers from years 1997, 2000, 2002, 2003, 2005 and subsequent research on argumentation protocols, persuasion dialogues, deliberation dialogues, as well as the works of e.g. Trevor Bench-Capon, Paul Dunne, Henry Prakken, Francesca Toni.)
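
As a minimal sketch of these ingredients (the arguments, attacks and protocol below are made up for illustration and do not follow any specific published system), consider a two-agent persuasion dialogue over an abstract argumentation framework, where the winning criterion is membership of the initial claim in the grounded extension of the attack graph built up during the dialogue:

```python
# Minimal sketch of a two-agent dialogue game over an abstract argumentation
# framework (made-up arguments, attacks and protocol; for illustration only).

def grounded_extension(args, attacks):
    """Iteratively accept arguments all of whose attackers are defeated,
    and defeat arguments attacked by an accepted argument."""
    accepted, defeated = set(), set()
    changed = True
    while changed:
        changed = False
        for a in args:
            if a in accepted or a in defeated:
                continue
            attackers = {b for (b, c) in attacks if c == a}
            if attackers <= defeated:        # all attackers already defeated
                accepted.add(a)
                changed = True
            elif attackers & accepted:       # attacked by an accepted argument
                defeated.add(a)
                changed = True
    return accepted

# Toy content: the proponent's claim A and the counter-arguments agents know.
args = {"A", "B", "C"}
attacks = {("B", "A"),   # the opponent's B attacks the claim A
           ("C", "B")}   # the proponent's C attacks B, defending A

# Protocol: assert, challenge, defend; no further legal moves, so it terminates.
dialogue = [("proponent", "asserts", "A"),
            ("opponent", "challenges with", "B"),
            ("proponent", "defends with", "C")]
for agent, move, arg in dialogue:
    print(agent, move, arg)

# Outcome / winning criterion: the claim is won iff it is in the grounded extension.
print("Claim A accepted:", "A" in grounded_extension(args, attacks))
```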

This well-established research should undoubtedly inform any work on agents interactively exchanging statements to be judged by humans or other agents. This is primarily because argumentative dialogue games provide formal frameworks for what information agents can exchange and how, and for what it means to establish the goodness or usefulness of that information. In fact, the informal exposition in AI safety via debate has recently been formalised as argumentative debate: one variant in an argumentation-protocol-inspired mathematical framework, another in an abstract-argumentation-based moral debate. In other words, it was formalised using precisely the kind of framework, namely argumentative debates, that the authors of the original pre-print could have drawn on. Yet argumentation is almost certainly not the only research area that could inform such scalable oversight work: multi-agent interaction is a huge research domain concerned with some of the same problems of formalising protocols for multi-agent agreement (consult the work of e.g. Nick Jennings).

Value alignment vs Computational social choice and Reasoning with preferences

Value alignment concerns aligning AI preferences with human preferences. Here I lump together work on RL from human/AI feedback (RL(H/AI)F), including constitutional AI and direct preference optimisation, as well as work on aligning AI to human values.

Computational social choice theory concerns the aggregation of individual preferences towards a collective choice. It is a field in its own right, spanning subjects such as economics and psychology, and is not limited to AI. Typical topics are voting, fair allocation of goods, coalition formation and, pertinently to AI, preferences: learning them, processing them and reasoning with them. Learning preferences (aka preference elicitation) concerns obtaining utilities or preferences from autonomous agents, typically humans, often indirectly about complex issues via simplified choices, for instance crowdsourcing human opinions about the ethics of self-driving cars via questions on Trolley problems. The closely connected preference processing and modelling (aka preference aggregation) concern formally representing the learned preferences and combining them into a coherent whole; this is far from trivial even in the case of binary preferences. Relatedly, reasoning with preferences concerns how to make decisions given the combined preferences of multiple agents, and how to resolve conflicting preferences or revise them. These modelling and reasoning issues are of particular interest in argumentation, e.g. value-based argumentation.
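
To see why aggregation is non-trivial even for simple rankings, here is a small sketch with made-up ballots in which common aggregation rules disagree: plurality elects A, while both the Borda count and pairwise majority favour B.

```python
# Minimal sketch (made-up ballots) showing that standard aggregation rules
# can disagree on the collective choice.
from collections import Counter
from itertools import combinations

# Each entry is (number of voters, ranking from most to least preferred).
profile = [(3, ["A", "B", "C"]),
           (2, ["B", "C", "A"]),
           (2, ["C", "B", "A"])]
candidates = ["A", "B", "C"]

# Plurality: count only first choices; A wins with 3 first-place votes.
plurality = Counter()
for n, ranking in profile:
    plurality[ranking[0]] += n
print("Plurality scores:", dict(plurality))

# Borda count: m-1 points for first place, m-2 for second, ...; B wins.
borda = Counter()
for n, ranking in profile:
    for pos, cand in enumerate(ranking):
        borda[cand] += n * (len(candidates) - 1 - pos)
print("Borda scores:", dict(borda))

# Pairwise majority: B beats both A and C head-to-head, i.e. B is a
# Condorcet winner that plurality fails to elect.
for x, y in combinations(candidates, 2):
    x_wins = sum(n for n, r in profile if r.index(x) < r.index(y))
    y_wins = sum(n for n, r in profile if r.index(y) < r.index(x))
    print(f"{x} vs {y}: {x_wins}-{y_wins}")
```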

This should sound relevant to the value alignment topic (as outlined above). Indeed, RLHF has recently been combined with argumentation, purportedly to improve the reward model. There have been discussions of the connections of RLHF to preference optimisation and of the issues thereof. Social choice theory has been suggested as a complement to RLHF (acknowledged also in a recent comprehensive survey on AI Alignment) and for knowledge extraction from Large Language Models. There have also been attempts at value alignment via preference aggregation and value-based reasoning. All these are attempts at applying more formal methods to the problem of value alignment.

Last but not least, a topic where established academic research is well appreciated, but may benefit from a broader scope.

Cooperative AI vs AI fairness

Cooperative AI concerns multi-agent cooperation for the good of humanity. The pre-print on open problems in Cooperative AI actually overviews the related work well in Section 2.2, mostly on game theory and multi-agent systems, including the following multi-agent topics: automated argumentation and negotiation, interaction protocols, knowledge representation, reasoning and planning, and decision making. The authors also acknowledge the immense overlap with computational social choice and other areas of computing research. However, they point out that “open, heterogeneous, and scalable multi-agent systems require learning agents, including agents who learn to adapt and cooperate with others”, so that, in light of recent developments in learning, research in Cooperative AI needs to be reinvigorated.

Cooperative AI may benefit from modern research on AI (aka algorithmic) fairness. AI fairness concerns how algorithmic decisions relate to formalised notions of fair decision making, such as individual fairness (similar individuals receive similar decisions) and non-discrimination (similar groups receive, on the whole, similar decisions). One of the landmark results on the (im)possibility of fairness roughly states that (under standard algorithmic assumptions) individual fairness and non-discrimination cannot be simultaneously guaranteed. Researchers have, however, defined multiple nuanced metrics for assessing algorithmic fairness and have pushed the limits of fairness in theory and the possibility of fairness in practice. Still, both the inherently technical and the conceptual limitations of AI fairness continue to be studied and may inform work on Cooperative AI.
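
As a flavour of such metrics, here is a minimal sketch (on made-up decisions, outcomes and group labels) of two standard group-fairness quantities: the demographic parity difference and the equal-opportunity (true-positive-rate) gap.

```python
# Minimal sketch (made-up data) of two standard group-fairness metrics:
# demographic parity difference and equal-opportunity (TPR) gap.

# Binary decisions y_hat, true outcomes y, and a binary group attribute g.
y_hat = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y     = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
g     = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"]

def positive_rate(pred, cond):
    """Fraction of positive decisions among the indices where cond holds."""
    idx = [i for i in range(len(pred)) if cond(i)]
    return sum(pred[i] for i in idx) / len(idx)

# Demographic parity: positive-decision rates should be similar across groups.
dp_a = positive_rate(y_hat, lambda i: g[i] == "a")
dp_b = positive_rate(y_hat, lambda i: g[i] == "b")
print("Demographic parity difference:", abs(dp_a - dp_b))

# Equal opportunity: true-positive rates should be similar across groups.
tpr_a = positive_rate(y_hat, lambda i: g[i] == "a" and y[i] == 1)
tpr_b = positive_rate(y_hat, lambda i: g[i] == "b" and y[i] == 1)
print("Equal-opportunity gap:", abs(tpr_a - tpr_b))
```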

Finally, a note on the more general theme of academic safe and trustworthy AI research.

AI Safety/​Alignment vs Safe and Trustworthy AI

Academic research on safe autonomous systems often concerns safe RL agents, for instance modifying RL agent rewards to benefit others in the environment, provably safe RL or probably approximate safety verification. Research on cooperating agents tackles questions such as “What aspects of [agent] selection process can we adjust in practice to limit agents’ incentives to misreport their features?” or, more generally, “Given a set of different agent models, design mechanisms that induce all agents, or as many agents as possible, to invest effort in desirable actions” (consult further works of e.g. Ariel Procaccia). Responsibility research on trustworthy autonomous systems covers the topics of logics for responsible reasoning, multi-agent planning, value-based design, ethics of AI and human-agent collectives, among others. All this is generally considered part of trustworthy AI, which concerns creating safe and trustable AI systems and greatly overlaps with Safety/Alignment agendas. Trustworthy AI is a huge umbrella term (see NIST or the European Commission) encompassing many AI research areas, most notably work on reliability, verification and validation, interpretability/explainability, and security and privacy, among others. There are also adjacent research areas, not necessarily belonging to trustworthy AI but still relevant to Alignment: for example, various impossibility results (on the solvability of problems in AI and related fields) as well as the notable sub-field of neuro-symbolic AI, including its approaches to Alignment. Many of these resources could benefit those interested in technical Alignment research beyond the four topics discussed above.

(I have notably omitted the following Alignment topics: agent foundations, as too broad and too vague to cover, thinking that anything in autonomous agents research (e.g. BDI agents) may be relevant to it; shard theory, as too recent and seemingly not yet studied in academia; and weak-to-strong generalisation and Eliciting Latent Knowledge, which I simply did not get to.)

Acknowledgments

Huge thanks to Raymond Joseph Gabriel Tana for very helpful comments on the initial draft of the post.

The views expressed here are my own and do not represent those of my employer Ericsson, Inc.