Introduction
This series of posts presents our work on the frameworks of “Positive Attractors” and “Inherently Interpretable Architectures” during the AI Safety Camp 2023.
(Note: Sections written by different team members are marked with their initials in square brackets, since the research strands we investigated were more distinct than a fully unified presentation could capture.)
We believe that the AI Alignment community still largely lacks clear frameworks or methodologies for making progress on the Alignment Problem. While there are some frameworks aimed at concrete progress towards a solution, most of the well-regarded work is still about becoming less confused, whether about current AIs (mechanistic interpretability) or about the concepts involved in reasoning about synthetic intelligence in the first place (agent foundations).
It was from this context that we set out to assess the suitability of two additional broad frameworks of progress, trying to understand how they would inform and shape research, and how they contrast with existing agendas.
The research on this project was predominantly conceptual, relying on discussion and first-principles reasoning aided by literature review. Team members were encouraged to form independent views about the frameworks during the initial explorations, so that the later stages of the project could feature meaningful discussion and convergence.
To briefly introduce the two frameworks:
Positive Attractors build on the idea that cognitive systems or their training processes can be imbued with features that make them resistant to broad classes of undesirable change: features one can think of as positive attractors that pull the system into a safe region of algorithmic space. The intention is that such positive attractors can make systems resilient against failure modes we have not yet fully understood, and potentially enable some research approaches that would otherwise be discarded early.
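As a purely illustrative caricature of the attractor intuition, one can picture a training process as gradient descent on a task loss combined with a hypothetical "safety" term whose minimum sits in a safe region of parameter space: if that term's basin is wide and deep enough, trajectories started from many different initializations get pulled towards (and stay near) the safe region despite the pull of the task objective. The sketch below is only a toy model of this picture; the names and quadratic forms (SAFE_POINT, grad_safety, and so on) are illustrative assumptions, not a proposal for how such an attractor would actually be implemented in a real system.

```python
# Toy 1-D sketch of the "positive attractor" picture (illustrative assumptions only).
# Training is modelled as gradient descent on a task loss plus a hypothetical
# safety term whose minimum sits at a "safe" parameter value.
import numpy as np

SAFE_POINT = 0.0      # hypothetical safe region of parameter space
TASK_OPTIMUM = 3.0    # where the task loss alone would drag the parameters

def grad_task(x):
    # Gradient of a simple quadratic task loss centred away from the safe point.
    return x - TASK_OPTIMUM

def grad_safety(x, strength=2.0):
    # Gradient of a quadratic "attractor" term centred on the safe point;
    # 'strength' controls how strongly the basin pulls the parameters back.
    return strength * (x - SAFE_POINT)

def train(x0, steps=200, lr=0.05, strength=2.0):
    # Plain gradient descent on the combined objective.
    x = x0
    for _ in range(steps):
        x -= lr * (grad_task(x) + grad_safety(x, strength))
    return x

for x0 in [-5.0, 0.0, 5.0]:
    print(f"init {x0:+.1f} -> final {train(x0):+.3f}")
# All runs settle near TASK_OPTIMUM / (1 + strength) = 1.0: the attractor term
# keeps the end state much closer to the safe point than the task loss alone
# would (which would end at 3.0), regardless of where training started.
```

The point of the toy is only that a sufficiently strong basin shapes where very different training trajectories end up; the open research question is what features of real cognitive systems or training processes could play the role of such a basin.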
Inherently Interpretable Architectures describe an approach complementary to mechanistic interpretability research. Rather than relying on the hope that mechanistic interpretability will yield an adequate understanding of modern systems in time for us to base clear solutions and arguments on its findings, it seems sensible to also pursue the path of designing and studying systems that are inherently highly interpretable, and of scaling them up without losing that interpretability.
We believe that exploring relatively independent research bets for alignment is especially relevant today, given that many people are transitioning into AI Safety research after witnessing the speed of progress indicated by GPT-4, and that there is no clear consensus on what an actual solution should look like. We don’t know where the critical insights and solutions will be found, so it seems prudent to launch a multitude of probes into this territory, probes that can ideally update each other frequently about how attention and resources should be allocated at the field level.
Authors
Robert Kralisch (Research Lead), Anton Zheltoukhov (Team Coordinator), David Liu, Johnnie Pascalidis, Sohaib Imran