Classification of AI safety work
Here I propose a systematic framework for classifying AI safety work. It is a matrix whose first dimension is the system level:
A monolithic AI system, e.g., a conversational LLM
AGI lab (= the system that designs, manufactures, operates, and evolves monolithic AI systems and systems of AIs)
A cyborg, human + AI(s)
A system of AIs with emergent qualities (e.g., https://numer.ai/, but in the future we may see more systems like this, operating at a larger scope, up to a fully automatic AI economy; or a swarm of CoEms automating science)
A human+AI group, community, or society (scale-free consideration, supports arbitrary fractal nestedness): collective intelligence, e.g., The Collective Intelligence Project
The whole civilisation, e.g., Open Agency Architecture, or the Gaia network
Another dimension is the “time” of consideration:
Design time: research into how the corresponding system should be designed (engineered, organised), considering its functional properties (“capability”, quality of decisions), adversarial robustness (= misuse safety, memetic virus security), and security. For AGI labs, this means org design and charter.
Manufacturing and deployment time: research into how to create the desired designs of systems successfully and safely:
AI training and monitoring of training runs.
Offline alignment of AIs during (or after) training.
AI strategy (= research into how to transition to the desirable civilisational state, i.e., the desired design).
Designing upskilling and educational programs for people to become cyborgs also belongs here (= designing efficient procedures for manufacturing cyborgs out of people and AIs).
Operations time: ongoing (online) alignment of systems on all levels to each other, ongoing monitoring, inspection, anomaly detection, and governance.
Evolutionary time: research into how the (evolutionary lineages of) systems at the given level evolve long-term:
How the human psyche evolves when it is in a cyborg
How humans will evolve over generations as cyborgs
How AI safety labs evolve into AGI capability labs :/
How groups, communities, and society evolve.
Designing feedback systems that don’t let systems “drift” into undesired states over evolutionary time.
The relevant system property here is flexibility of values, i.e., the opposite of value lock-in (Riedel (2021)).
IMO, it (sometimes) makes sense to think about this separately from alignment per se. Systems could be perfectly aligned with each other but drift into undesirable states and not even notice this if they don’t have proper feedback loops and procedures for reflection.
There are 6*4 = 24 cells in this matrix; almost all of them have something interesting to research and design, and none of them is “too early” to consider.
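To make the matrix concrete, here is a minimal sketch in Python (the enum names, comments, and the MATRIX variable are my own, purely illustrative choices, not part of the framework itself) of the two dimensions and the 24 cells they span:

```python
from enum import Enum, auto
from itertools import product

class SystemLevel(Enum):
    MONOLITHIC_AI = auto()   # e.g., a conversational LLM
    AGI_LAB = auto()         # the org that designs, manufactures, operates, evolves AIs
    CYBORG = auto()          # human + AI(s)
    SYSTEM_OF_AIS = auto()   # e.g., numer.ai-style systems with emergent qualities
    HUMAN_AI_GROUP = auto()  # group, community, or society (scale-free)
    CIVILISATION = auto()    # e.g., Open Agency Architecture, the Gaia network

class Time(Enum):
    DESIGN = auto()
    MANUFACTURING = auto()   # manufacturing and deployment
    OPERATIONS = auto()
    EVOLUTIONARY = auto()

# The full matrix: 6*4 = 24 cells, each a (system level, time) pair.
MATRIX = list(product(SystemLevel, Time))
assert len(MATRIX) == 24
```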
Richard’s directions within the framework
Scalable oversight: (monolithic) AI system * manufacturing time
Mechanistic interpretability: (monolithic) AI system * manufacturing time, also design time (e.g., in the context of the research agenda of weaving together theories of cognition and cognitive development, ML, deep learning, and interpretability through the abstraction-grounding stack, interpretability plays the role of empirical/experimental science work)
Alignment theory: Richard phrases this vaguely, but since he references primarily MIRI-style work, he seems to mean “(monolithic) AI system * design, manufacturing, and operations time”.
Evaluations, unrestricted adversarial training: (monolithic) AI system * manufacturing, operations time
Threat modeling: system of AIs (rarely), human + AI group, whole civilisation * deployment time, operations time, evolutionary time
Governance research, policy research: human + AI group, whole civilisation * mostly design and operations time.
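Continuing the illustrative sketch above (again, my own hypothetical encoding, not a canonical one), these directions can be tagged as sets of cells:

```python
# Each research direction occupies one or more cells of the matrix above.
Cell = tuple[SystemLevel, Time]

directions: dict[str, set[Cell]] = {
    "scalable oversight": {(SystemLevel.MONOLITHIC_AI, Time.MANUFACTURING)},
    "mechanistic interpretability": {
        (SystemLevel.MONOLITHIC_AI, Time.MANUFACTURING),
        (SystemLevel.MONOLITHIC_AI, Time.DESIGN),
    },
    "governance / policy research": {
        (level, time)
        for level in (SystemLevel.HUMAN_AI_GROUP, SystemLevel.CIVILISATION)
        for time in (Time.DESIGN, Time.OPERATIONS)
    },
}

# Sanity check: every tagged cell is a valid cell of the matrix.
assert all(cell in MATRIX for cells in directions.values() for cell in cells)
```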
Takeaways
To me, it seems almost certain that many current governance institutions and democratic systems will not survive the AI transition of civilisation. Bengio recently hinted at the same conclusion.
Human+AI group design (scale-free: small group, org, society) and civilisational intelligence design must be modernised.
Richard mostly classifies this as “governance research”, which carries the connotation that it is a sort of “literary” work rather than science; I disagree. There is a ton of cross-disciplinary hard science to be done on group intelligence and civilisational intelligence design: game theory, control theory, resilience theory, linguistics, political economy (rebuilt as a hard science, of course, on the basis of resource theory, bounded rationality, economic game theory, etc.), cooperative reinforcement learning, etc.
I feel that the design of group intelligence and civilisational intelligence is an area under-appreciated by the AI safety community. Some people do work on this (Eric Drexler, davidad, the cip.org team, ai.objectives.institute, the Digital Gaia team, and the SingularityNET team, although the latter are less concerned about alignment), but far more work is needed in this area.
There is also a place for “literary”, strategic research, but I think it should mostly concern the deployment time of group and civilisational intelligence designs, i.e., the questions of transitioning from current governance systems to next-generation, computation- and AI-assisted systems.
Also, the operations-time and evolutionary-time concerns of everything (AI systems, systems of AIs, human+AI groups, civilisation) seem to be under-appreciated and under-researched: alignment is not a “problem to solve” but an ongoing, manufacturing-time and operations-time process.