United We Align: Harnessing Collective Human Intelligence for AI Alignment Progress
Summary: Collective Human Intelligence (CHI) represents both the current height of general intelligence and a model of alignment among intelligent agents. However, CHI's efficiency and inner workings remain underexplored. This research agenda proposes six experimental directions to enhance and understand CHI, in pursuit of additional runway for alignment solutions and, possibly, a paradigm shift akin to the one neural networks sparked in early AI. Through playful, iterative, and collaborative experiments, this broad and exploratory approach can hopefully bring significant incremental progress towards AI alignment while shedding light on an often overlooked area of study.
Introduction
This research agenda is about accelerating alignment research by increasing our Collective Human Intelligence (CHI) while exploring the structure of interhuman alignment along the way. A collective intelligence can be increased by improving the quantity, quality, and coordination of its nodes. At the same time, CHI is a form of interhuman alignment that may contain mechanics that can prove useful for AI alignment. Here “interhuman alignment” refers to the instinctual and emotional desire of humans to help each other. It does not refer to calculated, conscious trade.
In this document, I propose six research directions to explore the potential of CHI to accelerate progress on the alignment problem. Feel free to skip straight to the proposed experiments, or walk with me through definitions, framing, and background literature. Lastly, there will be a short discussion of the possible failure modes of this research agenda. If you find yourself intrigued and would like to discuss these ideas with me in more detail, feel free to reach out.
Definitions: Parameterized Intelligence
Intelligence has at least nine different competing definitions. To sidestep a semantic rabbit hole, I will use the word “intelligence” to specifically refer to how good something is at “mathematical optimization”, or “the selection of the best element based on a particular criterion from a set of available alternatives”[1]. Such intelligence has a direction (motivation), a depth (generality), a type (human or artificial), and a multi-dimensional property of modularity (collectivity). I won’t argue that this is the ground truth of intelligence, but it is my current working hypothesis about its conceptual structure.
Let’s have a quick look at each property.
Motivation is assumed to be uncorrelated with intelligence, as stated in the orthogonality thesis. As such, it underlies much of the alignment problem itself. If intelligence were somehow related to motivation, then we could leverage that relationship to aim AI where we want it to go. Unfortunately, nearly everything is an equally valid target for optimization. Thus, the art and science of aiming an artificial intelligence at the goal of (at the very least) not killing us all is what I refer to as “the AI alignment problem”. Throughout this research agenda, I’m agnostic toward value versus intent alignment implementations. Generating runway for solving the problem benefits both approaches, while understanding interhuman alignment will most plausibly result in us recreating whatever implementation of alignment our own brains happen to run.
Generality of intelligence can be conceptualized as the depth parameter of an optimization process. By depth I mean the deeper patterns in reality that generalize farther. Generality is a continuous or rank-order categorical variable with a very large number of categories, such that one can learn ever more general patterns (up to some unknown limit). It is the opposite of narrow intelligence—the ability to optimize over a narrow set of inputs to a narrow set of outputs. For instance, learning very specifically that a blue ball may fall when pushed off a table is an instance of relatively narrow intelligence (provided one is surprised or confused when green balls or blue cubes follow the same pattern). Yet one may learn that items in general fall when pushed off a table, independent of color or shape. But then you might discover that some items float, and some might not descend at all. More generally still, one might discover the speed at which things fall, and how the weight of the object and the height of the fall play into the calculation. Lastly, one may learn how gravity works more generally, and how the density of whatever celestial body you are on influences how quickly things fall versus soar into space. At each step, the pattern is farther removed from any specific current observation but generalizes to a far wider set of possible future observations. Generality of intelligence is thus the depth of the patterns that the optimization process is able to learn and use. Deeper patterns generalize farther.
Type of intelligence is simply the optimization algorithm that is generating the optimization process. Different optimization algorithms score differently on performance, depth, and modularity. And each one can be aimed in any direction, as per the orthogonality thesis. Trivially, human intelligence is of the type found in humans. We don’t know what algorithms our brains actually run, but we refer to them as being of human type. In contrast, artificial intelligence refers to the type of intelligence that emerges from the optimization algorithms that we build into machines. Notably, the type distinction will dissolve as soon as we are able to artificially recreate the algorithms running in our minds.
Collectivity of intelligence refers to the modularity of the object that is performing the optimization process. We tend to think of intelligence as a singular entity, but a group of intelligent entities can form a collective intelligence. In this manner, a single human has a singular intelligence, and its generality is formalized as the g factor. A group of humans is a collective intelligence, and the generality of the collective intelligence is formalized as the c factor (see the Background section for more details). Collective intelligence is underexplored compared to singular intelligence. Considering that a collective intelligence is essentially a distributed network where every subset (of size greater than one) is itself another collective intelligence, I hypothesize that it consists of at least the properties of size (number of nodes), heterogeneity (variance of the properties of the individual nodes), and connection (how the nodes are wired together). Or in other words: quantity, quality, and coordination.
This research agenda is a proposed exploration of the largest and most intelligent collective in existence: the Collective Human Intelligence (CHI). If we can understand it and increase it, we might be able to solve the alignment problem before our inevitable deadline.
Framing: Humanity is far more intelligent than a single human
Humanity can be conceptualized as one distributed organism with individual humans as its cells. Our role in society and contribution to humanity essentially describe what type of “tissue” we are part of at any one time. Some of us, some of the time, are part of the brains of humanity—the Collective Human Intelligence. In it, individual humans function as “neurons”, except far more independently and with far less specialization and integration than neurons in the human brain. And while you or I may contribute to science, or may help build a rocket, or assist in curing a disease, it’s our collective intelligence that makes these achievements possible at a scale far surpassing any individual’s abilities.
Thus Collective Human Intelligence (CHI) is the most general intelligence in existence right now[2], and it has grown to its current size because enough humans are sufficiently aligned with each other enough of the time to make this additive, distributed, continuous cognition possible. It is in a way a model of alignment, just like human brains are a model of intelligence. And similarly to how the deep learning revolution started with a seed of inspiration from human brains consisting of interconnected neurons, I wonder what seeds of inspiration we may glean if we pry apart the coordination that is the bedrock of CHI. However, interhuman alignment is not driven by a single instinct or mechanic. There seem to be different interacting systems that have emergently led to human cooperation in its different forms, and each of these forms possibly has its own algorithm. Examples may be the algorithm that makes a mother willing to die for her child, the algorithm that governs our willingness to change our utility function with drugs, and the algorithm that instantiates groupthink so strong that even people who are aware of the mechanic struggle to resist it. What might happen if we came to understand these algorithms? Might they point us in the right direction on how to solve alignment? Potentially, understanding the computational basis of interhuman alignment may offer us a starter template for AI alignment, just like neural networks did for artificial intelligence.
But that is not all. Once we understand more about how our own brains align with themselves and each other, we might be able to use these insights to enhance our current collective intelligence. Through tooling that was never before possible, we may be able to instantiate human-based computation on a scale that we have never seen before, or even just increase our ability to coordinate across a research problem like alignment. Studying CHI can hopefully make us better at solving alignment by making ourselves smarter and AI more aligned.
Background
The Collective Human Intelligence research agenda is based on my background as a computational psychologist, my growing knowledge of the current alignment literature, and my exploration of current field building efforts. The following section discusses the first two elements in detail, while the background literature on specific field building efforts is covered per experiment in the Experiments section.
Why Understand & Enhance CHI?
The inspiration for this research agenda can be roughly summed up as the intersection between MIT’s work in the field of Collective Intelligence, the human-based computation game FoldIt, and my early attempts to model human minds based on game behavior.
Collective Intelligence is an underexplored research field concerned with understanding and enhancing the collaborative cognitive power of groups of humans. Woolley et al. (2010) discovered that group performance on cognitive tasks is more strongly correlated with the average social sensitivity (r = 0.26, P = 0.002) of the members of the group than with their average intelligence (r = 0.15, P = 0.04) or the maximum individual intelligence (r = 0.19, P = 0.008). (Note, however, that a later replication attempt failed to reproduce these results: https://www.sciencedirect.com/science/article/abs/pii/S0160289616303282.) Social sensitivity here basically denotes our intuitive ability to coordinate group performance. Tying this back to our framework of CHI, Woolley et al. didn’t look at the quantity of the nodes in a collective, but did find that the coordination of these nodes mattered more than the average quality of the nodes or the highest individual quality of any node. When we then look at the real world, we indeed see that coordination is likely the key bottleneck in scientific inquiry. Kirchner writes about the challenges of scaling academia, and how research groups rarely scale in productivity past a size of roughly 25 members. This mechanic is in turn reminiscent of Coase’s theory of the firm, described in his essay The Nature of the Firm (1937). He posits that in a free market there is a set transaction cost between individuals, but that this cost is vastly lowered when one organizes a group of humans into a hierarchical system. However, as this organization grows, the transaction costs between individuals increase again, until there is no benefit anymore compared to free market cooperation. At this point, the organization cannot increase its productivity by adding more members. And even though the number of members that can add to productivity seems to be higher for firms than for research groups, both run into this ceiling effect. Thus it seems that the efficiency of coordination is the bottleneck when harnessing the power of CHI.
Human-based computation platforms may constitute an underutilized breakthrough for increasing this ability to coordinate our CHI. The reason they may plausibly scale farther than current organizations is that they create aligned incentives between individuals and ensure constant transaction costs between individuals, independent of group size. The most successful human-based computation platform to date is a game called FoldIt, developed by researchers at the University of Washington. Players were reward-shaped into solving protein folding challenges that had eluded specialized researchers for years. Many published discoveries were made possible through the group cognition of the players, and their cooperation and output scaled linearly with the number of players. Over time, however, AlphaFold ended up outperforming FoldIt players at solving for existing protein structures. The FoldIt developers responded by integrating AlphaFold as a game feature on their platform. Consequently, AlphaFold now takes care of discovering existing protein structures while FoldIt players provide the most productive search algorithm for discovering new protein structures (Koepnick et al., 2019).
The basic design principle in human-based computation is that incentives are kept aligned by offering all players the same reward schema (the gameplay, or alternatively, payment), and productivity grows linearly with group size because all players perform modular, local computations that stack into global solutions. Thus this form of collective intelligence can be applied to any problem that can be reward-shaped effectively and decomposed into locally human-computable areas that stack modularly. Note that this type of structure is similar to the human-feedback approach used in training GPT models at OpenAI, or to solving CAPTCHAs to generate image-recognition labels. However, FoldIt is different in showing us that human-based computation can be used to solve problems that are otherwise very difficult for any one researcher to solve. It extends the finding by Woolley et al. with a possibly more impressive result: that a group can have a higher general intelligence (c) than any individual in that group (g). If this hypothesis is true, then there might be ways to scale CHI past any performance we have seen so far, provided we figure out the right way to reward-shape the human actors and to decompose their given task.
Computational Psychology (also commonly referred to as Computational Cognition) aims to model human behavior algorithmically and measures the success of such models by how well they predict human behavior. Currently the field is small and focused mainly on how human learning occurs. During my PhD I focused on computationally modeling human behavior in video games and on predicting demographic, personality, and motivational traits in other domains from the data. I think this approach may be well-suited to exploring interhuman alignment, such as questions about what algorithms correctly describe human self-sacrifice, identity entanglement, or willingness to wirehead. We might not be able to fully understand the human brain neurologically before we hit AGI, but we might be able to model alignment-relevant properties of the human brain algorithmically and then use these insights to figure out how to align AI.
Related Alignment Approaches
The Collective Human Intelligence (CHI) approach falls into a broader cluster of human-inspired alignment strategies. Below, the CHI agenda is briefly compared and contrasted with RLHF-related approaches, Factored Cognition-based approaches, Shard Theory, Brain-like AGI, and the new Cyborgism movement.
Reinforcement Learning from Human Feedback (RLHF) and Recursive Reward Modeling (RRM) both rely on human-based computation to extract human evaluations, and both essentially aggregate evaluations from the collective. However, neither requires the humans themselves to perform any collective computations. In other words, more (quantity) and smarter (quality) humans can be recruited for RLHF and RRM, but they are not wired together in a way that allows the group to solve harder problems than any individual can. In that sense, RLHF and RRM are forms of CHI that are low on coordination.
Factored Cognition (FC), HCH, and Iterated Distillation and Amplification (IDA) are also human-based computation implementations of collective intelligence. They are essentially three instantiations of an ‘AI assistance’ subbranch of human-based computation approaches. In contrast, CHI is specifically about enhancing collective intelligence of human type. This can be done through the use of digital platforms to solve coordination problems, but the actual key computations are then still done by the humans themselves. Conversely, FC-based approaches utilize artificial intelligence to do at least some subset of the key computations. This raises the question of how to keep these AI assistants themselves aligned. CHI sidesteps this issue by avoiding the use of non-human, general intelligence.
Shard Theory shares grounding with CHI in modeling humans computationally in order to extract ideas on how AI might be aligned. However, it focuses on modeling single humans instead of interhuman alignment. Additionally, the shard framework of human psychology has so far not been mapped to existing psychology, cognitive science, or neuroscience research. Thus it is unclear how well the model holds up as a description of human value formation and alignment. In contrast, in the CHI research agenda, humans will be modeled as a distributed, emergent intelligence, both at the single human and interhuman level. The goal would be to extract an algorithm for specific interhuman alignment behavior in the hope that this algorithm can then be implemented in modern machine learning structures. Essentially, one of the results of CHI research (see the “Modeling Interhuman Alignment” experiment) may be a competing framework to Shard Theory.
Brain-like AGI also endeavors to extract insights from human intelligence in order to find solutions to the alignment problem. However, while it relies on neurological insights into the brain’s functionality, CHI would focus on developing algorithms that behaviorally predict humans along specific interhuman alignment axes. That said, brain-like AGI approaches are closely related and possibly a strong source of inspiration for the human intelligence modeling projects in CHI.
Cyborgism aims to enhance human cognition by creating human-in-the-loop systems with LLMs, where the human provides robustness, goal-directedness, grounding in reality, and long-term coherence. It is a method to enhance CHI by enriching the abilities of individual researchers. Its proponents also suggest future possibilities for a form of hivemind, which would essentially be a joint increase in the quality and coordination of the humans in CHI. Unlike FC-based approaches, Cyborgism avoids alignment problems within its methodology because it avoids increasing the agency of the AI. However, its proponents do point out that this research path carries a relatively high risk of dual use for both alignment and capabilities research.
Experiments
Collective Human Intelligence can be modeled as a distributed network where each node is a human, and each connection is any form of communication between two or more nodes. The connections between the nodes and the protocols that are run will be referred to as coordination. The general intelligence of the collective is thus a function of the quantity of nodes, the quality of the nodes, and the coordination between them. Quantity simply denotes the number of humans that are wired together in a collective. Quality denotes the entire multi-dimensional spectrum of traits that describe a single human. Lastly, coordination refers to the construct of communication channels across the entire collective.
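To make this framing concrete, here is a minimal sketch in Python; the single quality scalar and single coordination coefficient are simplifying assumptions of mine, standing in for the much richer real constructs:

```python
import statistics
from dataclasses import dataclass

@dataclass
class Node:
    """One human in the collective, reduced here to a single quality scalar
    (the real construct is the full multi-dimensional trait spectrum)."""
    quality: float

def collective_score(nodes: list[Node], coordination: float) -> float:
    """Toy collective-intelligence score: quantity times mean quality,
    discounted by a coordination coefficient in [0, 1] that stands in
    for the full structure of communication channels."""
    quantity = len(nodes)
    mean_quality = statistics.mean(n.quality for n in nodes)
    return quantity * mean_quality * coordination

# A well-coordinated small group can outscore a poorly coordinated
# larger one, echoing the coordination bottleneck discussed above.
print(collective_score([Node(1.0)] * 10, coordination=0.9))  # 9.0
print(collective_score([Node(1.0)] * 25, coordination=0.3))  # 7.5
```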
Subsequently, there are three core research directions (quantity, quality, coordination) to explore to improve CHI, while a fourth research direction encompasses understanding CHI in general. Understanding CHI is possibly a valuable pursuit in itself, because CHI is a form of interhuman alignment. Interhuman alignment is a naturally occurring example of motivated cooperation between intelligent agents.
So what does CHI look like when applied to the alignment problem?
Essentially, there is a small, core group of experts (AIS researchers) working on the problem, embedded in crucial supportive tissue (ops, mental health, field builders, etc.), while most of humanity is performing unrelated tasks. The exception is RLHF, where “wisdom of the crowds” is used as the human analogue to raw compute: large quantities of human-based computation are distributed across the masses. Thus, if we look at what levers we can pull to increase CHI across quantity, quality, and coordination, we see the following:
Quantity of nodes can be increased by finding more AIS researchers. Wisdom of the crowds is already being utilized quite well. Junior and mid-level researchers are being recruited actively. But genius-level, senior researchers are rare and hard to divert to a new field. Consequently, the quantity proposal is focused on this underexplored area of “Adding Top Minds”.
Quality of nodes can be increased by improving the selection and training of AIS researchers. With the current GPT craze ongoing, there will be a great influx of interested researchers into the field. Ideally we’d scale up our ability to select the most promising individuals and improve our ability to train them in key skills unique to the field. Thus there are two experiments aimed at improving the quality of CHI: “Enhance Selection” and “Enhance Training”.
Coordination of nodes can be increased for both experts and crowds. On the one hand, we can explore how to improve collaboration among AIS researchers, while on the other hand we can look into the potential of CHI to enhance RLHF through human swarming techniques. Thus there is a proposal to “Coordinate Researchers” and to “Coordinate Crowds”.
Understanding the interhuman alignment of CHI then becomes a matter of prying apart the different alignment mechanics embedded in the human mind. This may enhance our collective intelligence while increasing the AI’s alignment. It’s a research proposal that reflects and encompasses the other parts of the research agenda. It is also the research direction that directly aims to solve the problem of alignment by endeavoring to “Model Interhuman Alignment”.
General Methodology
Before launching into the details of each experiment, I’d like to add a few words on the general research methodology in this agenda. For my upskilling period I committed to honoring the principles of transparency, exploration, and paradigmicity. For this research agenda, I commit to honoring the principles of playfulness, iteration, and collaboration.
Playfulness is a form of exploration that is entirely curiosity-driven. It’s inherently less goal-directed and more novelty-seeking than a regular work mindset.
Iteration is about starting small, failing often, and incrementally improving and hill climbing to better solutions. It maximizes data gathering and contact with reality.
Collaboration is about working together to leverage group intelligence and synergize skills across individuals. A research agenda about CHI will obviously leverage that CHI where possible by exploring where experts and crowds can work together on solving (sub)problems in alignment.
For each of the following six research directions, I expect to play around with the problem, iterate on solutions, and collaborate with other researchers to find new angles and handles.
1. Add Top Minds
Background: Most recruitment efforts for AI alignment predominantly target young researchers or individuals from closely related fields. While the ATLAS fellowship and student field building initiatives specifically target younger minds, incubators and mentorship projects like SERI-MATS, Refine, and AI Safety Camp inadvertently attract a similar demographic. PIBBSS comes out at the high end, hitting a slightly more senior demographic of mostly PhD and post-doc researchers while also drawing in people from other research fields. Conversely, most attempts to attract top researchers have been constrained to related fields such as computer science and mathematics. Scott Aaronson is one high-profile convert. Apart from that, there have been various summits to bring together top technology leaders and researchers on the problem, but results have been mixed (e.g., the Asilomar Conference).
In contrast, recruiting top minds from unrelated fields of study has been underexplored so far. By the very definition of general intelligence, we would expect a Nobel Prize winner in one field to also become a top researcher in any other field they applied themselves to for a considerable amount of time. However, you wouldn’t normally expect to divert top minds from their preferred field of study. Yet, if some low percentage of such genius-level thinkers were converted by a combination of realizing the potential existential risk of AGI and being nerd-sniped into the field, then this might yield significant progress in AI alignment.
Thus, imagine we approached the top 200 minds in the world and invited them to consider some subproblems of AI alignment as part of a project to bring interdisciplinary insights to a new field. On the one hand, we might gain valuable insights from these discussions. On the other hand, if only 1-5% converted of their own accord, then we’d have gained 2-10 genius-level minds to work on the problem.
H1: Adding top minds from other fields to the alignment problem will accelerate alignment research.
Methodology: Initial interviews would be done with local researchers from UC Berkeley who have distinguished themselves recently. Five to ten researchers would be approached over email with a proposal to do a one-hour interview on the interdisciplinary application of their expertise to other fields of study. Each email and each interview will be an iterative process to discover what structure yields the best results.
Emails will be assessed on how many researchers agree to an interview. They will contain a link to a 10-minute introductory reading on alignment (to be determined) as a threshold to gauge sufficient buy-in. Emails will contain references to the work of the specific researcher and why we would be excited to hear their views on the alignment problem.
Interviews will be assessed on how useful the interview content was for the alignment community, and on whether the researcher follows up after the interview with a show of interest in alignment. Interviews will consist of selected, practical questions about the alignment problem. For instance, I might offer a case study on deceptive alignment, probe their thoughts on the relevance of situational awareness, or explore their intuitions around shutdownability.
Write-ups of each interview would be published on LessWrong initially. If the project is a success and more prestigious researchers start participating, then I would look into approaching broader platforms to publish the interviews. Write-ups will be assessed by engagement (upvotes & comments) and citations.
If this research approach is successful, then I would want to start approaching some of the top minds in the world that haven’t heard about alignment yet, and who are currently working in unrelated fields. This phase of the research would involve either extensive travel, or setting up an event to bring the relevant people to Berkeley.
Possible Results: This experiment can yield roughly four possible results. First, I might fail to get interviews with anyone. This should become obvious quite quickly, and thus I’d only lose about 2-4 weeks of funding and opportunity cost. Second, I might get interviews with local researchers but no useful follow-up. The cost would then be about 1-2 months of research time. Third, results may be useful with local researchers, but the approach does not actually scale to genius-level, global researchers. I’d expect this to become apparent within about 3 months of research time. Lastly, the project may be wildly successful and attract both local and global top minds to alignment. In that case, I’d expect on the order of 2-10 genius-level researchers to join the field. This would be an amazing boost to our ability to solve the alignment problem on schedule.
2. Enhance Selection
Background: Alignment is unique as a scientific problem because it needs to be solved on a deadline, while the closer we get to AGI the less we can iterate on solutions. Predictive work on a deadline is not how science is normally done. On top of that, alignment research concerns an existential risk, which poses an unusual emotional challenge that researchers don’t commonly face. Thus, we might need our researchers to have atypical traits compared to the traits we normally see in high performing researchers. Yet what would those traits be?
There are three ways to go about determining these traits: quantitative, qualitative, or behavioral research. In quantitative research we would scrape the writings of successful alignment researchers to determine what traits they excel at compared to the norm. In qualitative research we would interview top alignment researchers on what traits they have found most predictive of research success in alignment. In behavioral research we would look at what behavioral steps precede a researcher’s growth into a high-performing alignment scholar.
H2: There are testable traits that make people better at alignment research, such that we could use them for funding and training selection.
Methodology: Iterate on textual analysis by LLMs to determine traits of researchers that may be predictive of success in alignment research. Initial explorations would consist of feature extraction via textual analysis by LLMs, supplying them with small samples of various research papers and essays from both highly successful and less successful researchers. Ideally, the essays and papers are not in the training set of the LLM, to avoid confounders in the feature extraction.
A tiny example analysis was attempted with GPT-4 by feeding it the summaries of a highly upvoted and a barely upvoted post on LLMs published on the Alignment Forum. The initial feature extractions were verbose and not insightful. Prompting toward shorter trait descriptions with explicit scoring improved the profiling attempt significantly. Finally, asking the model to guess which essay was the most popular and why led to a correct guess with a plausible explanation. The model explained that the first essay was likely more popular because the findings were exceptionally novel, while the second article was probably less popular because the results mostly appeal to a specialized audience.
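To illustrate the intended workflow, here is a minimal sketch of the scoring loop; `score_with_llm` is a hypothetical stand-in for whichever LLM API is used (GPT-4 in the pilot above), and the trait list is illustrative rather than validated:

```python
TRAITS = ["novelty", "clarity", "rigor"]  # illustrative, not validated

PROMPT_TEMPLATE = (
    "Rate the following essay from 1 to 10 on each of these traits: "
    "{traits}.\nReply with one 'trait: score' line per trait.\n\n"
    "Essay:\n{essay}"
)

def score_with_llm(prompt: str) -> str:
    # Hypothetical placeholder: plug in your LLM API of choice here.
    raise NotImplementedError

def extract_trait_profile(essay: str) -> dict[str, int]:
    """Prompt the model for short, explicitly scored trait descriptions
    and parse the reply into a trait -> score profile."""
    reply = score_with_llm(
        PROMPT_TEMPLATE.format(traits=", ".join(TRAITS), essay=essay)
    )
    profile = {}
    for line in reply.splitlines():
        trait, _, score = line.partition(":")
        if trait.strip().lower() in TRAITS and score.strip().isdigit():
            profile[trait.strip().lower()] = int(score.strip())
    return profile

# Profiles extracted from many essays can then be correlated with an
# outcome measure such as Alignment Forum karma.
```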
Possible Results: The example analysis already showed some promising results. Thus I would expect that, at minimum, GPT-4 can predict the popularity of alignment research using 1-3 factors. The least useful version of this would be if those factors are “novelty” (because we already know that one) plus circumstantial items like writing ability. The most promising result would be a weighted factor model that fairly accurately predicts how popular an essay will be, where each factor represents a content-related axis that a researcher can be selected or trained on.
3. Enhance Training
Background: To better train alignment researchers, you need to know what to train them on. Thus this experiment should only be executed as a follow-up to the previous one (“Enhance Selection”). That experiment contains the steps to identify the key traits we want to select for, and also the ones we want to enhance. Existing training programs side-step this question by relying on the expert judgement of mentors (e.g., SERI-MATS, Refine, AI Safety Camp, and PIBBSS).
H3: We can improve the unique traits necessary for excellent AIS research in a given human.
Methodology: Ideally this research direction would follow after the previous one on Enhancing Selection, as that would provide us with a concrete set of traits to target for improvement. Regardless, three traits seem obviously helpful to train in AI Safety researchers: creativity (to help find breakthrough discoveries), security mindset (to foresee failure modes), and emotional resilience (to cope with existential risk). For each trait a set of interventions would be developed through an iterative process.
Creativity can be increased by running sessions on cross-pollination from other research fields into AI safety.
Security mindset may be increased by hosting sessions on red teaming of research ideas.
Emotional resilience may be increased by developing workshops in collaboration with EA-related therapists (e.g. Clearer Thinking and HIPsy) to foster mental health.
Possible Results: If the sessions are helpful then the participating researchers would show greater research productivity and a longer lifetime in the field. HIPsy is currently exploring ways to assess researcher productivity, which would make for a suitable KPI for this experiment.
4. Coordinate Researchers
Background: Formally, academic researchers coordinate through universities, conferences, research publications, and research institutes. Informally, they coordinate through code platforms like GitHub and discussion platforms like Twitter.
The AI alignment community additionally has its own coordination infrastructure. Lightcone Infrastructure offers LessWrong and the Alignment Forum to cover the communication gap between low-threshold Twitter posts and high-threshold academic publications. These platforms cater to discussions of thoughts and findings that are not necessarily enshrined in the precision and thoroughness of months of research, while offering far more bandwidth and depth than the free-for-all arena of a Twitter crossfire.
Additionally, various coordination platforms have cropped up in the EA sphere. The AI Safety Ecosystem incubates coordination platforms such as AI Safety Ideas and AI Safety world. Nonlinear offers ways for researchers and funders to match. AI Safety Support provides a jumping off point for people interested in joining the field.
Yet these coordination platforms are mostly about matching researchers, supporting discussion, or building a pipeline for the field. Conversely, the question of how to epistemically coordinate researchers has been largely unanswered. Nate Soares makes the point that AIS researchers “don’t stack”, yet I wonder if we have just lacked the tools to stack them. It might be the case that research methods can be enhanced by building structured collaboration tools. The simplest of these are wikis—they allow crowds to aggregate their knowledge far past anything a select group could achieve. Similarly, I’d like to explore what other platforms we might create for epistemic coordination among alignment researchers.
H4: Platforms can be created to significantly increase epistemic research coordination among alignment researchers.
Methodology: Try out quick, structured iterations on possible collaboration platforms. Two interesting starting points are a universal ontology of ideas and an Urban Dictionary of alignment terms.
A universal ontology of ideas can be iterated on quickly, playfully, and collaboratively through group drawing sessions, to explore whether any ontology exists at a sufficient level of abstraction to allow for the expression of most people’s internal model of the alignment problem. We would try out different drawing and writing exercises with groups of AIS researchers to see if their ontologies can be made to converge. My expectation would be that if this is possible, then the end result is a form of UML for ideas and concepts.
An Urban Dictionary of Alignment Terms would be a simple website that allows users to freely submit alignment terms and definitions, as well as upvote their favorite definitions. Engineering talent for the project would be sourced from the AI Alignment Ecosystem Project.
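As a sketch of how small the initial version could be, here is a minimal in-memory backend; Flask and the endpoint names are assumptions of mine, and a real deployment would add persistence, authentication, and rate limiting:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
definitions = []  # in-memory store; each entry: {"id", "term", "definition", "votes"}

@app.post("/define")
def submit_definition():
    # Anyone can freely submit a term plus a candidate definition.
    data = request.get_json()
    entry = {"id": len(definitions), "term": data["term"],
             "definition": data["definition"], "votes": 0}
    definitions.append(entry)
    return jsonify(entry), 201

@app.post("/upvote/<int:entry_id>")
def upvote(entry_id: int):
    # Upvote a favorite definition.
    definitions[entry_id]["votes"] += 1
    return jsonify(definitions[entry_id])

@app.get("/terms/<term>")
def lookup(term: str):
    # Return all definitions for a term, most-upvoted first.
    matches = [d for d in definitions if d["term"] == term]
    return jsonify(sorted(matches, key=lambda d: d["votes"], reverse=True))

if __name__ == "__main__":
    app.run(debug=True)
```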
New ideas for collaborative platforms would also be explored.
Possible Results: Ideally one or more platforms would be adopted for use in the community. The universal ontology project would allow researchers to more effectively communicate about their work. The Urban Dictionary of Alignment Terms would cut down on confusions between researchers and time spent defining terms for each essay.
5. Coordinate Crowds
Background: RLHF is currently one of the foremost methods to align AI. It relies on the ability of a single human to evaluate a given task, and thus hits a wall when tasks become too complex. OpenAI is one of the main proponents of using RLHF to align AI. They currently plan to implement Scalable Oversight with AI assistants that help human evaluators judge tasks that are beyond their individual abilities. These AI assistants would be implemented through Factored Cognition and Iterated Distillation and Amplification, with the human evaluations then fed into a Recursive Reward Model. The challenge in Scalable Oversight is that the AI assistants themselves would need to remain aligned for these approaches to work. Misalignment can occur through all the usual channels, while additionally there may be alignment drift from stacked errors in the task decomposition and training, as well as unintentional misalignment if relevant evaluation factors cannot be expressed in the task decomposition but are only visible in the whole.
Instead of using AI assistants, I’d like to explore using Swarm Intelligence algorithms to significantly increase the upper bound on what tasks we can successfully evaluate with RLHF. To my knowledge, swarm intelligence implemented as human-based computation has not yet been applied to the alignment problem.
A Particle Swarm Optimization algorithm can be implemented with human evaluators as the particles. Such an algorithm searches the solution space by uniformly spreading the particles across it and having each particle update its location based on local hill climbing and global communication with the other particles. This makes the algorithm relatively robust to local optima. The solution space will be generated by taking the AI’s output (a “point” in solution space) and adding a controlled, procedural level of noise to this output. The result is a set of alternative outputs distributed around the original output. This noise hopefully functions to obscure most implementations of deception or manipulation that an AI could achieve by crafting its outputs. Next, the human evaluators each get a different output from the set and hill climb their way to the most optimal plan they can figure out. If the humans fail to converge on one solution, then the original plan was not obviously the optimal one from a human point of view, and may have contained deception or manipulation. Using such an algorithm, we place an upper bound on the complexity of the outputs we will implement from an AI.
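The sketch below illustrates this evaluation loop with a simplified swarm update (no velocity or inertia terms, and plans reduced to numeric vectors); `human_improve` is a hypothetical placeholder for a human evaluator hill climbing on their assigned variant:

```python
import numpy as np

rng = np.random.default_rng(0)

def human_improve(variant: np.ndarray) -> tuple[np.ndarray, float]:
    """Hypothetical placeholder: a human evaluator hill-climbs on their
    assigned variant and returns (improved_variant, quality_rating)."""
    raise NotImplementedError("replace with an actual human-in-the-loop step")

def swarm_evaluate(ai_output: np.ndarray, n_humans: int = 20,
                   noise: float = 1.0, rounds: int = 10,
                   tol: float = 0.1) -> bool:
    # Generate the solution space: noisy variants of the AI's output,
    # spread across the human evaluators (one variant per "particle").
    particles = ai_output + rng.normal(0.0, noise, (n_humans, ai_output.size))
    best, best_score = particles[0].copy(), -np.inf
    for _ in range(rounds):
        # Local step: each human improves their own variant.
        for i in range(n_humans):
            particles[i], score = human_improve(particles[i])
            if score > best_score:
                best, best_score = particles[i].copy(), score
        # Global step: everyone drifts toward the swarm's current best.
        particles += 0.3 * (best - particles)
    # Convergence check: if the humans failed to agree on one solution,
    # the original plan was not obviously optimal from a human viewpoint.
    spread = float(np.max(np.linalg.norm(particles - best, axis=1)))
    return spread < tol
```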
H5: Human-based computation implementations of swarming algorithms can be used to significantly improve RLHF.
Methodology: To test the above algorithm playfully, iteratively, and collaboratively, we will first generate alignment solutions with GPT-4, add noise, and then hill climb through the solutions in two parallel channels. If this is successful, we can iterate on a similar experimental setup with one research collaborator. With each successful step, the experiment could be scaled up to more complex plans and more collaborators.
Possible Results: There are three possible failure modes to overcome if the above approach is to be successful. First, the safe level of noise that needs to be added to a solution might make it impossible for a single human to reconstruct the plan. Secondly, it might be the case that deceptions can be inherent in the solution space such that humans reconstruct the deception itself as well as the solution. Lastly, evaluating solutions using swarms of humans may simply be too expensive and slow to be useful for AI alignment.
However, if the method does succeed, then it might be a reliable way to guard ourselves against deceptions and manipulations inherent in the suggested plans of an AI. It would be an answer to Nate Soares’s model of deception as a natural characteristic of the solution landscape rather than a purposeful act. Deception and manipulation might be unavoidably generated by an AI because they are optimal. However, if we only implement plans we can fully understand and reconstruct, then deception and manipulation can be avoided in the implementation of plans (note that this does not address deceptive alignment, where an AI pipes the deception through actions other than the reviewed output).
6. Model Interhuman Alignment
Background: Interhuman alignment refers to the phenomenon where humans align their goals, beliefs, and emotions, both consciously and subconsciously, to create shared understanding and promote cooperation. This alignment is critical for successful social interactions and can be observed in various interpersonal settings, including family relationships, friendships, and professional collaborations. In this research, we aim to investigate the computational structure of interhuman alignment in the brain, which could potentially be used as a basis for developing an effective alignment paradigm. As an example, the neural basis of mother-child bonding is hypothesized to underlie all pair-bonding in mammals. It’s partly initiated by pregnancy and childbirth, but also partly behaviorally influenced, as is seen in foster mothers who develop caregiving behaviors similar to those of biological mothers. This response is at least in part hormonally mediated through the oxytocin channel. Additionally, neural correlates of parent-child bonding have also been found in co-parenting dyads. Behaviorally, it’s expressed through synchronicity and reciprocity between mother and child. Even though the parent-child bond is not yet fully understood, there is a lot of information on the bits and pieces that make up the neural and hormonal structure of this type of interhuman alignment. Similarly, the cases of willingness to wirehead, emotional and epistemic group convergence, and emotional homeostasis can be explored.
H6: Interhuman alignment has a computational structure in the brain that can be used as inspiration for an effective alignment paradigm.
Methodology: Review the neuroscience and cognitive science literature on parent-child bonding. Next, use that information to model the algorithmic structures that result in emotionally driven alignment between two agents. I would like to explore various modeling frameworks for this research direction, like economic agents, game of life, and others. Then lastly repeat the process for willingness to wirehead, emotional and epistemic group convergence, and emotional homeostasis.
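As a toy illustration of what such a modeling frame might look like (not a validated model; all parameters and functional forms are assumptions of mine), consider a caregiver whose effective reward couples to the child’s welfare via a bond strength that is itself updated by behavioral synchrony, loosely analogous to the oxytocin-mediated feedback described above:

```python
def caregiver_reward(own_welfare: float, child_welfare: float,
                     bond: float) -> float:
    """Utility coupling: the stronger the bond, the more the child's
    welfare dominates the caregiver's effective reward."""
    return (1 - bond) * own_welfare + bond * child_welfare

def update_bond(bond: float, synchrony: float, rate: float = 0.1) -> float:
    """Bond strengthens under synchronous, reciprocal interaction
    (synchrony > 0.5) and decays slowly without it."""
    return min(1.0, max(0.0, bond + rate * (synchrony - 0.5)))

bond = 0.2  # partly initiated by pregnancy/birth, partly behavioral
for step in range(20):
    synchrony = 0.9  # repeated reciprocal interaction with the child
    bond = update_bond(bond, synchrony)
print(round(bond, 2))  # 1.0: self-sacrifice becomes reward-optimal
```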
Possible Results: I think it’s plausible that I uncover at least some insight into the computational structure of the different forms of interhuman alignment. Most likely these insights will not directly help to align AGI, but in the best-case scenario a breakthrough insight might be found.
Discussion
Overall, the CHI research agenda has three main weaknesses: 1) four of the six experiments do not address alignment directly, 2) exploratory research is high risk, high reward, and 3) it may not be possible to keep up with AI without using AI. Below, each point is discussed in turn.
Four out of six experiments do not address alignment directly. Apparently there are currently around 300 alignment researchers versus ~100,000 capabilities researchers. Thus we clearly need more people working on alignment research. Yet, increasing the quantity and quality of CHI are specifically field-building and support approaches—they focus on enabling other researchers to go and actually fix the problem. Thus, the failure mode here is that this strategy doesn’t generalize. Instead, we would all be passing the buck in some twisted variant of a pyramid scheme of impact: You will have great impact on the alignment problem if you recruit more people to work on it, who will have great impact if they recruit more people to work on it, etc. So how do we get more leaf nodes—the people who actually “pay the bill” and work on alignment themselves?
I think the above reasoning is broadly correct, but there is a difference in which specific field building and support activities are being done and how they are implemented. “Add Top Minds”, “Enhance Selection”, and “Enhance Training” are field building initiatives that specifically target underexplored areas, and that are structured iteratively to incrementally find out what the best approach may be. “Coordinate Researchers” is a support-type experiment that addresses epistemic coordination and would have clear KPIs that relate to actual research progress. Lastly, “Coordinate Crowds” and “Model Interhuman Alignment” do actually touch on alignment directly: the former by improving existing methodologies to increase runway, and the latter by exploring an entirely new alignment paradigm.
Exploratory research is high risk, high reward. There is a decent chance that the CHI agenda is a dead end. It is slightly sideways from the obvious research paths because I specifically set out to find a new direction to explore. Ideally I would work on this research agenda for a year, explore 4 different research directions across 6 experiments, and then … I’m still likely to have no positive results? Yet I’d argue that in the current pre-paradigmatic field, we need to invest in these types of explorations. You want to send out dozens of “probes” into different sections of solution space, and most of those probes will turn up nothing. I’m not arguing this part of solution space is unusually promising. I’m arguing it’s underexplored and could lead to useful insights. I think that’s a good enough bet for where we are at now.
What if we can only keep up with AI by using AI? The CHI agenda specifically revolves around exploring sections of solution space that do not rely on AI assistance. If that turns out to be a weak strategy, I’d still argue that any increase in our CHI can subsequently be multiplied by adding in (hopefully) safe AI assistance. In that sense, CHI and AI assistance are not mutually exclusive. If we end up needing both, then at least we have done work on the CHI side to make things better.
Conclusion
The Collective Human Intelligence research agenda takes the angle of studying and optimizing the greatest general intelligence currently in existence: humanity. On the one hand, we can create more runway to solve alignment if we make our collective smarter. On the other hand, there is a small chance that we can glean insights into the computational structure of alignment by studying how humans instinctively align with each other.
This is a broad scope for a research agenda. It breaks down into 2 pillars, 4 research directions, and 6 experiments. I would like to pursue each in an iterative, playful, and collaborative fashion. I would be excited to talk with anyone who would like to collaborate on these experiments. Feel free to reach out, and thank you for reading.
Special thanks to Alex Gray, Leon Lang, Spencer Greenberg, Inga Grossmann, Ryan Kidd, Thomas Kwa, Aysja Johnson, Joe Collman, Plex, Peter Barnett, and Erik Jenner for reviewing the first draft of this document. The current version is much changed and much improved thanks to their collective input. (Views and errors are my own.)
- ^
This is the first line of the Wikipedia article on Mathematical Optimization. I tried to find the original source, but the Wayback Machine broke down. Googling then showed the entire first page of results to be repetitions of this exact definition. Definitions are good if we all agree on the meaning, so that seems like sufficient grounding.
- ^
Daniel Kokotajlo added these caveats in the comments, with which I agree:
I think some caveats and asterisks need to be added to the claim that humanity is far more intelligent than a single human.
In some domains (such as breadth of knowledge, or ability-to-design-complicated-machines-with-many-parts-like-rockets), humanity >> the best humans.
However, in other domains, such as long-horizon coherence/agency, ability to prioritize between projects/goals, ability to converge to the truth on certain controversial or emotionally loaded subjects, and thinking speed, humanity is arguably < the average human, and certainly << the best humans in those domains.
Case in point: Humanity is currently on a path to destroy itself by building unaligned AGI. Previously it narrowly avoided destroying itself via nuclear war. I don’t know about the average human, but I’m confident that e.g. the top 1% most sensible/wise/competent humans would have been able to easily avoid both problems, on a desert island containing only a handful of such humans + the relevant weaponry and supercomputers and code.