Supervise Process, not Outcomes
We can think about machine learning systems on a spectrum from process-based to outcome-based:
Process-based systems are built on human-understandable task decompositions, with direct supervision of reasoning steps.
Outcome-based systems are built on end-to-end optimization, with supervision of final results.
This post explains why Ought is devoted to process-based systems. The argument is:
In the short term, process-based ML systems have better differential capabilities: They help us apply ML to tasks where we don’t have access to outcomes. These tasks include long-range forecasting, policy decisions, and theoretical research.
In the long term, process-based ML systems help avoid catastrophic outcomes from systems gaming outcome measures and are thus more aligned.
Both process- and outcome-based evaluation are attractors to varying degrees: Once an architecture is entrenched, it’s hard to move away from it. This lock-in applies much more to outcome-based systems.
Whether the most powerful ML systems will primarily be process-based or outcome-based is up in the air.
So it’s crucial to push toward process-based training now.
There are almost no new ideas here. We’re reframing the well-known outer alignment difficulties for traditional deep learning architectures and contrasting them with compositional approaches. To the extent that there are new ideas, credit primarily goes to Paul Christiano and Jon Uesato.
We only describe our background worldview here. In a follow-up post, we explain why we’re building Elicit, the AI research assistant.
The spectrum
Supervising outcomes
Supervision of outcomes is what most people think about when they think about machine learning. Local components are optimized based on an overall feedback signal:
SGD optimizes weights in a neural net to reduce its training loss
Neural architecture search optimizes architectures and hyperparameters to have low validation loss
Policy gradient optimizes policy neural nets to choose actions that lead to high expected rewards
In each case, the system is optimized based on how well it’s doing empirically.
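As a minimal sketch of what "optimized based on how well it's doing empirically" means, the toy loop below fits a single weight by gradient descent on a squared-error loss. The function name, data, and hyperparameters are illustrative assumptions and are not taken from any of the systems listed above.

```python
# Toy outcome-based optimization: the only supervision signal is the final loss.
def train_outcome_based(data, steps=200, lr=0.05):
    """Fit y ~ w * x by gradient descent on mean squared error."""
    w = 0.0
    for _ in range(steps):
        # Gradient of mean((w * x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


# Hypothetical data generated by y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train_outcome_based(data)
```

Nothing in the loop inspects why a given value of w is good; the weight is kept only because it lowers the measured loss.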
MuZero is an example of a non-trivial outcome-based architecture. MuZero is a reinforcement learning algorithm that reaches expert-level performance at Go, Chess, and Shogi without human data, domain knowledge, or hard-coded rules. The architecture has three parts:
A representation network, mapping observations to states
A dynamics network, mapping state and action to future state, and
A prediction network, mapping state to value and distribution over next actions.
Superficially, this looks like an architecture with independently meaningful components, including a “world model” (dynamics network). However, because the networks are optimized end-to-end to jointly maximize expected rewards and to be internally consistent, they need not capture interpretable dynamics or state. It’s just a few functions that, if chained together, are useful for predicting reward-maximizing actions.
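To make "a few functions that, if chained together, are useful" concrete, here is a rough sketch of how three such functions compose. It follows the simplified description above rather than the actual MuZero implementation; the name `unroll`, the type aliases, and the toy stand-in functions are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

HiddenState = List[float]               # opaque vector; nothing forces it to be interpretable
Prediction = Tuple[float, List[float]]  # (value, distribution over next actions)


def unroll(
    representation: Callable[[Sequence[float]], HiddenState],
    dynamics: Callable[[HiddenState, int], HiddenState],
    prediction: Callable[[HiddenState], Prediction],
    observation: Sequence[float],
    actions: Sequence[int],
) -> List[Prediction]:
    """Chain the three functions along a sequence of hypothetical actions."""
    state = representation(observation)
    outputs = [prediction(state)]
    for action in actions:
        state = dynamics(state, action)
        outputs.append(prediction(state))
    return outputs


# Toy stand-ins for the learned networks (purely illustrative).
def toy_representation(obs: Sequence[float]) -> HiddenState:
    return [sum(obs)]


def toy_dynamics(state: HiddenState, action: int) -> HiddenState:
    return [state[0] + action]


def toy_prediction(state: HiddenState) -> Prediction:
    return (state[0], [0.5, 0.5])


outputs = unroll(toy_representation, toy_dynamics, toy_prediction, [1.0, 2.0], [1, 0])
values = [value for value, _ in outputs]
```

In an end-to-end setup, only the quality of the final predictions is trained on, so the intermediate `state` is free to encode whatever helps, whether or not it resembles the real environment state.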
Neural nets are always in the outcomes-based regime to some extent: In each layer and at each node, they use the matrices that make the neural net as a whole work well.
Supervising process
If you’re not optimizing based on how well something works empirically (outcomes), then the main way you can judge it is by looking at whether it’s structurally the right thing to do (process).
For many tasks, we understand what pieces of work we need to do and how to combine them. We trust the result because of this reasoning, not because we’ve observed final results for very similar tasks:
Engineers and astronomers expect the James Webb Space Telescope to work because its deployment follows a well-understood plan, and it is built out of well-understood modules.
Programmers expect their algorithms to implement the intended behavior because they reason about what each function and line does and how they go together to bring about the behavior they want.
Archeologists expect their conclusions about the age of the first stone tools to be more or less correct because they can reason about the age of the sediment layer the tools are in. They can estimate the age of the layers by looking at the iron-bearing minerals they contain, which record the earth’s magnetic polarity at the time they were preserved.
At Ought, we’ve been thinking about scientific literature review as a task that we expect to arrive at correct answers only when it’s based on a good process. When I’m trying to figure out whether iron supplements will help me or hurt me, I might start by following a process like this:
Clarify the question I’m trying to answer—what kind of iron, what kinds of supplements, what benefits am I hoping for? How will I decide whether to take the supplement or not?
Search for a list of candidate papers using the question and related search terms
For each study I find, answer:
Does it address the question I’m interested in, or a closely related question? Was the population studied similar to me?
Is it a randomized controlled trial, or a meta-analysis of trials?
Is the risk of bias below the threshold I’d accept? Are there no glaring critiques of the study or methodological limitations?
Throw out studies for which the answer isn’t yes to all questions
If any studies remain, synthesize them into a summary answer that explains the observed evidence
If not, relax my question and go back to the search step
Of course, this is far from a great process. For a slightly better example, see this systematization of Scott Alexander’s post on Indian Economic Reform.
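The screening part of the iron-supplement process above can be sketched in code. This is only an illustration under stated assumptions: the `Study` fields, the function names, and the example studies are hypothetical, and in a real system each boolean would itself be the output of a supervised sub-task rather than a given.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Study:
    title: str
    addresses_question: bool       # relevant question and similar population
    is_rct_or_meta_analysis: bool  # randomized controlled trial or meta-analysis
    acceptable_risk_of_bias: bool  # below my bias threshold, no glaring critiques


def passes_all_checks(study: Study) -> bool:
    """A study is kept only if the answer is yes to every screening question."""
    return (
        study.addresses_question
        and study.is_rct_or_meta_analysis
        and study.acceptable_risk_of_bias
    )


def screen(studies: List[Study]) -> Optional[List[Study]]:
    """Return the studies to synthesize, or None to signal 'relax the question and search again'."""
    kept = [s for s in studies if passes_all_checks(s)]
    return kept if kept else None


# Hypothetical candidate studies.
candidates = [
    Study("Trial A", True, True, True),
    Study("Observational B", True, False, True),
    Study("Trial C", False, True, True),
]
kept = screen(candidates)
```

Each step has a role that can be checked on its own, which is what makes it possible to supervise the process rather than only the final answer.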
To build a process-based system, the fundamental problem to solve is to reduce the long-horizon tasks we care about to independently meaningful short-horizon tasks (factored cognition). If we can do that, we can then generate human (or human-like) demonstrations and feedback for these sub-tasks.
This reduction to subtasks can be done by the system designer or, for better scalability, on the fly by the system itself. Task decomposition is another subtask, after all.
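A minimal sketch of on-the-fly decomposition, assuming the three helper roles (`decompose`, `answer_directly`, `combine`) are supplied by the caller. These names and the toy summation task are assumptions made for illustration and are not Ought's implementation; in a real system each helper would be a model call or a human judgment.

```python
from typing import Callable, List, TypeVar

Task = TypeVar("Task")
Answer = TypeVar("Answer")


def solve(
    task: Task,
    decompose: Callable[[Task], List[Task]],
    answer_directly: Callable[[Task], Answer],
    combine: Callable[[Task, List[Answer]], Answer],
    max_depth: int = 10,
) -> Answer:
    """Recursively reduce a task to short-horizon subtasks, then combine the sub-answers."""
    # Stop decomposing once the depth budget is used up.
    subtasks = decompose(task) if max_depth > 0 else []
    if not subtasks:
        return answer_directly(task)
    sub_answers = [
        solve(sub, decompose, answer_directly, combine, max_depth - 1)
        for sub in subtasks
    ]
    return combine(task, sub_answers)


# Toy task: sum a list by splitting it in half until the pieces are trivial.
def split_in_half(numbers: List[int]) -> List[List[int]]:
    if len(numbers) <= 1:
        return []  # small enough to answer directly
    mid = len(numbers) // 2
    return [numbers[:mid], numbers[mid:]]


total = solve([1, 2, 3, 4, 5], split_in_half, sum, lambda task, answers: sum(answers))
```

Because decomposition is itself just another callable, it can be supervised with demonstrations and feedback like any other subtask.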
In between process and outcomes
Many tasks can be approached in both ways, and in practice, most systems will likely end up somewhere in between. Examples:
Search engine:
Outcome-based: Embed documents and metadata in a vector space, same for queries. Use a neural net retriever. Optimize retriever parameters and embeddings for users giving high ratings to the retrieved documents.
Process-based: Define an idealized process for evaluating the quality of a search result for a given query, e.g. decomposing the evaluation of the question “Is this result trustworthy?” into PageRank-style considerations and questions like “Is the author an expert in this field?” Distill each of the subquestion modules into a neural net so that we can execute it at runtime.
In between: Start with the process-based approach, but use user scores to make a few choices, such as fitting parameters in a tiny MLP that mixes feature weights.
Question-answering:
Outcome-based: Train a neural net to map questions to answers, perhaps using Retro-style end-to-end-optimized retrieval.
Process-based: Independently train neural nets that map questions to web search queries, query responses to relevant extracts, and long answers to summary answers, each trained on human demonstrations or feedback.
In between: Follow the process-based approach, but (as in WebGPT), don’t imitate human queries; instead, just learn query strategies that lead to highly rated final answers end-to-end.
Business decision advisor:
Outcome-based: Train a MuZero-style neural net to make decisions about trades, product launches, and hiring, based on business returns or other long-term metrics of interest, optimizing for actions that look good in hindsight.
Process-based: Imitate actions that look good to a human supervisor in foresight, giving the human supervisor AI tools to do a thorough ex-ante evaluation.
In between: Imitate human actions chosen ex-ante, but use predicted long-term metrics to choose between several actions that all look similarly good in foresight.
Eric Drexler’s CAIS paints a picture of AI that is also somewhere between process and outcomes in that AI services have clearly defined roles on a larger scale, but are individually outcome-based.
It’s better to supervise process than outcomes
Why prefer supervision of process? If we don’t need to look at outcomes, then:
We can do well at long-horizon tasks where outcomes aren’t available (better differential capabilities)
We don’t run the risk of our outcome measures being gamed (better alignment)
Differential capabilities: Supervising process helps with long-horizon tasks
We’d like to use AI to advance our collective capability at long-horizon tasks like:
Multi-year and multi-decade forecasting, e.g., predicting long-term consequences of vaccines
Policy and governance, especially AI policy
Personal and institutional planning and decision-making
AI alignment research
Unfortunately, gathering outcome data is somewhere between expensive and impossible for these tasks. It’s much easier to gather data and exceed human capability at short-horizon tasks:
Keeping people engaged as they interact with videos and posts
Developing physical technologies, e.g., new toxic molecules
Persuading people in short conversations
Predicting 30-minute consequences, not 30-year consequences
In a world where AI capabilities scale rapidly, we need AI to support research and reasoning that is likely to make AI go better. This includes guiding AI development and policy. AI should help us figure out what’s true and make plans as much as it helps us persuade and optimize goals with fast feedback loops and easy specifications.
If we can reliably reduce such long-horizon tasks to short-horizon tasks, we’ll be better positioned to deal with the incremental development and deployment of advanced AI.
Alignment: Supervising process is safety by construction
With outcome-based systems, we’ll eventually have AI that is incentivized to game the outcome evaluations. This could lead to catastrophes through AI takeover. (Perhaps obvious to most readers, but seems worth making explicit: A big reason we care about alignment is that we think that, from our current vantage point, the world could look pretty crazy[1] in a few decades.)
What is the endgame for outcome-based systems? Because we can’t specify long-term objectives like “don’t cause side-effects we wouldn’t like if we understood them”, we’re using proxy objectives that don’t fully distinguish “things seem good” from “things are good”. As ML systems get smarter, eventually all of the optimization effort in the world is aimed at causing high evaluations on these proxies. If it’s easier to make evaluations high by compromising sensors, corrupting institutions, or taking any other bad actions, this will eventually happen.
Suppose instead that we understood the role of each component, and that each component was constructed based on arguments that it will fulfill that role well; or it was constructed and understood by something whose behavior we understood and constructed to fulfill its role. In that case, we may be able to avoid this failure mode.
This is closely related to interpretability and reducing risks from inner alignment failures:
If we can limit the amount of black-box compute, and the amount of uninterpretable intermediate state, we’re in a better position to know what each of the model components is doing. We view this type of progress as complementary to Chris Olah’s work on interpretability and ELK-style proposals for learning what models know. The better we are at decomposition, the less weight rests on these alternatives.
Inner alignment failures are most likely in cases where models don’t just know a few facts that we don’t but can hide extensive knowledge from us, akin to developing new branches of science that we can’t follow. With limited compute and limited neural memory, the risk is lower. Advancing process-based systems is helpful on the margin, even if we can’t fully eliminate outcome-based optimization.
In the long run, differential capabilities and alignment converge
Today, differential capabilities and alignment look different. Differential capabilities are starting to matter now. Alignment is a much less prominent issue because we don’t yet have AI systems that are good at gaming our metrics.
In the crazy future, when automated systems are much more capable and make most decisions in the world, differential capabilities and alignment are two sides of the same coin:
We either can’t use AI for most tasks we care about if all we know is how to design outcome-based architectures (lack of capabilities), or
We have highly effective systems optimizing for flawed objectives, which can lead to catastrophic outcomes (misalignment)
People sometimes ask: Is Ought working on differential capabilities (making ML useful for supporting reasoning) or alignment (avoiding risks from advanced AI)? From the perspective of intervening by advancing process-based systems, these two causes are fundamentally tied together.
Two attractors: The race between process- and outcome-based systems
Outcome-based optimization is an attractor
In some sense, you could almost always do better through end-to-end training, at least according to any one metric. You start with a meaningful task decomposition, track a global metric, and then backpropagate to make the system better along that metric. This messes with the meaning of the components and soon, they can’t be interpreted in isolation anymore.
We expect that, at some point, there will be strong pressure to optimize the components of most digital systems we’re using for global metrics. The better we are at building process-based systems, the less pressure there will be.
Process-based optimization could be an attractor, too
The good crazy future is one with an ecosystem of AIs made out of components with roles that are in principle human-understandable, with each component optimized based on how well it accomplishes its local role.
Advanced process-based systems could self-regulate to remain process-based, which makes them a local attractor:
Whenever an action is chosen within the process-based system, it comes from an action suggester along with reasoning for why it’s good for the system to implement this action
This suggester could propose to make local changes, like changing some weights, just because empirically they’ll improve the quality of overall results along some metric, even if it makes the system less modular and interpretable
This proposal and the reasoning for it would then get evaluated by another part of the system that looks for errors and catches and fixes them before they matter
This evaluator would evaluate the costs and benefits of implementing the proposal and reject it because it would not maintain the invariant that each component has a clear role that makes sense independent of the global objective
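The suggester and evaluator loop described above can be sketched as follows. All names here are hypothetical, and the two boolean fields are a deliberate simplification: in a real system, whether a proposal keeps component roles clear would be something the evaluator has to work out from the reasoning, not a flag handed to it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Proposal:
    description: str
    reasoning: str                     # why the suggester thinks this is good
    improves_global_metric: bool       # empirical gain along some overall metric
    keeps_component_roles_clear: bool  # the invariant the evaluator protects


def evaluator_accepts(proposal: Proposal) -> bool:
    """Reject any change that breaks the invariant, even if it helps the global metric."""
    return proposal.keeps_component_roles_clear


def filter_proposals(proposals: List[Proposal]) -> List[Proposal]:
    """Keep only the proposals the evaluator would allow the system to implement."""
    return [p for p in proposals if evaluator_accepts(p)]


# Hypothetical proposals from the action suggester.
proposals = [
    Proposal("Retrain the summarizer on better demonstrations",
             "Improves its local role", True, True),
    Proposal("Tune retriever weights end-to-end on the final score",
             "Empirically raises the overall metric", True, False),
]
accepted = [p.description for p in filter_proposals(proposals)]
```

The second proposal is rejected despite its metric gain, which is the self-regulating behavior the story relies on.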
This story makes the basin of attraction around process-based systems look much narrower than the basin around outcomes: It only applies to individual systems, and it assumes that there is a fairly bright line between components that have a clear role and those that don’t.
The state of the race
Today, process-based systems are ahead: Most systems in the world don’t use much machine learning, and to the extent that they use it, it’s for small, independently meaningful, fairly interpretable steps like predictive search, ranking, or recommendation as part of much larger systems.
However, the history of machine learning is the bitter lesson of outcomes winning. Vision and NLP started with more structured systems, which were replaced with end-to-end systems. In these areas, the structured systems are much worse, and we don’t know how to make them competitive on standard benchmarks. DeepMind and OpenAI have better infrastructure for running RL on outcome-based metrics than for collecting process-based feedback. They tend towards a “research aesthetic” that favors outcome-based approaches even in cases where they work worse.
Overall, it’s up in the air which tasks will be solved in which way. Some parts of the AI community are leaning toward process, others toward outcomes. If we see impressive results from process-based feedback, institutional knowledge and research tastes may shift toward process-based systems. Future norms and laws, perhaps similar to existing algorithmic transparency laws, might strengthen this position.
We don’t need process-based systems to be a perfect attractor. If most systems are largely process-based around the time of transformative AI, with small amounts of outcome-based optimization, we’re likely in good shape.
Conclusion
If we run into trouble with early advanced AI systems, it will likely be clear that supervision of process would be better than supervision of outcomes. At that point, the question is whether we’re good enough at process-based systems that they’re a realistic option. If so, then for the most important and high-stakes use cases, people will likely switch. This requires that we develop the relevant know-how now.
Beyond AI, we view understanding how to build systems and institutions that make correct decisions even when outcomes aren’t available as part of a broader agenda of advancing reason and wisdom in the world. Making mistakes about the long-term consequences of our short-term decisions is one way we fall short of our potential. Making wise decisions in cases where we can’t easily learn from our failures is likely key to living up to it.
Acknowledgments
Thanks to Paul Christiano and Jon Uesato for relevant discussions, and Jon Uesato, Owain Evans, Ben Rachbach, and Luke Stebbing for feedback on a draft.
[1] What “crazy” means:
AI systems are doing most economically valuable tasks in the world. They’re developing, producing, and shipping new products. They’re writing code, running datacenters, and developing new technologies. They’re influencing policy to some extent.
An increasingly large part of the world economy is AI development, more than shows up explicitly because all fields depend on AI now. The AI industry is worth many trillions of dollars.
As more of the world economy depends on AI, the value of further improvements to AI increases. It is hard to scale up human researchers and programmers working on AI. Automation of AI research is one of the most important application areas of AI—rolling out AI in new domains, making existing applications better, improving hardware, software, and data centers.
Much of this activity happens without humans in the loop. It’s a complex economy of AI systems.
This transition to an AI-run economy could be centralized in one or a few firms, or involve many firms, each specializing in different roles. It could take two decades, or five, and the path there could be more or less continuous. Either way, we think it’s likely that the world within our lifetime will look very different from today’s world in ways that will be obvious to everyone.
I don’t think I buy the argument for why process-based optimization would be an attractor. The proposed mechanism—an evaluator maintaining an “invariant that each component has a clear role that makes sense independent of the global objective”—would definitely achieve this, but why would the system maintainers add such an invariant? In any concrete deployment of a process-based system, they would face strong pressure to optimize end-to-end for the outcome metric.
I think the way process-based systems could actually win the race is something closer to “network effects enabled by specialization and modularity”. Let’s say you’re building a robotic arm. You could use a neural network optimized end-to-end to map input images into a vector of desired torques, or you could use a concatenation of a generic vision network and a generic action network, with a common object representation in between. The latter is likely to be much cheaper because the generic network training costs can be amortized across many applications (at least in an economic regime where training cost dominates inference cost). We see a version of this in NLP where nobody outside the big players trains models from scratch, though I’m not sure how to think about fine-tuned models: do they have the safety profile of process-based systems or outcome-based systems?
Optimizing for the outcome metric alone on some training distribution, without any insight into the process producing that outcome, runs the risk that the system won’t behave as desired when out-of-distribution. This is probably a serious concern to the system maintainers, even ignoring (largely externalized) X-risks.
I understand Ivan’s first point. My main concern is that we don’t have the right processes laid out for these models to follow. In the end, we want these models to determine their own process of doing things (if we don’t find a way to emulate human brain processes in machines), and establishing a clear-cut process for tasks could limit the model’s creativity. We would need a perfect model of how each of these NN tasks should be carried out.
However, the idea of combining the two is interesting. As research suggests, backprop and a global update function don’t exist in the brain (although large sections of the brain can carry out orchestrated tasks amazingly well). There must be a combination of local updates to synaptic weights (aligned with specific process-based tasks) that follow some global loss function in the brain. It’d be interesting to get more thoughts on this.
This approach reminds me of the Six Sigma manufacturing philosophy, which was very successful in improving the quality of manufactured products.
Thanks for that pointer. It’s always helpful to have analogies in other domains to take inspiration from.
I’m new to alignment and I’m pretty clueless.
What’s Ought’s take on the “stop publishing all capabilities research” stance that e.g. Yudkowsky is taking in this tweet? https://twitter.com/ESYudkowsky/status/1557184416786423809
I disagree with the absolutism shown here (a common problem with Eliezer Yudkowsky), though I’d probably agree with a weaker version: that capabilities research, absent good reasons, should automatically be treated as negative.
That sounds reasonable! Thanks for the explanation!
It’s not clear to me that as complexity increases, process-based systems are actually easier to reason about, debug, and render safe than outcome-based systems. If you tell me an ML system was optimized for a particular outcome in a particular environment, I can probably predict its behavior and failure modes much better than an equivalently performant human-written system involving 1000s of lines of code. Both types of systems can fail catastrophically with adversarially selected inputs, but it’s probably easier to automatically generate such inputs (and thus, to guard against them) for the ML system.
So it’s still plausible to me that our limited budget of human supervision should be spent on specifying the outcome better, rather than on specifying and improving complex modular processes.