Solving Math Problems by Relay

From September to November 2018 we ran an experiment where people did programming in relay. Each player spent ten minutes on a programming problem before passing on their code to a new player who had not seen any of the previous work. We found that people were able to solve some problems using the relay approach, but that the approach was less efficient than having a single person work on their own. This project explored hypotheses around Factored Cognition, testing whether people can solve problems by decomposing them into self-contained sub-problems.

Since this was an “explorative” experiment it wasn’t a priority to write up, though we are excited to have gotten around to it now and hope this both informs and inspires other experiments.

Introduction

Factored cognition research investigates ways of accomplishing complex tasks by decomposing them into smaller sub-tasks. Task decomposition is not a new idea: it’s widely recognized as fundamental to modern economies. People have worked out ways to decompose complex tasks (e.g. create an electric car) into smaller sub-tasks (create an engine and battery, design steering controls, test the car for safety) which in turn are broken down into yet smaller tasks, and so on. The smallest sub-tasks are carried out by individual humans, who may have a limited understanding of how their work relates to the original task.

The focus is on the decomposition of cognitive tasks, where the goal is to provide information or to answer a question. Cognitive tasks include solving a mathematics problem, analyzing a dataset, or summarizing a story.

Factored Cognition research explores whether complex cognitive tasks can be solved through recursive decomposition into self-contained, interpretable sub-tasks that can be solved more easily than the originally task.[1]

Sub-tasks are “self-contained” if they are solvable without knowledge about the broader context of the task. If the task is a hard physics problem, then self-contained sub-tasks would be solvable for someone who hasn’t seen the physics problem (and need not know the task is about physics). This differs from most real-world examples of collaborative problem solving, where everyone in the team knows what task is being solved.

Testing the Factored Cognition Hypothesis involves the following steps:

  • Finding cognitive tasks that seem costly or difficult to solve directly (e.g. because normally one person would spend days or weeks on the same problem rather than 10 minutes).

  • Generating a high-level strategy that would plausibly decompose the task.

  • Testing whether the strategy from (2) works by having a group of people solve (1) under controlled conditions.

For some ideas on high-level strategies for a range of problems, see Ought’s Factored Cognition slides and the paper Supervising strong learners by amplifying weak experts (Appendix B).

In the Relay Game participants worked on a task sequentially, with each person having ten minutes to help solve the problem before handing off to the next person (see Figure 1). This is similar to real-world situations where one person quits a project and someone else takes over. However, the big difference comes from the ten-minute time limit. If the task is complex, it might take ten minutes just to read and understand the task description. This means that most players in the relay won’t have time to both understand the task and make a useful contribution. Instead players must solve sub-tasks that previous people have constructed.


Figure 1: In the Relay Approach (left), each person works on a programming problem for a fixed number of minutes before passing over their notes and code to the next person. Eventually someone in the chain completes the problem. This contrasts with the usual approach (right), where a single person works for an extended period. Note: Our experiments had a time limit of 10 minutes per person (vs. 1 minute in the illustration)

We tested the Relay Game on programming problems from Project Euler. Here is a simple example problem:

How many different ways can one hundred be written as a sum of at least two positive integers?

Solving these problems requires both mathematical insight and a working implementation in code. The problems would take 20-90 minutes for one of our players working alone and we expected the relay approach to be substantially slower.


Experiment Design

Players worked on a shared Google doc (for notes) and code editor (see Figure 2). The first player receives only the Project Euler problem and begins making notes in the doc and writing code. After ten minutes, the second player takes over the doc and code editor. The Relay ends when a player computes the correct answer to the problem, which can be automatically verified at Project Euler.

Figure 2

We had 103 programmers volunteer to play Relay. They started 40 questions in total but only 25 had relay chains of more than 5 people. The total amount of work was 48 hours and only four questions were successfully solved. See Table 1 for a breakdown of a subset of the 40 questions. (Note: Most of the 103 players only worked on a few problems. Since much of the work was done by a smaller number of players, they were spread thinly over the 40 problems—as each player spends just 10 minutes on a problem.)

Table of Relay Game Problems (10 of 40)

Table 1: The total time spent on each problem for Relay. Note that most problems were not solved by Relay and so would take longer to actually solve. (So solo vs. Relay cannot be directly compared).

Can we conclude from Table 1 that Relay is much less efficient than the usual way of working on problems? Not really. It could be that Relay players would get better with experience by developing strategies for decomposing problems and coordinating work. So the main lesson from this experiment is that Relay with inexperienced players is probably less efficient at Project Euler problems. (We say “probably” because we did not conduct a rigorous comparison of Relay vs the usual way of solving problems).


Clickthrough Examples

We are interested in general failure modes for Factored Cognition with humans and in strategies for avoiding them. Our Relay experiment is a first step in this direction. We exhibit concrete examples from our Relay experiment that are suggestive of pitfalls and good practices for Factored Cognition.

Here are three “click-throughs”, which show how ideas and code evolved for particular Project Euler problems.

Prize Strings

In these three attempts on Prize Strings the players quickly build on players previous work and get the correct answer. Clickthrough

Empty Chairs

Empty Chairs was not solved but significant progress was made (with seven people contributing). The clickthrough demonstrates iterative improvements to a math heavy solution. Clickthrough

A Scoop of Blancmange

There were 13 unique attempts on A Scoop of Blancmange. While technically unsolved, the answer was off by only 1e-8. Only attempts that changed the state of the problem are shown. Clickthrough


Adding Meta-data to Notes

Relay players worked on the mathematical part of the Project Euler problems by writing notes in a Google Doc. A tactic that emerged organically in early rounds was to label contributions with meta-data with specific formatting (ex.using square brackets, strikethroughs). The meta-data was intended to provide a quick way for future players to decide which parts of the Google doc to read and which to ignore.

Summarizing Next Actions

For several problems the Google doc became messy and probably made it difficult for new players to orient themselves. An example is the problem Prime Triples Geometric Sequence, shown here mid-round, where some of the subsequent rounds were spent cleaning up these notes and formulating clear next steps.

Costly Bugs

The problem with the longest Relay chain was Pandigital Step Numbers with a chain of 41 players. While substantial progress was made on the problem, there was a one-line bug in the code implementation that persisted to the last player in the chain. Given the size and complexity of the problem, it was difficult for players to locate the bug in only ten minutes.

Figure 3. Code window for Pandigital Step Numbers problem. The highlighted line contains the bug that probably contributed to the failure of a 41-person Relay to solve the problem. The code incorrectly sets the first digit of the number in the bitmask as “False”.

Discussion

How does Relay relate to other research on Factored Cognition with humans? The Relay experiment had two key features:

  • We used existing collaboration tools (Google docs and web-based interpreter) rather than specialized tools for Factored Cognition.

  • Participants worked in sequence, building on the work of all previous participants.

In 2019 Ought ran experiments with specialized software called Mosaic. Mosaic facilitates tree-structured decompositions of tasks. The overall task is divided into sub-tasks which can be solved independently of each other, and users only see a sub-task (i.e. node) and not the rest of the tree. If this kind of decomposition turns out to be important in Factored Cognition, then the Relay setup will be less relevant.

The Relay experiment was exploratory and we decided not to continue working on it for now. Nevertheless we would be interested to hear about related work or to collaborate on research related to Relay.

Acknowledgements

Ben Goldhaber led this project with input from Owain Evans and the Ought team. BG created the software for Relay and oversaw experiments. BG and OE wrote the blogpost. We thank everyone who participated in Relay games, especially the teams at OpenAI and Ought. The original idea for Relay came from Buck Shlegeris.

[1]: This is a loose formulation of the hypothesis intended to get the basic idea across. For a discussion of how to make this kind of hypothesis precise, see Paul Christiano’s Universality post

Crowdsourcing:

The idea of Factored Cognition is superficially similar to crowdsourcing. Two important differences are:

  1. In crowdsourcing on Mechanical Turk, the task decomposition is usually fixed ahead of time and does not have to be done by the crowd.

  2. In crowdsourcing for Wikipedia (and other collaborative projects), contributors can spend much more than ten minutes building expertise in a particular task (Wikipedians who edit a page might spend years building expertise on the topic).

For more discussion of differences between Factored Cognition and crowdsourcing and how they relate to AI alignment, see Ought’s Factored Cognition slides and William Saunders’ blogpost on the subject. Despite these differences, crowdsourcing is useful source of evidence and insights for Relay and Factored Cognition. See Reinventing Discovery (Michael Nielsen) for an overview for crowdsourcing for science. Three crowdsourced projects especially relevant to Relay are:

  • The Polymath Project was an example of leveraging internet scale collaboration to solve research problems in mathematics. Starting in 2009, Timothy Gowers posted a challenging problem to his blog and asked other mathematicians and collaborators to help push it forward to a solution. This was similar to the type of distributed problem solving we aimed for with the Relay Game, with the major difference being in the Relay game there is a ten minute time limit, so a player can’t keep working on a problem.

  • The MathWorks Competition is a mathematics modeling competition for high school students. When a student submits an answer to a problem, the code for that solution is immediately made publicly available. The fastest solutions are then often improved upon by other students, and resubmitted.

  • Microtask Programming is a project aiming to apply crowdsourcing techniques to software development. The project provides a development environment where software engineers can collaborate to complete small self-contained microtasks that are automatically generated by the system.

Transmission of information under constraints:

There is also academic research on problem solving under constraints somewhat similar to Relay.