Planned summary for the Alignment Newsletter:
The <@Factored Cognition Hypothesis@>(@Factored Cognition@) informally states that any task can be performed by recursively decomposing it into smaller and smaller subtasks, until eventually the smallest subtasks can be done by a human. This sequence aims to formalize the hypothesis to the point that it can be used to argue for the outer alignment of (idealized versions of) <@iterated amplification@>(@Supervising strong learners by amplifying weak experts@) and <@debate@>(@AI safety via debate@).
The key concept is that of an _explanation_ or _decomposition_. An explanation for some statement **s** is a list of other statements **s1, s2, … sn** along with the statement “(**s1** and **s2** and … and **sn**) implies **s**”. A _debate tree_ is a tree in which for a given node **n** with statement **s**, the children of **n** form an explanation (decomposition) of **s**. The leaves of the tree should be statements that the human can verify. (Note that the full formalism has significantly more detail, e.g. a concept of the “difficulty” for the human to verify any given statement.)
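To make the tree structure concrete, here is a minimal sketch of a debate tree in Python. This is my own illustration rather than the sequence's formalism (which, as noted, also tracks how difficult each statement is for the human to verify), and the `Node` and `explain` names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A node in a debate tree: if `children` is non-empty, they form an
    explanation (decomposition) of `statement`; if it is empty, the
    statement is a leaf that the human is expected to verify directly."""
    statement: str
    children: List["Node"] = field(default_factory=list)

def explain(statement: str, parts: List[str]) -> List[Node]:
    """Build the children for `statement` from sub-statements `parts`:
    per the definition above, the explanation is the sub-statements plus
    the implication "(s1 and ... and sn) implies s"."""
    implication = f"({' and '.join(parts)}) implies {statement}"
    return [Node(p) for p in parts] + [Node(implication)]
```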
We can then define an idealized version of debate, in which the first debater must produce an answer with associated explanation, and the second debater can choose any particular statement to expand further. The judge decides the winner by evaluating whether the final statement is true or not. Assuming optimal play, the correct (honest) answer is an equilibrium as long as:
**Ideal Debate Factored Cognition Hypothesis:** For every question, there exists a debate tree for the correct answer where every leaf can be verified by the judge.
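Continuing the sketch above, this condition corresponds to a simple recursive check: under optimal play the second debater descends into whichever statement it thinks the judge will reject, so the honest answer survives exactly when every leaf of its tree passes the judge. The hypothesis is then that such a tree exists for every question. The `judge_verifies` callback is a hypothetical stand-in for the human judge.

```python
from typing import Callable

def honest_answer_wins(node: Node, judge_verifies: Callable[[str], bool]) -> bool:
    """Idealized debate on one fixed debate tree: returns whether the first
    (honest) debater wins when defending this tree against an optimal
    challenger, given that the judge only evaluates the final statement."""
    if not node.children:
        # The debate has bottomed out: the judge checks this statement directly.
        return judge_verifies(node.statement)
    # Optimal play: the challenger expands any child the judge would reject,
    # so the defender survives only if *all* children survive.
    return all(honest_answer_wins(child, judge_verifies)
               for child in node.children)
```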
The idealized form of iterated amplification is <@HCH@>(@Humans Consulting HCH@); the corresponding Factored Cognition Hypothesis is simply “For every question, HCH returns the correct answer”. Note that the _existence_ of a debate tree is not enough to guarantee this, since HCH must also _find_ the decompositions in that debate tree. If we imagine that HCH gets access to a decomposition oracle that tells it the right decomposition to make at each node, then HCH becomes similar to idealized debate. (HCH could of course simply try all possible decompositions, but we are ignoring that possibility: the decompositions we rely on should reduce or hide complexity.)
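For contrast, here is an equally rough sketch of depth-limited HCH. The `human_answer` and `propose_subquestions` callbacks are hypothetical stand-ins for the human's judgment; the point is that `propose_subquestions` is where the decompositions have to be found, and replacing it with the decomposition oracle described above is what would make HCH look like idealized debate on an existing tree.

```python
def hch(question: str, depth: int, human_answer, propose_subquestions) -> str:
    """Depth-limited HCH sketch: a human answers `question`, consulting
    copies of HCH (one level shallower) on subquestions they choose to ask."""
    if depth == 0:
        # A lone base human with no one left to consult.
        return human_answer(question, {})
    # The crux: the human must come up with a good decomposition here.
    subquestions = propose_subquestions(question)
    sub_answers = {q: hch(q, depth - 1, human_answer, propose_subquestions)
                   for q in subquestions}
    return human_answer(question, sub_answers)
```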
Is the HCH version of the Factored Cognition Hypothesis true? The author leans against it (more specifically, expecting that HCH would not be superintelligent), because it seems hard for HCH to find good decompositions. In particular, humans seem to improve their decompositions over time as they learn more, and also to improve the concepts they think with, both of which are hard for HCH to do.
Planned opinion:
I enjoyed this sequence: I’m glad to see more analysis of what is and isn’t necessary for iterated amplification and debate to work, as well as more theoretical models of debate. I broadly agreed with the conceptual points made, with one exception: I’m not convinced that we should rule out brute force for HCH, and for similar reasons I don’t find the arguments that HCH won’t be superintelligent convincing. In particular, the hope with iterated amplification is to approximate a truly massive tree of humans, perhaps a tree containing around 2^100 (about 1e30) base agents / humans. At that scale (or even at just a measly billion (1e9) humans), I don’t expect the reasoning to look anything like what an individual human does, and approaches that are more like “brute force” seem a lot more feasible.
One might wonder why I think it is possible to approximate a tree with more base agents than there are grains of sand in the Sahara desert. Well, a perfect binary tree of depth 99 would have about 1e30 nodes; thus we can roughly say that we’re approximating 99-depth-limited HCH. If we had perfect distillation, this would take 99 rounds of iterated amplification and distillation, which seems quite reasonable. Of course, we don’t have perfect distillation, but I expect that to be a relatively small constant factor on top (say 100x), which still seems pretty reasonable. (There’s more detail about how we get this implicit exponential-time computation in <@this post@>(@Factored Cognition@).)
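For concreteness, the arithmetic behind these numbers (my own back-of-the-envelope check, using the convention that a perfect binary tree of depth $d$ has $2^{d+1} - 1$ nodes):

$$2^{100} - 1 \approx 1.27 \times 10^{30} \text{ nodes at depth } 99,$$

and since each round of amplification-and-distillation adds one level of depth, perfect distillation at each level gives the 99 rounds mentioned above.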
This is an accurate summary, minus one detail:
> The judge decides the winner by evaluating whether the final statement is true or not.
“True or not” makes it sound symmetrical, but the choice is between ‘very confident that it’s true’ and ‘anything else’. Something like ‘80% confident’ goes into the second category.
One thing I would like to be added is just that I come out moderately optimistic about Debate. It’s not too difficult for me to imagine the counterfactual world where I think about FC and find reasons to be pessimistic about Debate, so I take the fact that I didn’t as non-zero evidence.
Changed to “The judge decides the winner based on whether they can confidently verify the final statement or not.”
Added a line to the end of the summary:
> On the other hand, the author is cautiously optimistic about debate.
Cool, thanks.
Re personal opinion: what is your take on the feasibility of human experiments? It seems like your model is compatible with IDA working out even though no one can ever demonstrate something like ‘solve the hardest exercise in a textbook’ using participants with limited time who haven’t read the book.
Yeah, that seems right to me—I don’t really expect to see us solving hard exercises in a textbook with a small number of humans without any additional tricks. I don’t think Ought did either; from pretty early on they were talking about strategies for having larger trees, e.g. via automated decomposition strategies, or caching / memoization of strategies, possibly using ML.
In addition, I think Ought historically has pursued the strategy “try the thing that, if successful, would allow us to build a safety story”, rather than “try the thing that, if it fails, implies that factored cognition would not work out”, which is why they talk about particularly challenging tasks like solving the hardest exercise in a textbook.