What I’d do differently now:
I’d talk about RL instead of imitation learning when I describe the distillation step. Imitation learning is easier to explain, but ultimately you probably need RL to be competitive.
I’d be more careful when I talk about internal supervision. The presentation mixes up three related ideas:
(1) Approval-directed agents: We train an ML agent to interact with an external, human-comprehensible workspace using steps that an (augmented) expert would approve.
(2) Distillation: We train an ML agent to implement a function from questions to answers based on demonstrations (or incentives) provided by a large tree of experts, each of which takes a small step. The trained agent is a big neural net that only replicates the tree’s input-output behavior, not individual reasoning steps. Imitating the steps directly wouldn’t be possible since the tree would likely be exponentially large and so has to remain implicit. (A toy code sketch of this step follows right after this list.)
(3) Transparency: When we distill, we want to verify that the behavior of the distilled agent is a faithful instantiation of the behavior demonstrated (or incentivized) by the overseer. To do this, we might use approaches to neural net interpretability.
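To make the distillation step in (2) concrete, here is a minimal toy sketch of my own (not anything from the iterated amplification literature): a recursive tree of small expert steps answers each root question, but the dataset handed to the learner keeps only the root-level question/answer pairs, so the distilled agent reproduces the tree’s input-output behavior without ever seeing the intermediate steps. All names here (`expert_step`, `DistilledAgent`, the toy summation domain) are made up for illustration.

```python
# Toy sketch of distillation from a decomposition tree (illustrative only).
# The expert tree runs in full, but the learner sees only root-level (Q, A) pairs.

from typing import List, Tuple
import random


def expert_step(xs: List[int]) -> int:
    """One small expert step: answer a question by decomposing it.

    Base case: a single number is answered directly.
    Recursive case: split the question, ask two sub-experts, combine.
    """
    if len(xs) == 1:
        return xs[0]                      # leaf: answer the subquestion directly
    mid = len(xs) // 2
    left = expert_step(xs[:mid])          # subquestion 1
    right = expert_step(xs[mid:])         # subquestion 2
    return left + right                   # combine the sub-answers


def generate_root_pairs(n_examples: int) -> List[Tuple[Tuple[int, ...], int]]:
    """Run the expert tree on many root questions, keep only root-level (Q, A) pairs."""
    pairs = []
    for _ in range(n_examples):
        question = tuple(random.randint(0, 9) for _ in range(8))
        answer = expert_step(list(question))   # the whole tree runs here...
        pairs.append((question, answer))       # ...but only the root I/O is recorded
    return pairs


class DistilledAgent:
    """Stand-in for a neural net trained to imitate the tree's root-level behavior."""

    def __init__(self):
        self.table = {}

    def train(self, pairs):
        # A real system would fit a function approximator; a lookup table is enough
        # to show that only root-level input-output behavior is being copied.
        for question, answer in pairs:
            self.table[question] = answer

    def answer(self, question):
        return self.table.get(question, 0)     # no generalization in this toy


if __name__ == "__main__":
    data = generate_root_pairs(1000)
    agent = DistilledAgent()
    agent.train(data)
    q, a = data[0]
    print(q, "->", agent.answer(q), "tree answer:", a)
```

The lookup table is only there to emphasize what the training signal contains; nothing in the argument depends on how the function approximator is implemented.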
I’d be more precise about what the term “factored cognition” refers to. Factored cognition refers to the research question of whether (and how) complex cognitive tasks can be decomposed into relatively small, semantically meaningful pieces. This is relevant to alignment, but it’s not an approach to alignment on its own. If factored cognition is possible, you’d still need a story for leveraging it to train aligned agents (such as the other ingredients of the iterated amplification program), and it’s of interest outside of alignment as well (e.g. for building tools that let us delegate cognitive work to other people).
I’d hint at why you might not need an unreasonably large amount of curated training data for this approach to work. When human experts do decompositions, they are effectively specifying problem solving algorithms, which can then be applied to very large external data sets in order to generate subquestions and answers that the ML system can be trained on. (Additionally, we could pretrain on a different problem, e.g. natural language prediction.) A second toy sketch, after the next point, illustrates this data-generation idea.
I’d highlight that there’s a bit of a sleight of hand going on with the decomposition examples. I show relatively object-level problem decompositions (e.g. Fermi estimation), but in the long run, for scaling to problems that are far beyond what the human overseer could tackle on their own, you’re effectively specifying general algorithms for learning and reasoning with concepts, which seems harder to get right.
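On the training-data point two paragraphs up: the sketch below (again a toy of my own, reusing the made-up summation domain from the sketch above) shows how a single human-specified decomposition strategy, applied to a large pool of externally supplied root questions, yields many subquestion/answer pairs automatically, so the hand-curated ingredient is the strategy rather than the individual examples. `decompose_and_log` is a hypothetical name.

```python
# Toy continuation of the sketch above: the human specifies the decomposition
# strategy once; applying it to many root questions then generates
# subquestion/answer training pairs automatically.

from typing import List, Tuple
import random


def decompose_and_log(xs: List[int],
                      log: List[Tuple[Tuple[int, ...], int]]) -> int:
    """Same expert strategy as before, but every subquestion it creates is
    recorded as a potential training example."""
    if len(xs) == 1:
        answer = xs[0]
    else:
        mid = len(xs) // 2
        answer = decompose_and_log(xs[:mid], log) + decompose_and_log(xs[mid:], log)
    log.append((tuple(xs), answer))   # every node of the tree becomes a (Q, A) pair
    return answer


if __name__ == "__main__":
    training_pairs: List[Tuple[Tuple[int, ...], int]] = []
    # One decomposition strategy, many externally supplied root questions:
    for _ in range(1000):
        root = [random.randint(0, 9) for _ in range(8)]
        decompose_and_log(root, training_pairs)
    # 15 tree nodes per length-8 root question, so 1000 roots -> 15,000 examples.
    print(len(training_pairs), "training pairs from 1000 root questions")
```

In a real setting the root questions would come from some large external corpus and the subanswers from the overseer tree, but the multiplier effect is the same.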
“Distillation: We train an ML agent to implement a function from questions to answers based on demonstrations (or incentives) provided by a large tree of experts […]. The trained agent […] only replicates the tree’s input-output behavior, not individual reasoning steps.”
Why do we decompose in the first place? If the training data for the next agent consists only of root questions and root answers, it doesn’t matter whether they represent the tree’s input-output behaviour or the input-output behaviour of a small group of experts who reason in the normal human high-context, high-bandwidth way. The latter is certainly more efficient.
There seems to be a circular problem, and I don’t understand how it is not circular or where my understanding goes astray: We want to teach an ML agent aligned reasoning. This is difficult if the training data consists of high-level questions and answers. So instead we write down explicitly, in small steps, how we reason.
Some tasks are hard to write down in small steps. In these cases we write down a naive decomposition that takes exponential time. A real-world agent can’t use this to reason, because it would be too slow. To work around this, we train a higher-level agent on just the input-output behaviour of the slower agent. Now the training data consists of high-level questions and answers. But this is what we wanted to avoid, and the reason we started writing down small steps in the first place.
Decomposition makes sense to me in the high-bandwidth setting where the task is too difficult for a human, so the human only divides it and combines the sub-results. I don’t see the point of decomposing a human-answerable question into even smaller low-bandwidth subquestions if we then throw away the tree and train an agent on the top-level question and answer.
The first section of “Ought: why it matters and ways to help” answers the question. It’s also a good update on this post in general.