You seem to be thinking of all of these as things-that-can-be-implemented, which I don’t think is exactly right. I think only some of these could be called “approaches to alignment”, where by that I mean something that could someday be implemented and lead to aligned AI.
I’ll consider two different properties:
Implementation: Theoretical (can’t be implemented, takes infinite compute) or Implementable without ML (i.e. with humans; as a result it is very inefficient) or Implementation needs ML (there will still be humans, but the hope is that it will be efficient and competitive with unaligned AI systems).
How learning happens: Task-based (the agent learns to perform a particular task or reason in a particular way) or Reward-based (the agent learns to provide a good reward signal that provides good incentives for some other system). While task-based systems are more elegant and clean when everything works right, we’d expect that in the presence of real-world messiness such as optimization difficulty, reward-based systems will be more robust (see Against Mimicry).
I would only consider the Implementation needs ML things to be “approaches to alignment”. Anyway, here they are (ignoring ALBA for the same reason you do):
Weak HCH: A theoretical ideal where each human can delegate to other agents. We hope that the result is both superintelligent and aligned. Properties: Theoretical / Task-based
Strong HCH: Like weak HCH, but allowing each human to have a dialog with subagents, and allowing message passing to include pointers to other agents. Properties: Theoretical / Task-based
Meta-execution: A particular implementation method that can deal with the fact that some questions may be too “big” for any one agent to even read the full question. Properties: Not really a recursive approach, it’s more a component of other approaches.
Factored cognition: The hypothesis that strong HCH can solve arbitrary tasks. In terms of actual implementation, it’s compute-limited strong HCH / meta-execution. Properties: Implementable without ML / Task-based
Factored evaluation: The hypothesis that strong HCH can provide a reward signal for arbitrary tasks. (Less confident of this one, as it hasn’t been explained in detail before.) In terms of actual implementation, it’s compute-limited strong HCH / meta-execution, where the goal is to provide a reward signal for some task. Properties: Implementable without ML / Reward-based
Iterated amplification (IDA): Approximating strong/weak HCH by training an agent that behaves like depth-limited strong/weak HCH and increasing the effective depth over time. Properties: Implementation needs ML / Task-based
Recursive reward modeling: Approximating strong/weak HCH by training an agent that behaves like depth-limited strong/weak HCH on tasks that are useful for evaluating the task of interest, and on evaluating tasks that are useful for evaluating the task of interest, etc. Properties: Implementation needs ML / Reward-based
Debate: Not really a recursive method at all, but it still depends on the general premise of decomposing thought processes into trees of smaller thoughts (though in this case, the “smaller thoughts” have to be arguments and counterarguments, rather than general considerations). If the Factored cognition hypothesis is false, then debate is unlikely to work.
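To make the recursive structure above concrete, here is a toy sketch of depth-limited HCH (not from any of the posts; all names and the decomposition strategy are illustrative). The “human” policy either answers a question directly or breaks it into subquestions, each delegated to a fresh subtree; the toy task is summing a list, with halving as the decomposition:

```python
# Toy depth-limited HCH. The "human" policy answers a question directly if
# it is simple enough, otherwise decomposes it and delegates each piece to
# a subagent (tree recursion). Names are illustrative, not from the source.

def human_policy(question, ask):
    nums = question  # for this toy, a "question" is just a list to sum
    if len(nums) <= 2 or ask is None:
        return sum(nums)          # simple enough: answer directly
    mid = len(nums) // 2
    # delegate the two halves to subagents
    return ask(nums[:mid]) + ask(nums[mid:])

def hch(question, depth):
    if depth == 0:
        return human_policy(question, ask=None)  # leaf: no delegation left
    return human_policy(question, ask=lambda q: hch(q, depth - 1))

print(hch(list(range(10)), depth=3))  # 45
```

The infinite-depth limit of this recursion is the theoretical ideal; the ML-based approaches above can be seen as different ways of approximating it with a trained model instead of an actual tree of humans.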
Thanks! I found this answer really useful.
I have some follow-up questions that I’m hoping you can answer:
I didn’t realize that weak HCH uses linear recursion. On the original HCH post (which is talking about weak HCH), Paul talks in comments about “branching factor”, and Vaniver says things like “So he asks HCH to separately solve A, B, and C”. Are Paul/Vaniver talking about strong HCH here, or am I wrong to think that branching implies tree recursion? If Paul/Vaniver are talking about weak HCH, and branching does imply tree recursion, then it seems like weak HCH must be using tree recursion rather than linear recursion.
Your answer didn’t confirm or deny whether the agents in HCH are human-level or superhuman. I’m guessing it’s the former, in which case I’m confused about how IDA and recursive reward modeling are approximating strong HCH, since in these approaches the agents are eventually superhuman (so they could solve some problems in ways that HCH can’t, or solve problems that HCH can’t solve at all).
You write that meta-execution is “more a component of other approaches”, but Paul says “Meta-execution is annotated functional programming + strong HCH + a level of indirection”, which makes it sound like meta-execution is a specific implementation rather than a component that plugs into other approaches (whereas annotated functional programming does seem like a component that can plug into other approaches). Were you talking about annotated functional programming here? If not, how is meta-execution used in other approaches?
I’m confused that you say IDA is task-based rather than reward-based. My understanding was that IDA can be task-based or reward-based depending on the learning method used during the distillation process. This discussion thread seems to imply that recursive reward modeling is an instance of IDA. Am I missing something, or were you restricting attention to a specific kind of IDA (like imitation-based IDA)?
ETA: This paragraph is wrong, see rest of comment thread. I’m pretty sure they’re talking about strong HCH. At the time those comments were written (and now), HCH basically always referred to strong HCH, and it was rare for people to talk about weak HCH. (This is mentioned at the top of the strong HCH post.)
That’s correct.
There are two different meanings of the word “agent” here. First, there’s the BaseAgent, the one we start with, which is already human-level (both HCH and IDA assume access to such a BaseAgent). Then, there’s the PowerfulAgent, which is the system as a whole, which can be superhuman. With HCH, the entire infinite tree of BaseAgents together makes up the PowerfulAgent. With IDA / recursive reward modeling, either the neural net (output of distillation) or the one-level tree (output of amplification) could be considered the PowerfulAgent. Initially, the PowerfulAgent is only as capable as the BaseAgent, but as training progresses, it becomes more capable, reaching the HCH-PowerfulAgent in the limit.
For all methods, the BaseAgent is human level, and the PowerfulAgent is (eventually) superhuman.
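The amplify-then-distill loop that grows the effective depth over time can be sketched with a toy task (this is my own illustration, not code from any of the posts; all names are hypothetical). Here the task is computing factorials: the “human” overseer can only multiply and knows the base case, the current agent answers the delegated subquestion, and distillation is stood in for by memoization:

```python
# Toy iterated-amplification loop. Each round, amplification (overseer +
# current agent) reaches one level deeper into the recursion, and
# distillation caches that behaviour into the next agent. A real system
# would train an ML model here; memoization is a stand-in.

def base_agent(n):
    return 1 if n == 0 else None   # human-level: only knows the base case

def amplify(agent, n):
    # one step of oversight: the overseer delegates n-1 to the agent
    if n == 0:
        return 1
    sub = agent(n - 1)
    return None if sub is None else n * sub

def distill(agent, max_n):
    # stand-in for ML distillation: memoize the amplified behaviour
    table = {n: amplify(agent, n) for n in range(max_n + 1)}
    return table.get

agent = base_agent
for _ in range(6):                 # effective depth grows by one per round
    agent = distill(agent, max_n=6)
print(agent(5))  # 120
```

After k rounds the distilled agent matches the depth-k tree, which is the sense in which the PowerfulAgent approaches the HCH-PowerfulAgent in the limit.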
Yes, my bad. (Technically, I was talking about both the annotated functional programming and the level of indirection, I think.)
I was restricting attention to imitation-based IDA, which is the canonical example, and what various explainer articles are explaining.
Thanks. It looks like all the realistic examples I had of weak HCH are actually examples of strong HCH after all, so I’m looking for some examples of weak HCH to help my understanding. I can see how weak HCH would compute the answer to a “naturally linear recursive” problem (like computing factorials) but how would weak HCH answer a question like “Should I get laser eye surgery?” (to take an example from here). The natural way to decompose a problem like this seems to use branching.
Also, I just looked again at Alex Zhu’s FAQ for Paul’s agenda, and Alex’s explanation of weak HCH (in section 2.2.1) seems to imply that it is doing tree recursion (e.g. “I sometimes picture this as an infinite tree of humans-in-boxes, who can break down questions and pass them to other humans-in-boxes”). It seems like either you or Alex must be mistaken here, but I have no idea which.
I just looked at both that and the original strong HCH post, and I think Alex’s explanation is right and I was mistaken: both weak and strong HCH use tree recursion. Edited the original answer to not talk about linear / tree recursion.
I found this immensely helpful overall, thank you!
However, I’m still somewhat confused by meta-execution. Is it essentially a more sophisticated capability amplification strategy that replaces the role filled by “deliberation” in Christiano’s IA paper?
I maybe think so? I’m not sure about meta-execution (as the comment thread above shows).
Huh, what would you recommend I do to reduce my uncertainty around meta-execution (e.g. “read x”, “ask about it as a top level question”, etc)?
I’m not sure how relevant meta-execution is any more, I haven’t seen it discussed much recently. So probably you’d want to ask Paul, or someone else who was around earlier than I was.