(I’m going to respond here to two different comments about HCH and why bureaucracies fail.)
I think a major reason why people are optimistic about HCH is that they’re confused about why bureaucracies fail.
Responding to Chris: if you go look at real bureaucracies, it is not really the case that “at each level the bosses tell the subordinates what to do and they just have to do it”. At every bureaucracy I’ve worked in/around, lower-level decision makers had many de facto degrees of freedom. You can think of this as a generalization of one of the central problems of jurisprudence: in practice, human “bosses” (or legislatures, in the jurisprudence case) are not able to give instructions which unambiguously specify what to do in all the crazy situations which come up in practice. Nor do people at the top have anywhere near the bandwidth needed to decide every ambiguous case themselves; there is far too much ambiguity in the world. So, in practice, lower-level people (i.e. judges at various levels) necessarily make many many judgement calls in the course of their work.
Also, in general, tons of information flows back up the hierarchy for higher-level people to make decisions. There are already bureaucracies whose purpose is very similar to HCH: they exist to support the decision-making of the person at the top. (Government intelligence is a good example.) To my knowledge/experience, such HCH-like bureaucracies are not any less dysfunctional than others, nor do normal bureaucracies behave less dysfunctionally than usual when passing information up to a high-level decision maker.
Responding to Joe: if you go look at real bureaucracies, most people working in them are generally well-meaning and trying to help. There is still a sense in which incentives are a limiting factor: good incentives are information-carriers in their own right (like e.g. prices), and I’ll link below to arguments that information-transmission is the problem. But incentives are not the problem in a way which can be fixed just by having everyone share some non-selfish values.
So why do bureaucracies (and large organizations more generally) fail so badly?
My main model for this is that interfaces are a scarce resource. Or, to phrase it in a way more obviously relevant to factorization: it is empirically hard for humans to find good factorizations of problems which have not already been found. Interfaces which neatly split problems are not an abundant resource (at least relative to humans’ abilities to find/build such interfaces). If you can solve that problem well, robustly and at scale, then there’s an awful lot of money to be made.
Also, one major sub-bottleneck (though not the only sub-bottleneck) of interface scarcity is that it’s hard to tell who has done a good job on a domain-specific problem/question without already having some domain-specific background knowledge. This also applies at a more “micro” level: it’s hard to tell whose answers are best without knowing lots of context oneself.
I should also mention: these models came out of me working in/around bureaucratic organizations as they were trying to scale up. I wanted to generally understand the causes of various specific instances of dysfunction. So they are based largely on first-hand knowledge.
“At every bureaucracy I’ve worked in/around, lower-level decision makers had many de facto degrees of freedom.”—I wasn’t disputing this—just claiming that they had to work within the constraints of the higher-level boss.
It’s interesting to hear the rest of your model, though.
Thanks for the elaboration. I agree with most/all of this.
However, for a capable, well-calibrated, cautious H, it mostly seems to argue that HCH won’t be efficient, not that it won’t be capable and something-like-aligned.
Since the HCH structure itself isn’t intended to be efficient, this doesn’t seem too significant to me. In particular, the bureaucracy analogy seems to miss that HCH can spend >99% of its time on robustness. (this might look more like science: many parallel teams trying different approaches, critiquing each other and failing more often than succeeding)
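To make the "spend almost all the time on robustness" picture a bit more concrete, here is a toy Python sketch, purely illustrative: `attempt_answer` and `critique` are hypothetical stand-ins for whatever the H-teams actually do. The shape it shows is many independent attempts plus cross-critique, with only a small slice of effort going into any single answer path.

```python
# Toy illustration only: robustness via many parallel attempts plus cross-critique.
# `attempt_answer` and `critique` are hypothetical stand-ins for the H-teams' work.
from typing import Callable, List

def robust_answer(
    question: str,
    attempt_answer: Callable[[str, int], str],   # one team's independent attempt
    critique: Callable[[str, str, int], bool],   # True if this critic finds a flaw
    n_teams: int = 20,
) -> str:
    # Most of the budget goes into redundancy: many teams try the question independently.
    candidates: List[str] = [attempt_answer(question, seed) for seed in range(n_teams)]

    # Each candidate is then critiqued by every team; its score is the number of
    # critics who fail to find a problem with it.
    def score(answer: str) -> int:
        return sum(not critique(question, answer, critic) for critic in range(n_teams))

    # Keep whichever answer survives the most critique (failing often is expected).
    return max(candidates, key=score)
```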
I’m not sure whether you’re claiming:
1. That an arbitrarily robustness-focused HCH would tend to be incorrect/overconfident/misaligned. (where H might be a team including e.g. you, Eliezer, Paul, Wei Dai, [other people you’d want]...)
2. That any limits-to-HCH system we train would need to make a robustness/training-efficiency trade-off, and that the levels of caution/redundancy/red-teaming… required to achieve robustness would make training uncompetitive.
   Worth noting here that this only needs to be a constant multiplier on human training time—once you’re distilling or similar, there’s no exponential cost increase. (granted distillation has its own issues)
3. Something else.
To me (2) seems much more plausible than (1), so a perils-of-bureaucracy argument seems more reasonably aimed at IDA etc than at HCH.
I should emphasize that it’s not clear to me that HCH could solve every kind of problem. I just don’t see strong reasons to expect [wrong/misaligned answer] over [acknowledgement of limitations, and somewhat helpful meta-suggestions] (assuming HCH decides to answer the question).
This is a capability thing, not just an efficiency thing. If, for instance, I lack enough context to distinguish real expertise from prestigious fakery in some area, then I very likely also lack enough context to distinguish those who do have enough context from those who don’t (and so on up the meta-ladder). It’s a bottleneck which fundamentally cannot be circumvented by outsourcing cognitive labor.
Similarly, if the interface at the very top level does not successfully convey what I want those one step down to do, then there’s no error-correction mechanism for that; there’s no way to ground out the top-level question anywhere other than the top-level person. Again, it’s a bottleneck which fundamentally cannot be circumvented by outsourcing cognitive labor.
Orthogonal to the “some kinds of cognitive labor cannot be outsourced” problem, there’s also the issue that HCH can only spend >99% of its time on robustness if the person being amplified decides to do so, and then the person being amplified needs to figure out the very difficult problem of how to make all that robustness-effort actually useful. HCH could do all sorts of things if the H in question were already superintelligent, could perfectly factor problems, knew exactly the right questions to ask, knew how to deploy lots of copies in such a way that no key pieces fell through the cracks, etc. But actual humans are not perfectly-ideal tool operators who don’t miss anything or make any mistakes, and actual humans are also not super-competent managers capable of extracting highly robust performance on complex tasks from giant bureaucracies. Heck, it’s a difficult and rare skill just to get robust performance on simple tasks from giant bureaucracies.
In general, if HCH requires some additional assumption that the person being amplified is smart enough to do X, then that should be baked into the whole plan from the start so that we can evaluate it properly. Like, if every time someone says “HCH has problem Y” the answer is “well the humans can just do X”, for many different values of Y and X, then that implies there’s some giant unstated list of things the humans need to do in order for HCH to actually work. If we’re going to rely on the scheme actually working, then we need that whole list in advance, not just some vague hope that the humans operating HCH will figure it all out when the time comes. Humans do not, in practice, reliably ask all the right questions on-the-fly.
And if your answer to that is “ok, the first thing for the HCH operator to do is spin up a bunch of independent HCH instances and ask them what questions we need to ask...” then I want to know why we should expect that to actually generate a list containing all the questions we need to ask. Are we assuming that those subinstances will first ask their subsubinstances (what questions the subinstances need to ask in order to determine (what questions the top instance needs to ask))? Where does that recursion terminate, and when it does terminate, how does the thing it’s terminating on actually end up producing a list which doesn’t miss any crucial questions?
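To make the regress concrete, here is a toy sketch, with `human_guess_questions` as a hypothetical stand-in for an unaided H listing the questions they think need asking. The point is that every level of the recursion, including wherever it bottoms out, is generated by exactly such unaided guesses, so no level of the tree adds questions that no human at the bottom would have thought of.

```python
# Toy sketch of the regress: `human_guess_questions` is a hypothetical stand-in
# for an unaided H listing the questions they think need asking.
from typing import Callable, List

def questions_to_ask(
    question: str,
    human_guess_questions: Callable[[str], List[str]],
    depth: int,
) -> List[str]:
    if depth == 0:
        # The recursion has to ground out somewhere: an unaided human guess.
        return human_guess_questions(question)
    # Otherwise, forward the meta-question to subinstances and pool their lists.
    # Which sub-questions get forwarded is itself a human guess at every level.
    gathered: List[str] = []
    for sub_question in human_guess_questions(question):
        gathered.extend(questions_to_ask(sub_question, human_guess_questions, depth - 1))
    return list(dict.fromkeys(gathered))  # deduplicate, preserving order
```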
Similarly, if the interface at the very top level does not successfully convey what I want those one step down to do, then there’s no error-correction mechanism for that; there’s no way to ground out the top-level question anywhere other than the top-level person. Again, it’s a bottleneck which fundamentally cannot be circumvented by outsourcing cognitive labor.
For complex questions I don’t think you’d have the top-level H immediately divide the question itself: you’d want to avoid this single-point-of-failure. In unbounded HCH, one approach would be to set up a scientific community (or a set of communities...), to which the question would be forwarded unaltered. You’d have many teams taking different approaches to the question, teams distilling and critiquing the work of others, teams evaluating promising approaches… [again, in strong HCH we have pointers for all of this]. For IDA you’d do something vaguely similar, on a less grand scale.
You can set up error-correction by passing pointers, explicitly asking about ambiguity/misunderstanding at every step (with parent pointers to get context), using redundancy....
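As a toy sketch of what the pointer-passing could look like mechanically (the names below are hypothetical, purely for illustration): each delegated subtask carries a pointer back to its parent, so a confused subinstance can walk up the chain for context, or flag the ambiguity, rather than silently guessing.

```python
# Minimal illustrative sketch of parent pointers plus explicit ambiguity flags.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Task:
    request: str
    parent: Optional["Task"] = None                      # pointer back up the tree
    ambiguity_flags: List[str] = field(default_factory=list)

    def context_chain(self) -> List[str]:
        # Walk parent pointers to recover the chain of requests above this one.
        chain, node = [], self
        while node is not None:
            chain.append(node.request)
            node = node.parent
        return chain

    def flag_ambiguity(self, note: str) -> None:
        # Rather than guessing, push the ambiguity back up to the parent.
        if self.parent is not None:
            self.parent.ambiguity_flags.append(f"{self.request!r}: {note}")

def delegate(parent: Task, request: str) -> Task:
    return Task(request=request, parent=parent)
```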
I agree that H needs to be pretty capable and careful—but I’m assuming a context where H is a team formed of hand-picked humans with carefully selected tools (and access to a lot of data). It’s not clear to me that such a team is going to miss required robustness/safety actions (neither is it clear to me that they won’t—I just don’t buy your case yet). It’s not clear they’re in an adversarial situation, so some fixed capability level that can see things in terms of process/meta-levels/abstraction/algorithms… may be sufficient. [once we get into truly adversarial territory, I agree that things are harder—but there we’re beyond things failing for the same reasons bureaucracies do]
I agree it’s hard to get giant bureaucracies to robustly perform simple tasks—I just don’t buy the analogy. Giant bureaucracies don’t have uniform values, and do need to pay for error correction mechanisms.
Like, if every time someone says “HCH has problem Y” the answer is “well the humans can just do X”, for many different values of Y and X, then that implies there’s some giant unstated list of things the humans need to do in order for HCH to actually work. If we’re going to rely on the scheme actually working, then we need that whole list in advance...
Here I want to say: Of course there’s a “giant unstated list of things...”—that’s why we’re putting H into the system. It’d be great if we could precisely specify all the requirements on H ahead of time—but if we could do that, we probably wouldn’t need H. (it certainly makes sense to specify and check for some X, but we’re not likely to be able to find the full list)
To the extent that for all Y so far we’ve found an X, I’m pretty confident that my dream-team H would find X-or-better given a couple of weeks and access to their HCH. While we’d want more than “pretty confident”, it’s not clear to me that we can get it without fooling ourselves: once you’re relying on a human, you’re squarely in pretty-confident-land. (even if we had a full list of desiderata, we’d only be guessing that our H satisfied the list)
However, I get less clear once we’re in IDA territory rather than HCH. Most of the approaches I first consider for HCH are nowhere near the object level of the question. Since IDA can’t afford to set up such elaborate structures, I think the case is harder to make there.
To the extent that for all Y so far we’ve found an X, I’m pretty confident that my dream-team H would find X-or-better given a couple of weeks and access to their HCH.
It sounds like roughly this is cruxy.
We’re trying to decide how reliable <some scheme> is at figuring out the right questions to ask in general, and not letting things slip between the cracks in general, and not overlooking unknown unknowns in general, and so forth. Simply observing <the scheme> in action does not give us a useful feedback signal on these questions, unless we already know the answers to the questions. If <the scheme> is not asking the right questions, and we don’t know what the right questions are, then we can’t tell it’s not asking the right questions. If <the scheme> is letting things slip between the cracks, and we don’t know which things to check for crack-slippage, then we can’t tell it’s letting things slip between the cracks. If <the scheme> is overlooking unknown unknowns, and we don’t already know what the unknown unknowns are, then we can’t tell it’s overlooking unknown unknowns.
So: if the dream team cannot figure out beforehand all the things it needs to do to get HCH to avoid these sorts of problems, we should not expect them to figure it out with access to HCH either. Access to HCH does not provide an informative feedback signal unless we already know the answers. The cognitive labor cannot be delegated.
(Interesting side-point: we can make exactly the same argument as above about our own reasoning processes. In that case, unfortunately, we simply can’t do any better; our own reasoning processes are the final line of defense. That’s why a Simulated Long Reflection is special, among these sorts of buck-passing schemes: it is the one scheme which does as well as we would do anyway. As soon as we start to diverge from Simulated Long Reflection, we need to ask whether the divergence will make the scheme more likely to ask the wrong questions, let things slip between cracks, overlook unknown unknowns, etc. In general, we cannot answer this kind of question by observing the scheme itself in operation.)
For complex questions I don’t think you’d have the top-level H immediately divide the question itself: you’d want to avoid this single-point-of-failure.
(This is less cruxy, but it’s a pretty typical/central example of the problems with this whole way of thinking.) By the time the question/problem has been expressed in English, the English expression is already a proxy for the real question/problem.
One of the central skills involved in conceptual research (of the sort I do) is to not accidentally optimize for something we wrote down in English, rather than the concept which that English is trying to express. It’s all too easy to think that e.g. we need a nice formalization of “knowledge” or “goal directedness” or “abstraction” or what have you, and then come up with some formalization of the English phrase which does not quite match the thing in our head, and which does not quite fit the use-cases which originally generated the line of inquiry.
This is also a major problem in real bureaucracies: the boss can explain the whole problem to the underlings, in a reasonable amount of detail, without attempting to factor it at all, and the underlings are still prone to misunderstand the goal or the use-cases and end up solving the wrong thing. In software engineering, for instance, this happens all the time and is one of the central challenges of the job.