It may well be the case that even if you removed [incentive to distort the credit assignment], credit assignment would still be a major problem for things like HCH, but how can you know this from empirical experience with real-world human institutions (which you emphasize in the OP)?
Because there exist human institutions in which people generally seem basically aligned and not trying to game the credit assignment. For instance, most of the startups I’ve worked at were like this (size ~20 people), and I think the alignment research community is basically like this today (although I’ll be surprised if that lasts another 3 years). Probably lots of small-to-medium size orgs are like this, especially in the nonprofit space. It’s hard to get very big orgs/communities without letting in some credit monsters, but medium-size is still large enough to see coordination problems kick in (we had no shortage of them at ~20-person startups).
And, to be clear, I’m not saying these orgs have zero incentive to distort credit assignment. Humans do tend to do that sort of thing reflexively, to some extent. But to the extent that it’s reflexive, it would also apply to HCH and variants thereof. For instance, people in HCH would still reflexively tend to conceal evidence/arguments contradicting their answers. (And when someone does conceal contradictory evidence/arguments, that would presumably increase the memetic fitness of their claims, causing them to propagate further up the tree, so that also provides a selection channel.) Similarly, if the HCH implementation has access to empirical testing channels and the ability to exchange multiple messages, people would still reflexively tend to avoid/bury tests which they expect will actually falsify their answers, or try to blame incorrect answers on subquestions elsewhere in the tree when an unexpected experimental outcome occurs and someone tries to backpropagate to figure out where the prediction-failure came from. (And, again, those who shift blame successfully will presumably have more memetic fitness, etc.)
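To make that selection channel concrete, here's a toy simulation (the tree structure and all the numbers are invented for illustration, not part of any actual HCH proposal): answers that conceal their contradicting evidence are harder for a parent node to reject, so they make up a growing share of answers as you move up the tree.

```python
import random

# Toy model of the selection channel: answers that conceal contradicting
# evidence are less likely to be rejected by the parent node, so they are
# over-represented higher up the tree. All numbers are illustrative.

CONCEAL_RATE = 0.2        # chance a node reflexively hides contradicting evidence
REJECT_IF_VISIBLE = 0.6   # chance a parent rejects an answer whose problems are visible
REJECT_IF_HIDDEN = 0.1    # chance a parent rejects an answer whose problems are hidden

def leaf_answer():
    """A leaf node produces an answer, possibly concealing contradicting evidence."""
    return {"concealed": random.random() < CONCEAL_RATE}

def node_answer(depth, branching=3):
    """A node aggregates child answers, keeping those that survive its filtering."""
    if depth == 0:
        return leaf_answer()
    surviving = []
    for _ in range(branching):
        child = node_answer(depth - 1, branching)
        reject_p = REJECT_IF_HIDDEN if child["concealed"] else REJECT_IF_VISIBLE
        if random.random() > reject_p:
            surviving.append(child)
    # The node's own answer inherits concealment from whichever child it relies on.
    return random.choice(surviving) if surviving else leaf_answer()

if __name__ == "__main__":
    trials = 2000
    at_root = sum(node_answer(depth=4)["concealed"] for _ in range(trials)) / trials
    print(f"fraction of concealed-evidence answers at the root: {at_root:.2f} "
          f"(vs. {CONCEAL_RATE:.2f} at the leaves)")
```

With these made-up parameters the concealment rate climbs from 0.2 at the leaves to well over half at the root, purely from differential survival at each level of the tree.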
What if 90% or 99% of the work was not object-level, but about mechanism/incentive design, surveillance/interpretability, and rationality training/tuning, including work specialized to the particular projects being implemented (including the projects that set all this up), iterating as relevant wisdom/tuning and reference texts accumulate? This isn’t feasible for most human projects, since it increases costs by orders of magnitude in money (salaries), talent (number of capable people), and serial time. But in HCH you can copy people, it runs faster, and distillation should get rid of redundant steering if it converges to something legible in the limit of redundancy.
Remember, all that work still needs to be done by HCH itself. Mechanism/incentive design, surveillance/interpretability, and rationality training/tuning all seem about as difficult as the alignment problem itself, if not more so.
Copying people is a potential game changer in general, but HCH seems like a really terrible way to organize those copies.
In my view, the point of the human/HCH distinction is that there are two models, that of a “human” and that of HCH (the bureaucracies). This gives some freedom in training/tuning the bureaucracies model to carry out multiple specialized objectives and to work with prompts that the human is not robust enough to handle. This is done without changing the human model, both to preserve its alignment properties and to use the human’s pervasive involvement/influence at every step to keep the bureaucracy training/tuning aligned.
The bureaucracies model starts out as a copy of the human model. An episode involves multiple (but only a few) instances of both humans and bureaucracies, each defined by a self-changed internal state and an unchanging prompt/objective. It’s the prompt/mission-statement that turns the single bureaucracies model into a particular bureaucracy; for example, one of the prompts might instantiate the ELK head of the bureaucracies model. Crucially, the prompts/objectives of humans are less weird than those of bureaucracies and don’t go into Chinese-room territory, and each episode starts with a single human in control of the decision about which other humans and bureaucracies to instantiate initially, and in what arrangement. It’s only the bureaucracies that get exposed to Chinese-room prompts/objectives, and they can set up subordinate bureaucracy instances with similarly confusing-for-humans prompts.
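Here’s a minimal sketch of how I’m picturing that episode structure (the data layout and the example prompts are my own guesses, not anything specified): two models, a frozen human model and a tunable bureaucracies model; instances defined by an unchanging prompt plus mutable state; and an episode that begins with a single human instance deciding what else to instantiate.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Model:
    name: str
    trainable: bool  # the human model stays frozen; the bureaucracies model is tuned

@dataclass
class Instance:
    model: Model
    prompt: str                                 # unchanging prompt/objective ("mission statement")
    state: list = field(default_factory=list)   # self-changed internal state (e.g. message history)

    def receive(self, message: str) -> None:
        self.state.append(message)

# One human model, and one bureaucracies model initialized from it.
human_model = Model("human", trainable=False)
bureaucracies_model = Model("bureaucracies", trainable=True)

def start_episode() -> list[Instance]:
    # Each episode starts with a single human instance, on a non-weird prompt,
    # in control of which other humans and bureaucracies to instantiate.
    root = Instance(human_model, prompt="Coordinate this episode.")
    # The same bureaucracies model becomes different bureaucracies via different
    # (possibly confusing-for-humans) prompts; these examples are illustrative.
    elk_head = Instance(bureaucracies_model, prompt="Report what the model actually believes.")
    incentives = Instance(bureaucracies_model, prompt="Audit credit assignment in this episode.")
    return [root, elk_head, incentives]

episode = start_episode()
```

The prompt lives on the instance rather than in the state because the prompt is what distinguishes one bureaucracy from another, while the state is the only part that changes over the course of the episode.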
Since the initial human model is not very capable or aligned, the greater purpose of the construction is to improve the human model. The setting allows instantiating and training multiple specialized bureaucracies, and possibly generalizing their prompt/role/objective from the examples used in training/tuning the bureaucracies model (the episodes). After all, robustness of the bureaucracies model to weird prompts is almost literally the same thing as breadth of available specializations/objectives of bureaucracies.
So the things I was pointing to, incentives/interpretability/rationality, are focus topics for tuned/specialized bureaucracies. Their outputs take the form of relatively legible reference texts rather than large/opaque models, so they can be assessed and used by the more reliable but less trainable human to improve bureaucracy (episode) designs. By solving more principal-agent issues, this gives the human leverage over bureaucracies that are more specialized and more robust to weird prompts/objectives.
Allocating more work to incentives/surveillance/rationality means that even when working on some object-level objective, a significant portion of the bureaucracy instances in an episode would be specialized in principal-agent-problem (alignment) prompts/objectives rather than in the object-level objective, even if it’s the object-level bureaucracy that’s currently being trained/tuned. Here, the principal-agent bureaucracies (the alignment bureaucracies/heads of the bureaucracies model) remain mostly unchanged, similarly to how the human model (which bootstraps alignment) normally remains unchanged in HCH, since it’s not their training that’s currently being done.
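A rough sketch of that training-allocation point, as I read it (the stub model, the prompts, and the freezing-by-omission logic are my interpretation, not a specified procedure): most instances in an episode run alignment-specialized prompts, but gradients only flow through the trajectory of the specialization currently being trained, so the alignment specializations stay mostly unchanged.

```python
class BureaucraciesModel:
    """Stub standing in for the single tunable bureaucracies model (hypothetical API)."""
    def loss(self, prompt: str, trajectory: list) -> float:
        return float(len(trajectory))  # placeholder loss
    def update(self, loss: float) -> None:
        print(f"applying gradient update with loss {loss:.1f}")

# Illustrative prompts; the real specializations would be whatever the setup needs.
ALIGNMENT_PROMPTS = [
    "Audit credit assignment in this episode.",
    "Surface evidence that contradicts the proposed answer.",
]
OBJECT_LEVEL_PROMPT = "Design the experiment for project X."  # hypothetical objective

def training_step(model, episode_trajectories, currently_training):
    """Update the model only on the trajectory of the specialization being trained.

    episode_trajectories maps each instance's prompt to its trajectory. Instances
    running alignment prompts contribute oversight inside the episode, but no
    gradients, so those specializations remain mostly unchanged."""
    total_loss = sum(
        model.loss(prompt, traj)
        for prompt, traj in episode_trajectories.items()
        if prompt == currently_training
    )
    model.update(total_loss)

# A typical episode: mostly alignment-specialized instances, one object-level one,
# with only the object-level specialization currently being trained/tuned.
trajectories = {p: ["..."] for p in ALIGNMENT_PROMPTS}
trajectories[OBJECT_LEVEL_PROMPT] = ["...", "..."]
training_step(BureaucraciesModel(), trajectories, currently_training=OBJECT_LEVEL_PROMPT)
```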