Vladimir_Nesov comments on Rant on Problem Factorization for Alignment

Vladimir_Nesov 13 Aug 2022 0:07 UTC
LW: 5 AF: 3
0
AF
In my view, the purpose of human/HCH distinction is that there are two models, that of a “human” and that of HCH (bureaucracies). This gives some freedom in training/tuning the bureaucracies model, to carry out multiple specialized objectives and work with prompts that the human is not robust enough to handle. This is done without changing the human model, to preserve its alignment properties and use the human’s pervasive involvement/influence at all steps to keep the bureaucracy training/tuning aligned.

The bureaucracies model starts out as that of a human. An episode involves multiple (but only a few) instances of both humans and bureaucracies, each defined by a self-changed internal state and an unchanging prompt/objective. It’s a prompt/mission-statement that turns the single bureaucracies model into a particular bureaucracy, for example one of the prompts might instantiate the ELK head of the bureaucracies model. Crucially, the prompts/objectives of humans are less weird than those of bureaucracies, don’t go into the chinese room territory, and each episode starts as a single human in control of the decision about which other humans and bureaucracies to initially instantiate in what arrangement. It’s only the bureaucracies that get to be exposed to chinese room prompts/objectives, and they can set up subordinate bureaucracy instances with similarly confusing-for-humans prompts.

Since the initial human model is not very capable or aligned, the greater purpose of the construction is to improve the human model. The setting allows instantiating and training multiple specialized bureaucracies, and possibly generalizing their prompt/role/objective from the examples used in training/tuning the bureaucracies model (the episodes). After all, robustness of the bureaucracies model to weird prompts is almost literally the same thing as breadth of available specializations/objectives of bureaucracies.

So the things I was pointing to, incentives/interpretability/rationality, are focus topics for tuned/specialized bureaucracies, whose outputs can be assessed/used by the more reliable but less trainable human (as more legible reference texts, and not large/opaque models) to improve bureaucracy (episode) designs, to gain leverage over bureaucracies that are more specialized and robust to weird prompts/objectives, by solving more principal-agent issues.

More work being allocated to incentives/surveillance/rationality means that even when working on some object-level objective, a significant portion of the bureaucracy instances in an episode would be those specialized in those principal-agent problem (alignment) prompts/objectives, and not in the object-level objective, even if it’s the object-level objective bureaucracy that’s being currently trained/tuned. Here, the principal-agent objective bureaucracies (alignment bureaucracies/heads of the bureaucracies model) remain mostly unchanged, similarly to how the human model (that bootstraps alignment) normally remains unchanged in HCH, since it’s not their training that’s being currently done.
What links here?