Epistemology of HCH
Introduction
HCH is a recursive acronym meaning “Humans Consulting HCH”. It’s a concept coined by Paul Christiano that is central to much of the reasoning around Prosaic AI Alignment. Yet for many people, me included, the various ways in which it is used can be confusing.
I believe that the tools of Epistemology and Philosophy of Science can help us understand it better and push the research around it further. So this post doesn’t give yet another explanation of HCH; instead, it asks what different perspectives we can take on it. These perspectives capture the form of knowledge that HCH is, what it tells us about AI Alignment, and how to expand, judge and interpret this knowledge. I then apply these perspectives to examples of research on HCH, to show the usefulness of the different frames.
Thanks to Joe Collman, Jérémy Perret, Richard Ngo, Evan Hubinger and Paul Christiano for feedback on this post.
Is it a scientific explanation? Is it a model of computation? No, it’s HCH!
HCH was originally defined in Humans Consulting HCH:
Consider a human Hugh who has access to a question-answering machine. Suppose the machine answers question Q by perfectly imitating how Hugh would answer question Q, if Hugh had access to the question-answering machine.
That is, Hugh is able to consult a copy of Hugh, who is able to consult a copy of Hugh, who is able to consult a copy of Hugh…
Let’s call this process HCH, for “Humans Consulting HCH.”
Nowadays, this is actually called Weak HCH, after the Strong HCH post which extended the definition. That being said, I’m only interested in perspectives on HCH, which include the questions asked about it and how to answer them. Although the difference between Weak and Strong HCH matters for the answers, the questions and perspectives stay the same. I’ll thus use HCH to mean one or the other interchangeably.
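To make the recursive structure concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `human_answer` is a stand-in for the human H (who may consult HCH on subquestions through the `ask` callback), and the `depth` cutoff exists only so the sketch terminates, since the ideal HCH recurses without bound.

```python
from typing import Callable, Optional

# A hypothetical stand-in for the human H: given a question and a way to ask
# HCH about subquestions, produce an answer. This is not Paul's proposal,
# only an illustration of the recursive shape of the definition.
def human_answer(question: str, ask: Callable[[str], str]) -> str:
    # A real H might decompose the question, e.g.:
    #   background = ask("What facts are relevant to: " + question)
    #   ...and then combine the subanswers.
    return f"H's answer to {question!r}"


def hch(question: str, depth: Optional[int] = None) -> str:
    """Humans Consulting HCH: H answers `question` with access to HCH itself.

    The ideal is infinitely deep; `depth` is an artificial cutoff so the
    sketch terminates if H actually uses `ask`.
    """
    def ask(subquestion: str) -> str:
        if depth is not None and depth <= 0:
            # Past the cutoff, H answers unaided (no further consultation).
            return human_answer(subquestion, lambda q: "no further consultation")
        next_depth = None if depth is None else depth - 1
        return hch(subquestion, next_depth)

    return human_answer(question, ask)


print(hch("What is a good plan for my weekend?", depth=2))
```

The point of the sketch is only that H parametrizes the whole construction: everything interesting about HCH lives inside `human_answer`, which is exactly the part we cannot write down formally.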
The main use of HCH is as an ideal for what a question-answerer aligned with a given human should be like. This in turn serves as the aim of schemes like IDA and Debate. But thinking about HCH entangles many different angles. For example, what can HCH do? One can interpret this question as “What is the power of the HCH scheme?” or “What questions can HCH answer for a given human?” or “Is HCH aligned with the human parametrizing it?” or “Is HCH competitive?”. Each of these questions requires different assumptions and a different focus, making it hard to grasp the big picture.
I claim that most of what is studied about HCH can be brought to order if we make explicit the different epistemological perspectives through which we can see it. This is close to Kuhn’s paradigms in The Structure of Scientific Revolutions: a framing of the phenomenon studied which explains its most important aspects, how to study them, and what counts as knowledge under this approach. The perspectives I present are both more abstract than paradigms in the natural sciences (they’re more like epistemological paradigms) and less expansive. Yet I believe the intuition stays the same.
Kuhn writes that paradigms are necessary for what he calls normal science (which encompasses the majority of scientific research), namely solving the puzzles generated by the paradigm. Similarly, the perspectives I propose each put some questions and puzzles front and center, and limit the scope of HCH somewhat. Thus no single one is supposed to be sufficient; they all have something to bring to the table.
Each of these perspectives provides assumptions about HCH:
What it is
What the important questions about it are
But before giving these, let’s start with a classical perspective in science that doesn’t work well here.
False start: HCH as explanation of a natural phenomenon
In the natural sciences, ideas are often explanations of natural phenomena, like lightning and oxidation. Once armed with such an explanation, researchers attempt, among other things, to check it against existing data about the phenomenon and to predict new behavior in future experiments.
What could HCH be an explanation of? In the original post, Paul describes it as
our best way to precisely specify “a human’s enlightened judgment” [about a question Q]
So the phenomenon is enlightened judgement. Yet this looks more like an ideal than a phenomenon already present in the world.
Indeed, Paul’s Implementing our considered judgement, his first post on the topic as far as I know, presents the related notion of “considered judgment” as the outcome of a process that doesn’t exist yet.
To define my considered judgment about a question Q, suppose I am told Q and spend a few days trying to answer it. But in addition to all of the normal tools—reasoning, programming, experimentation, conversation—I also have access to a special oracle. I can give this oracle any question Q’, and the oracle will immediately reply with my considered judgment about Q’. And what is my considered judgment about Q’? Well, it’s whatever I would have output if we had performed exactly the same process, starting with Q’ instead of Q.
Seeing HCH as an explanation of enlightened judgment thus fails to be a fruitful epistemological stance, because we don’t have access to considered judgements in the wild to check the explanation.
HCH as philosophical abstraction
If enlightened judgment isn’t a phenomenon already existing in the world, intuitions nonetheless exist about what it means. For example, it feels like an enlightened judgment should depend on many different perspectives on the problem instead of only on the most obvious one. Or that such judgment shouldn’t change without additional information.
This leads to the perspective of HCH as a philosophical abstraction of the fuzzy intuitions around enlightened judgment (on a question Q). The aim of such an abstraction is to capture the intuitions in a clean and useful way. We’ll see a bit later what it should be useful for.
How should we judge HCH as a philosophical abstraction of enlightened judgement? One possible approach is inspired by Inference to the Best Explanation with regard to intuitions, as presented by Vanessa in her research agenda:
Although I do not claim a fully general solution to metaphilosophy, I think that, pragmatically, a quasiscientific approach is possible. In science, we prefer theories that are (i) simple (Occam’s razor) and (ii) fit the empirical data. We also test theories by gathering further empirical data. In philosophy, we can likewise prefer theories that are (i) simple and (ii) fit intuition in situations where intuition feels reliable (i.e. situations that are simple, familiar or received considerable analysis and reflection). We can also test theories by applying them to new situations and trying to see whether the answer becomes intuitive after sufficient reflection.
In this perspective, the intuitions mentioned above play the role of experimental data in natural sciences. We then want an abstraction that fits this data in the most common and obvious cases, while staying as simple as possible.
What if the abstraction only fits some intuitions but not others? Here we can take a cue from explanations in the natural sciences. These don’t generally explain everything about a phenomenon, but they have to explain what is deemed most important and/or fundamental about it. And here the notion of “importance” comes from the application of the abstraction. We want to use enlightened judgement to address the obvious question: “How do we align an AI with what we truly want?” (Competitiveness matters too, but it makes much more sense from the next perspective below.)
Enlightened judgement about a question serves as a proxy for “what we truly want” in the context of a question-answerer. It’s important to note that this perspective doesn’t look for the one true philosophical abstraction of enlightened judgment; instead it aims at engineering the most useful abstraction for the problem at hand—aligning a question-answerer.
In summary, this perspective implies the following assumptions about HCH.
(Identity) HCH is a philosophical abstraction of the concept of enlightened judgment, for the goal of aligning a question-answerer.
(Important Questions) These include pinpointing the intuitions behind enlightened judgment, weighing their relevance to aligning a question-answerer, and checking that HCH follows them (either through positive arguments or by looking for counterexamples).
HCH as an intermediary alignment scheme
Finding the right words for this one is a bit awkward. True, HCH isn’t an alignment scheme proper, in that it doesn’t really tell us how to align an AI. On the other hand, it goes beyond what is expected of a philosophical abstraction by giving a lot of detail about how to produce something satisfying the abstraction.
Comparing HCH with another proposed philosophical abstraction of enlightened judgement makes this clear. Let’s look at Coherent Extrapolated Volition (CEV), which specifies “what someone truly wants” as what they would want if they had all the facts available, had the time to consider all options, and knew enough about themselves and their own processes to catch biases and internal issues. By itself, CEV provides a clarified target to anyone trying to capture someone’s enlightened judgement. But it doesn’t say anything about how an AI doing that should be built. HCH, on the other hand, provides a structured answer, and tells you that the solution is to get as close as possible to that ideal answer.
So despite not being concrete enough to pass as an alignment scheme, HCH does lie at an intermediary level between pure philosophical abstractions (like CEV) and concrete alignment schemes (like IDA and Debate).
So what are the problems this perspective focuses on? As expected of a perspective closer to running code, they are geared towards practical concerns:
How well can HCH be approximated?
How competitive is HCH (and its approximations)?
That is, this perspective cares about the realization of HCH, assuming it is what we want. It’s not really the place to wonder how aligned HCH is; taking that for granted, we want to see how to get it right in a concrete program, and whether it costs too much to build.
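To illustrate what approximating HCH could look like in code, here is a schematic sketch of an IDA-style loop. Everything in it is a placeholder of my own: `Model`, `distill`, and `human_with_assistant` stand in for a learned model, a training procedure, and the human consulting that model; real iterated amplification replaces the trivial “distillation” below with actual machine learning, which is where the approximation and competitiveness questions bite.

```python
# A schematic, hypothetical sketch of iterated amplification as an
# approximation of HCH. `Model`, `distill`, and `human_with_assistant` are
# placeholders, not any real library or Paul's exact procedure.

class Model:
    """A learned question-answerer (placeholder)."""
    def __init__(self, behavior):
        self.behavior = behavior

    def answer(self, question: str) -> str:
        return self.behavior(question)


def human_with_assistant(question: str, assistant: Model) -> str:
    # Amplification: the human answers while consulting the current model
    # on a subquestion (trivially here, for illustration).
    background = assistant.answer("Background for: " + question)
    return f"H's answer to {question!r}, using [{background}]"


def distill(target_behavior) -> Model:
    # Distillation: train a fast model to imitate the amplified system.
    # Here "training" is perfect by construction: we just wrap the behavior.
    return Model(target_behavior)


model = Model(lambda q: "no background available")   # round 0: a trivial model
for _ in range(3):                                   # each round adds one layer of consultation
    amplified = lambda q, m=model: human_with_assistant(q, m)
    model = distill(amplified)

print(model.answer("What is a good plan for my weekend?"))
```

Each round of the loop roughly corresponds, under perfect distillation, to one more layer of consultation; the questions above ask what is lost when distillation is imperfect and the number of rounds is finite.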
In summary, this perspective implies the following assumptions about HCH.
(Identity) HCH is an intermediary alignment scheme for a question-answerer.
(Important Questions) Anything related to the realization of HCH as a program: approximability, competitiveness, limits in terms of expressiveness and power.
HCH as a model of computation
Obvious analogies exist between HCH and various models of computation like Turing Machines: both give a framework for solving a category of problems, questions on one side and computable problems on the other. HCH looks like it gives us a system on which to write programs for question-answering, by specifying what the human should do.
Yet one difficulty stares us in the face: the H in HCH. A model of computation, like the Turing Machine, is a formal construct about which one can prove results on computability and complexity, among other things. But HCH depends on a human at almost every step of the recursion, making it impossible to specify formally (even after dealing with the subtleties of infinite recursion).
Even if one uses the human as a black box, as Paul does, the behavior of HCH depends on the guarantees of this black box, which are stated as “cooperative”, “intelligent”, “reasonable”. Arguably, formalizing these properties is just as hard as formalizing good judgment or human values, and so proves incredibly difficult.
Still, seeing HCH through the perspective of models of computation has value. It lets us leverage results from theoretical computer science to get an idea of the sort of problems that HCH could solve. In some sense, what we’re studying is more OCO, as in “Oracles Consulting OCO”.
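As a toy illustration of this perspective, here is a tiny “program” written on the OCO model: the question and the decomposition strategy are made up, and the human is ignored entirely, but it shows the divide-and-conquer shape that results from theoretical computer science would analyze.

```python
from typing import List

# A toy "program" on the HCH/OCO model of computation: each node receives a
# question, either answers it directly or splits it into subquestions for
# copies of itself, then combines the subanswers. The question ("what is the
# largest number in this list?") and the decomposition are made up; the point
# is only the divide-and-conquer shape, with bounded work at each node.

def oco_max(numbers: List[int]) -> int:
    """Answer "what is the largest number in `numbers`?" by recursive decomposition."""
    if len(numbers) == 1:
        return numbers[0]              # a node can answer this subquestion directly
    mid = len(numbers) // 2
    left = oco_max(numbers[:mid])      # subquestion delegated to one copy
    right = oco_max(numbers[mid:])     # subquestion delegated to another copy
    return max(left, right)            # each node does only constant combination work


print(oco_max([7, 3, 19, 4, 11]))      # -> 19
```

Exhibiting such a decomposition is what an upper bound looks like on this model: it shows that a class of questions is answerable with bounded work at each node.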
Knowledge about HCH as a model of computation is thus relatively analogous to knowledge about Turing Machines:
Upper bounds for computability and complexity (algorithms)
Lower bounds for computability and complexity (impossibility results)
Requirements or structural constraints to solve a given problem
In summary, this perspective implies the following assumptions about HCH.
(Identity) HCH is a model of computation where the human is either a Turing Machine or an oracle satisfying some properties.
(Important Questions) These include what can be computed on this model, at what cost, and following which algorithm. But anything that is ordinarily studied in theoretical computer science goes: simulations between this model and others, comparisons of expressivity, and so on.
Applications of perspectives on HCH
Armed with our varied perspectives on HCH, we can now put in context different strands of research about HCH that appear incompatible at first glance, and judge them by the appropriate standards. The point is not that these arguments are correct; we just want to make them clear. After all, just because someone makes sense doesn’t mean they’re right (especially if they’re disagreeing with you).
Here are three examples from recent work, following the three perspectives on HCH from the previous section. Keep in mind, though, that most if not all research on this question draws from more than one perspective; I just point to the most prevalent one.
Assumptions about H to have an aligned Question-Answerer
In a recent post (that will probably soon have a followup focused on HCH), Joe Collman introduced the Question-Ignoring Argument (QIA) in Debate:
For a consequentialist human judge, the implicit question in debate is always “Which decision would be best for the world according to my values?”.
Once the stakes are high enough, the judge will act in response to this implicit question, rather than on specific instructions to pick the best answer to the debate question.
With optimal play, the stakes are always high: the value of timely optimal information is huge.
The “best for the world” output will usually be unrelated to the question: there’s no reason to expect the most valuable information to be significantly influenced by the question just asked.
The judge is likely to be persuaded to decide for the “best for the world” information on the basis of the training signal it sends: we already have the high-value information presented in this debate, even if it doesn’t win. The important consequences concern future answers
The gist is that if the human judge is assumed to have what we would consider outstanding qualities (being a consequentialist who wants to do what’s best for the world), then there is an incentive for the debaters to give an answer to a crucially important question (like a cure for a certain type of cancer) instead of answering the probably very specific question asked. So there is a sense in which the judge having traits we intuitively want makes it harder (and maybe impossible) for the system to be a question-answerer, even though that was the point of the training.
Applying the same reasoning to HCH gives a similar result: if H is either an altruistic human or a model of such a human, it might answer more important questions instead of the one it was asked.
Regardless of whether this line of thinking holds, it provides a very good example of taking HCH as a philosophical abstraction and investigating how well it fits the intuitions for enlightened judgement. Here the intuition is that the enlightened judgement about a question is an enlightened answer to this question, and not just a piece of very important and useful (but probably question-irrelevant) information.
Experimental work on Factored Cognition
Ought has been exploring Factored Cognition through experiments for years now. For example, their latest report studies the results of an experiment on evaluating claims about movie reviews while seeing only one step of the argument.
Such work indirectly studies the question of the competitiveness of HCH. In a sense, the different Factored Cognition hypotheses are all about what HCH can do. This is crucial if we aim to align question-answerers by means of approximating HCH.
The Ought experiments attempt to build a real-world version of (necessarily bounded) HCH and to see what it can do. They thus place themselves in the perspective of HCH as an intermediary alignment scheme, focusing on how competitive various approximations of it are. Knowing this helps us understand that we shouldn’t judge these experiments by what they say about the alignment of HCH, for example, because their perspective takes it for granted.
HCH as Bounded Reflective Oracle
In Relating HCH and Logical Induction, Abram Demski casts HCH as a Bounded Reflective Oracle (BRO), a type of probabilistic oracle Turing Machine which deals with diagonalization and self-reference issues to answer questions about what an oracle would do when given access to itself (the original idea is that of a Reflective Oracle; the post linked above introduces the boundedness and the link with Logical Induction). This reframing of HCH allows a more formal comparison with Logical Induction and the different guarantees that each proposes.
The lack of consideration of the human makes this post confusing if you think of HCH primarily as a philosophical abstraction of enlightened judgement, or as an intermediary alignment scheme. Yet when considered through the perspective of HCH as a model of computation, it makes much more sense. The point is to get an understanding of what HCH actually does when it computes, leveraging the tools of theoretical computer science for this purpose.
And once that is clear, the relevance to the other perspectives starts to appear. For example, Abram talks about different guarantees of rationality satisfied by Logical Induction, and why there is no reason to believe that HCH will satisfy them by default. On the other hand, that raises the question of what impact the human has on this:
It would be very interesting if some assumptions about the human (EG, the assumption that human deliberation eventually notices and rectifies any efficiently computable Dutch-book of the HCH) could guarantee trust properties for the combined notion of amplification, along the lines of the self-trust properties of logical induction.
Conclusion
HCH is still complex to study. But I presented multiple perspectives that help clarify most discussions on the subject: as a philosophical abstraction, as an intermediary alignment scheme, and as a model of computation.
Nothing guarantees that these are the only fruitful perspectives on HCH. Moreover, these might be less useful than imagined, or misguided in some ways. Yet I’m convinced, and I hope you’re more open to this idea after reading this post, that thinking explicitly about which epistemic perspectives to take on an idea like HCH matters for AI Alignment. This is one way we make sense of our common work, both for pushing the research further and for teaching newcomers to the field.