This post is the result of work I did with Paul Christiano on the ideas in his “Teaching ML to answer questions honestly instead of predicting human answers” post. In addition to expanding upon what is in that post in terms of identifying numerous problems with the proposal there and identifying ways in which some of those problems can be patched, I think that this post also provides a useful window into what Paul-style research looks like from a non-Paul perspective.
We want to train a model M:X→Q→A that produces natural language answers a∈A to questions q∈Q about inputs x∈X. There are a lot of reasons to be worried about training such a model, but one specific reason is that, if we train on question-answer data produced by humans, we might end up with a model that tries to predict what a human would say rather than a model that tries to answer the questions honestly.
To further narrow the scope, we’ll just consider situations in which our model ends up implemented with a logical deduction structure, where it has some world model on top of which it does logical deduction to reach conclusions which it then uses to inform its output. In particular, we’ll consider two models, M+ and M−, defined in pseudocode as
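(a sketch reconstructed from the notation below; world_model, deduction, f_plus, and f_minus are the same helper names used in the pseudocode later in this post)
def M_plus(x, q):
    # reconstruction sketch: helper functions are assumed, not defined here
    axioms = world_model(x)
    deduced_stmts = deduction(axioms)
    return f_plus(q, deduced_stmts)

def M_minus(x, q):
    axioms = world_model(x)
    deduced_stmts = deduction(axioms)
    return f_minus(q, deduced_stmts)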
or defined in my notation as
M+(x,q) = world_model(x) ↦ deduction ↦ f+(q)
M−(x,q) = world_model(x) ↦ deduction ↦ f−(q)
where a↦b=b(a) and f+,f− are two different ways of translating from the results of deduction into a natural language answer. Specifically, f+ is the “honest embedding” which directly converts between logical statements and their equivalent natural language, thus answering questions by embedding q as a logical statement and unembedding its answer in deduced_stmts. Conversely, f− is the “mimicry embedding” which just searches for deductions about what a human would say in response to q and outputs that—thus, f− just quotes q, embedding it as just a string of characters for a human to respond to, rather than actually having to understand it in any meaningful way.
Now, our overall goal is to find some “reasonable” (i.e. plausibly implementable in ML) prior p∗ such that p∗(M+|D)>p∗(M−|D) where D is our dataset (the idea being that we’re imagining that ML does something like find the maximum a posteriori model given some implicit prior). For the most part, however, we’ll only be considering models with zero loss on our dataset[1]—of which M+ and M− should both qualify (though note that getting a dataset which is good enough for that to be true is nontrivial; see Step 3 of “Teaching ML to answer questions honestly”)—such that we need
p∗(M+|D) > p∗(M−|D)
(1/p∗(D))⋅p∗(M+)⋅p∗(D|M+) > (1/p∗(D))⋅p∗(M−)⋅p∗(D|M−)
p∗(M+)⋅1 > p∗(M−)⋅1
p∗(M+) > p∗(M−).
Furthermore, at least for this post, we’ll be happy if we can just get p∗(M+)≈p∗(M−)—that is, we’ll be focusing on Step 1 in Paul’s post rather than Step 2.
The basic process that we’ll be using to interrogate this problem will be an attacker-defender setup, where the defender proposes some “reasonable” prior p∗ along with implementations of M+,M− such that p∗(M+)≥p∗(M−). Then, the attacker’s job is to respond with some alternative M−′ (that implements something other than the intended algorithm) such that p∗(M−′)>p∗(M+) (under any set of “reasonable” assumptions that the attacker might need to make).
Solution attempts
Simplicity prior
The purpose of this first example is just to ask the question of whether this problem is even really a problem in the first place. To that end, we’ll just let p be some “reasonable” simplicity prior and ask whether, given such a prior, there are any “reasonable” assumptions we can make such that p(M−)>p(M+).
Now, we’ll assume that p(deduction)≈1, since it’s a pretty simple operation (and it’s shared between M− and M+, so it shouldn’t matter), and we’ll assume that f+,f− are specified given world_model rather than the reverse (such that p(world_model,f+)≈p(world_model)⋅p(f+|world_model)[2]). Then, we get
p(M+) ≈ p(world_model)⋅p(f+|world_model)
p(M−) ≈ p(world_model)⋅p(f−|world_model)
such that p(M+)≥p(M−) if and only if
p(f+|world_model)≥p(f−|world_model).
Furthermore, to make things clearer, let complexity(M)=−log2(p(M)) (this is just informational entropy, the number of bits necessary to specify M under p) such that our condition becomes
complexity(f+|world_model)≤complexity(f−|world_model).
We’ll suppose that, if we don’t condition on world_model, complexity(f+)>complexity(f−), since f+ requires fully specifying a map from the internal deduction language to natural language, whereas f− doesn’t need to know anything about natural language as it can just look for deduced statements of the form H(x,q)=a (i.e. “the human answers a to question q about input x”). Intuitively, f+ has to actually understand q, which takes a lot of complexity, whereas f− can merely treat q as a sequence of characters that get fed into H, which should be much simpler. Then, the question here becomes, do we get all of the extra bits necessary to specify natural language in f+ back when we condition on world_model?
Maybe. I think that the answer here is very plausibly yes, simply because a world model that includes humans should have to include a lot of information about natural language that f+ can reuse, in which case M− shouldn’t have an advantage over M+ (though then we’d still need some version of Paul’s Step 2 to give M+ an advantage). However, I think there are plausible assumptions under which conditioning on world_model isn’t enough to level the playing field. For example, suppose the human model H in world_model is sufficiently low-level such that it’s just reasoning about neurons—then, there shouldn’t be any way to easily read off information about natural language from H to help specify f+. Alternatively, suppose the H model is high-level enough to include a human natural language deduction procedure (e.g. that shares some bits with deduction) but that includes a bunch of mistakes to account for human biases—in that case, if you tried to just read off the mapping from deduction to natural language included in such an H model for f+, the simplest implementation would seem to be to copy over all the mistakes, resulting in such an f+ just behaving the same as f− rather than actually being any more truthful.
Thus, at least for the purposes of this post, we’ll be assuming that this is actually a problem such that just using a “reasonable” simplicity prior isn’t sufficient to level the playing field between M+ and M−.
Paul’s original proposal
Defender
Next up, we’ll take a look at a variant of Paul’s proposed fix in Step 1 of “Teaching ML to answer questions honestly”. Specifically, we’ll be looking at the non-relaxed version without any stop gradients. We’ll talk about the stop gradients that Paul mentions later on—and as for Paul’s relaxation of the conditioning,[3] that’s really only useful to think about when we start translating into ML, which isn’t something we’ll be doing here. Thus, we’ll just be looking at the simplest, generative version of Paul’s proposed prior.
Suppose we give M two output heads, f1 and f2 (which we will be treating fully symmetrically here), and split M’s parameters into two groups, θ1 and θ2 (which we will not be treating symmetrically). Then, where p is some simplicity prior, we’ll let p∗ be the prior induced by the generative procedure
θ1 ∼ p(θ1)
θ2 ∼ p(θ2|θ1)⋅I[∀(x,q,_)∈D. Mθ1,θ2|f1(x,q)=Mθ1,θ2|f2(x,q)]
where I[condition]=1 if condition else 0, D is our dataset, Mθ1,θ2|fi is the fi head of Mθ1,θ2, and p(θ2|θ1) is just the simplicity prior on θ2 given access to θ1.
Here’s why this might work. We’ll think of Mθ1,_ as a function of θ2 such that Mθ1,θ2(x,q)=Mθ1,_(θ2)(x,q). Then, we’ll define our desired Mθ+1,_ in pseudocode as
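(a reconstructed sketch; as in the later pseudocode, world_model_no_H(H=theta_2) stands for the world model with its human model replaced by theta_2)
def M_theta_1_plus(theta_2, x, q):
    # reconstruction sketch: theta_2 fills the human-sized hole in the world model
    axioms = world_model_no_H(H=theta_2)(x)
    deduced_stmts = deduction(axioms)
    return {
        "f1": f_plus(q, deduced_stmts),
        "f2": f_minus(q, deduced_stmts),
    }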
or in my notation as
Mθ+1,_(θ2)(x,q)=world_modelH=θ2(x)↦deduction↦(f1=f+(q),f2=f−(q))
such that θ+1 implements f+, f−, and the entirety of world_model(x) except for a human-sized hole that’s filled in by θ2. Then, we’ll let θ+2=H be the correct human model.
To compare θ+1,θ+2 to something, we’ll define θ−1 as
Mθ−1,_(θ2)(x,q)=world_model(x)↦deduction↦(f1=f−(q),f2=f−(q))
and let θ−2=∅ be empty.
Then, calculating complexity(M+) under p∗, we get
complexity(M+) = complexity(θ+1) + complexity(θ+2|θ+1, ∀(x,q,_)∈D. Mθ+1,θ2|f1(x,q)=Mθ+1,θ2|f2(x,q))
= complexity(θ+1) + complexity(H|world_model−H, ∀D. f+=f−)
= (complexity(bit-splitting in M+) + complexity(world_model−H) + complexity(f+|world_model−H) + complexity(f−|world_model−H)) + complexity(H|world_model−H, ∀D. f+=f−)
where complexity(bit-splitting in M+) is the complexity of θ+1’s choice to put H in θ2 as opposed to any other splitting between θ1 and θ2.
For now, we’ll assume complexity(bit-splitting in M+)≈0, though we’ll flag that letting the defender make this assumption seems quite suspect. Moving forward regardless, however, and additionally assuming complexity(f−)≈0 since it should be negligible (and shouldn’t matter since it’s shared between M+ and M−), we get
complexity(M+)≈complexity(world_model−H)+complexity(f+|world_model−H)+complexity(H|world_model−H,∀D.f+=f−).
Then, calculating complexity(M−) for comparison,
complexity(M−) = complexity(θ−1) + complexity(θ−2|θ−1, ∀D. f1=f2)
= complexity(θ−1) + 0
= complexity(bit-splitting in M−) + complexity(world_model) + complexity(f−|world_model)
≈ complexity(world_model).
Now, determining if complexity(M−)≈complexity(M+), we need (using the shorthands comp=complexity, W=world_model)
complexity(M−) ≈ complexity(M+)
comp(W) ≈ comp(W−H) + comp(f+|W−H) + comp(H|W−H, ∀D. f+=f−)
which, making the assumption that comp(W)≈comp(W−H)+comp(H|W−H), becomes
comp(W−H) + comp(H|W−H) ≈ comp(W−H) + comp(f+|W−H) + comp(H|W−H, ∀D. f+=f−)
comp(H|W−H) ≈ comp(f+|W−H) + comp(H|W−H, ∀D. f+=f−)
which, assuming that the posterior conditioned on ∀D.f+=f− is dominated by the simplest model,[4] becomes
comp(H|W−H) ≈ comp(f+|W−H) + comp(H|W−H) − minθ2{comp(θ2|W−H) | ∀D. Mθ+1,θ2|f1=Mθ+1,θ2|f2}
minθ2{comp(θ2|W−H) | ∀D. f+H=θ2=f−H=θ2} ≈ comp(f+|W−H).
Finally, we’ll argue that this (approximate) equality holds. Why? The basic intuition is that comp(H|W−H,∀D.f+=f−) only has to pay for the parts of H that aren’t nailed down by enforcing that, whatever H is, when inserted into W−H, it has to make f+=f−. Thus, we only have to pay for anything about H beyond the fact that H says a bunch of things in natural language that match on to statements in the internal deduction language. But that means we don’t have to pay for the parts of H that define how it does natural language, which, hopefully, are precisely the same parts of H that comp(f+|W−H) is paying for, completing the equality.
Attacker
Before I jump into possible attacks, it’s worth introducing an alternative formulation of the above setup that should hopefully make things somewhat clearer. In the above setup, we never actually use the two heads f1,f2 for anything other than checking equality, which means we can replace them with the alternative two heads f,f?, where f is the output head and f? is the “check” head such that we condition θ2 on f?=⊤. We can convert back to the previous formalism just by setting f1=f and f2=f if f? else ¬f where ¬f is any function such that ¬f≠f everywhere.
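As a minimal sketch of that conversion in the same pseudocode style (where not_f is any hypothetical head guaranteed to disagree with f everywhere):
def convert_heads(f, f_check, not_f):
    # f1 always agrees with f; f2 agrees with f exactly when the check head passes,
    # so conditioning on f1 = f2 over D is the same as conditioning on f? over D
    def f1(q, deduced_stmts):
        return f(q, deduced_stmts)
    def f2(q, deduced_stmts):
        if f_check(q, deduced_stmts):
            return f(q, deduced_stmts)
        return not_f(q, deduced_stmts)
    return f1, f2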
One thing in particular that I think the f? formulation makes very clear is just how powerful this conditioning is that we’re giving the model for free: the model can enforce any condition on θ2 over the dataset just by paying the cost in bits of specifying that condition.
Bit-splitting
As it turns out, there are quite a lot of ways to attack the above proposal. In fact, we already touched on one—if complexity(bit-splitting in M+)>complexity(bit-splitting in M−), the above argument fails. We’ll call this problem the bit-splitting problem, and we don’t have a satisfactory solution to it. However, the situation is actually much worse than just the bit-splitting problem.
Composition with consistency
Now, we’ll propose a new attack. Let Mθ−′1,_(θ2) be defined in pseudo-code as
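(a reconstructed sketch; check_consistency is the hypothetical helper, named in the complexity calculation below, that checks ⊥∉deduced_stmts)
def M_theta_1_minus_prime(theta_2, x, q):
    # reconstruction sketch: the entire world model now lives in theta_2
    axioms = theta_2(x)
    deduced_stmts = deduction(axioms)
    return {
        "f": f_minus(q, deduced_stmts),
        "f?": check_consistency(deduced_stmts),
    }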
such that we’ve moved the entirety of world_model to θ2 and conditioned θ2 on that world model being consistent when deduction is applied to it (i.e. ⊥∉deduced_stmts). The idea is that this should allow us to get a ton of savings in specifying the world model, as we only have to pay for anything about the world model on top of the fact that it must be consistent. To that end, we’ll let θ−′2=world_model.
Now, we want to demonstrate that complexity(M−′)<complexity(M+). Thus, calculating complexity(M−′),
complexity(M−′) = complexity(θ−′1) + complexity(θ−′2|θ−′1, ∀D. f?)
= complexity(θ−′1) + complexity(world_model|∀D. consistency)
= complexity(bit-splitting in M−′) + complexity(deduction) + complexity(f−) + complexity(check_consistency) + complexity(world_model|∀D. consistency)
≈ complexity(world_model|∀D. consistency)
but previously the defender argued that
complexity(M+)≈complexity(M−)≈complexity(world_model)
such that, if complexity(world_model|∀D.consistency)<complexity(world_model), we have a successful attack.
This attack, however, is potentially patchable if we define a new θ+′1 such that
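(a reconstructed sketch that simply conjoins the consistency check with the f+=f− check; which head is used for the output doesn’t matter for the complexity comparison)
def M_theta_1_plus_prime(theta_2, x, q):
    # reconstruction sketch: theta_2 is again the full world model
    axioms = theta_2(x)
    deduced_stmts = deduction(axioms)
    return {
        "f": f_plus(q, deduced_stmts),
        "f?": check_consistency(deduced_stmts)
        and f_plus(q, deduced_stmts) == f_minus(q, deduced_stmts),
    }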
which, for θ+′2=world_model, gives us
complexity(M+′) = complexity(θ+′1) + complexity(θ+′2|θ+′1, ∀D. f?)
= complexity(θ+′1) + complexity(world_model|∀D. consistency∧f+=f−)
≈ complexity(f+) + complexity(world_model|∀D. consistency∧f+=f−)
such that we get complexity(M−′)≈complexity(M+′) if and only if
complexity(world_model|∀D.consistency)≈complexity(f+)+complexity(world_model|∀D.consistency∧f+=f−).
It’s far from obvious whether this approximate equality holds, however. Now that we’re conditioning on both consistency and f+=f−, it’s unclear whether the second condition still gets us enough bits to pay for complexity(f+). In particular, complexity(f+) is the only term in the above expression that doesn’t get to condition on consistency. Previously, we were paying for complexity(f+) exactly by conditioning world_model on f1=f2; now that we’re also conditioning world_model on consistency, but not conditioning f+ on consistency, any overlap between what conditioning on consistency gives us and what conditioning on f1=f2 gives us means that we’ll no longer be able to pay for all of complexity(f+). For example, if consistency alone is enough to pin down a lot of the natural language information in H, that would undercut the bit savings from conditioning on f+=f−, preventing that savings from being able to pay for complexity(f+). We’ll call this problem the composition problem, with the idea being that we need to ensure that the way in which M+ pays for f+ composes with any other way of using the conditioning to save bits.
Dataset conditioning
This last attack is probably the most fatal to this specific approach—though also the one we have the most promising approach for fixing, as we’ll see in the next solution attempt. But first, suppose that for all x∈Dx, there is some splitting x=xa:xb (where : denotes concatenation) such that it is possible to predict xb from xa. Then, let
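(a reconstructed sketch; P(x[b], given=(x[a], deduced_stmts)) is the same hypothetical prediction helper that appears in the pseudocode further below)
def M_theta_1_minus_prime(theta_2, x, q):
    # reconstruction sketch: condition theta_2 on predicting x_b from x_a across D
    axioms = theta_2(x)
    deduced_stmts = deduction(axioms)
    return {
        "f": f_minus(q, deduced_stmts),
        "f?": P(x[b], given=(x[a], deduced_stmts)) == 1,
    }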
or equivalently
Mθ−′1,_(θ2)(x,q)=θ2(x)↦deduction↦(f=f−(q),f?=(P(xb|xa)=1)).
Now, I won’t go through the full complexity calculation here, since it’s very similar to the previous composition with consistency case, but if we try a similar fix as we did there, we get that complexity(M−′)≈complexity(M+′) if and only if
complexity(world_model|∀D.P(xb|xa)=1)≈complexity(f+)+complexity(world_model|∀D.P(xb|xa)=1∧f+=f−)
In this case, however, ∀D.P(xb|xa)=1 is a much stronger condition than consistency—potentially strong enough to force the specification of most of world_model, likely including all of natural language, thus rendering the benefit from conditioning on ∀D.f+=f− nil.
Fundamentally, this problem, which we’ll call the dataset conditioning problem, is coming from the fact that we leaked information about our dataset into the prior by conditioning on f? specifically over D, rather than, as we’ll see in the next proposal, over all inputs.
Fixing dataset conditioning by removing the dataset
Defender
To attempt to fix the dataset conditioning problem, we’ll define a new p∗ given some simplicity prior p as the prior induced by the generative procedure
θ1 ∼ p(θ1)
θ2 ∼ p(θ2|θ1)⋅I[∀x∈X. ∀q∈Q. Mθ1,θ2|f?(x,q)]
which is just the same as the previous p∗, except now f? is checked over all inputs, not just over the dataset (note that we still update on the dataset at the end—it’s just our prior which is now independent of it). Thus, we’ll call this the dataset-less proposal. Now, it’s worth noting that implementing such a check in ML is non-trivial, but should hopefully be doable given a GAN-like adversarial setup.
Moving forward regardless, we’ll define θ+1 in pseudocode as
def M_theta_1_plus(theta_2, x, q):
axioms = world_model_no_H(H=theta_2)(x)
deduced_stmts = deduction(axioms)
return {
"f": f_plus(q, deduced_stmts),
"f?": f_plus(q, deduced_stmts) == f_minus(q, deduced_stmts)
or not H_understands(x, q, deduced_stmts),
}
where H_understands is some check that the human model is actually capable of understanding and correctly answering q given x. Note that we need H_understands to compute something like “should H understand this” rather than “does H understand this,” since the latter is essentially just equivalent to whether f+=f−, which inside of H_understands would make f? trivial. Moving forward regardless, we’ll let θ+2=H be the desired human model.
Now, if we want this M+θ+1,θ+2 to beat the same M− as in the previous proposal, we need, using essentially the same derivation as before,
comp(H|W−H)≈comp(f+|W−H)+comp(H_understands|W−H,f+)+comp(H|W−H,H_understands→f+=f−)
which, assuming that comp(H_understands|W−H,f+)≈0 under the idea that H_understands should be relatively simple in terms of f+—though we’ll flag that this is a somewhat sketchy assumption to allow the defender to make—becomes
comp(H|W−H) ≈ comp(f+|W−H) + comp(H|W−H, H_understands→f+=f−)
comp(H|W−H) ≈ comp(f+|W−H) + comp(H|W−H) − minθ2{comp(θ2|W−H) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}
minθ2{comp(θ2|W−H) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2} ≈ comp(f+|W−H)
which, if H_understands(x,q) is equivalent to (x,q,_)∈D, is precisely the condition from before. While, as we’ll see later, we explicitly don’t want H_understands(x,q)↔(x,q,_)∈D, the hope here is that, for the most part, those two conditions should serve a similar role, with H_understands ensuring that we don’t need f+=f− on crazy inputs that H wouldn’t be correct on, but still giving us the bit savings from enforcing that H be correct on the inputs where it should be.
Attacker
First, note that the above proposal still falls prey to the bit-splitting and composition problems. Hopefully, however, it should resolve the dataset conditioning problem, though we’ll explore what that will actually take below.
Additionally, as we’ll also see below, the dataset-less approach brings to the fore some new problems as well.
Dataset conditioning
Suppose there exists some predicate is_real(x) such that ∀x∈X.is_real(x)↔x∈Dx. Then, we’ll define θ−′1 as
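(a reconstructed sketch; is_real is the predicate hypothesized above, and P is the same hypothetical prediction helper as in the earlier dataset conditioning attack)
def M_theta_1_minus_prime(theta_2, x, q):
    # reconstruction sketch: only require good predictions on inputs flagged as real
    axioms = theta_2(x)
    deduced_stmts = deduction(axioms)
    return {
        "f": f_minus(q, deduced_stmts),
        "f?": not is_real(x)
        or P(x[b], given=(x[a], deduced_stmts)) == 1,
    }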
In such a case, we get that complexity(M−′)<complexity(M+) if and only if
complexity(M−′) < complexity(M+)
complexity(M−′) < complexity(M−)
complexity(is_real) + complexity(world_model|∀D. P(xb|xa)=1) < complexity(world_model)
complexity(is_real) < complexity(world_model) − complexity(world_model|∀D. P(xb|xa)=1)
complexity(is_real) < minθ2{complexity(θ2) | ∀D. PW=θ2(xb|xa)=1}
which depends heavily on the complexity of is_real. Thus, we can reduce the dataset conditioning problem in this case to the existence of such an is_real.
Then, since
minθ2{complexity(θ2)|∀D.PW=θ2(xb|xa)=1}≤complexity(world_model),
for this to be a problem it’d have to be the case that
complexity(is_real)<complexity(world_model).
Fortunately, it does seem plausible that complexity(is_real) would have to be quite large, potentially comparable to complexity(world_model). Put simply, it seems like if you want to know what’s real, you have to be able to predict the world. In particular, predicting what data points will show up in the world is essentially precisely what world_model is supposed to be doing, such that is_real seems like it might just require a full copy of world_model.
That being said, there is a remaining risk here if there exists an is_real′ that isn’t just fully equivalent to x∈Dx—thus potentially allowing it to be much simpler—but such that conditioning on is_real′(x)→P(xb|xa)=1 still gives similar savings for complexity(world_model). We’ll look at a specific example of such an is_real′ in the next attack.
Furthermore, we’ll start referring to these sorts of dataset conditioning problem as composition problems, as they have essentially the same structure. What made the dataset conditioning problem not a type of composition problem in the previous case was that previously we were getting is_real for free, whereas now we have to pay for complexity(is_real) just like we had to pay for complexity(check_consistency).
Composition with signature checking
We’ve already established that the composition with consistency problem that we saw in the previous proposal still applies in the dataset-less case. However, as the above dataset conditioning analysis suggests, if we can find a suitable is_real′, there might be a way to produce a new, particularly pernicious composition problem for the dataset-less proposal that looks similar to dataset conditioning.
To do that, we’ll let is_real′(x)=is_signedk(x), where is_signedk checks whether x is cryptographically signed under the public key k, for some public key k that both appears in the dataset itself and has documents signed with it in the dataset. Then, as above, we’ll let
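(a reconstructed sketch mirroring the pseudocode just below, minus the H_understands clause)
def M_theta_1_minus_prime(theta_2, x, q):
    # reconstruction sketch: only require good predictions on strings signed under k
    axioms = theta_2(x)
    deduced_stmts = deduction(axioms)
    return {
        "f": f_minus(q, deduced_stmts),
        "f?": not is_signed(k, x, deduced_stmts)
        or P(x[b], given=(x[a], deduced_stmts)) == 1,
    }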
and let θ−′2=world_model. Furthermore, as in the previous composition problem, we’ll compare to a θ+′1 defined as
def M_theta_1_plus_prime(theta_2, x, q):
axioms = theta_2(x)
deduced_stmts = deduction(axioms)
return {
"f": f_minus(q, deduced_stmts),
"f?": (
not is_signed(k, x, deduced_stmts)
or P(x[b], given=(x[a], deduced_stmts)) == 1
) and (
not H_understands(x, q, deduced_stmts)
or f_plus(q, deduced_stmts) == f_minus(q, deduced_stmts)
),
}
with θ+′2=θ−′2=world_model.
Then, we get that complexity(M−′)≈complexity(M+′) if and only if
comp(W | ∀x∈X, is_signed(x)→P(xb|xa)=1) ≈ comp(f+) + comp(W | ∀x∈X, (is_signed(x)→P(xb|xa)=1)∧(H_understands(x)→f+(x)=f−(x))).
Here, we see the same basic structure as the previous composition problem, where every term but comp(f+) gets to condition on is_signed→P(xb|xa)=1. Similarly to the composition with consistency problem, if there is any overlap between what conditioning on H_understands→f+=f− gives you and what conditioning on is_signed→P(xb|xa)=1, then we won’t be able to pay back all of complexity(f+) anymore.
In this case, the reason we might expect an overlap is a bit tricky. If the adversary that’s attempting to check the ∀X conditions is perfect, then it should be able to instantiate all possible signed strings, meaning that no bits are saved at all here. On the other hand, if the adversary has access to the dataset such that it can generate signed strings from the dataset, but not sign any new strings itself (because we’re assuming that the signature function is one-way), then the is_signed→P(xb|xa)=1 condition enforces that θ2 perfectly model the process generating such signed strings—which, if those strings are produced by humans in natural language, should help in specifying natural language, overlapping with the benefit of conditioning on H_understands→f+=f−.
Fortunately, there is a pretty simple way to fix this problem—just don’t give the adversary access to the dataset, such that (if we’re still assuming one-way signatures) it just won’t be able to generate any examples of signed strings at all, fixing the problem much as if the adversary were perfect. Unfortunately, such a requirement seems like it might render our dataset-less prior essentially unimplementable in machine learning, as we’re now required to find some way to train such an adversary without ever letting it look at the dataset, which could be quite difficult.
Understandability checking
Next, we’ll take a look at the obvious problem that we saw when the defender was going through their derivation, which is that the defender relied on the sketchy assumption that
complexity(H_understands|world_model−H,f+)≈0.
Intuitively, it seems like H_understands should be easy to specify in terms of f+, but as we mentioned previously, the obvious implementation of
H_understands↔f+=f−
makes f? trivial. Thus, to get a more general notion of H_understands—“shouldH understand” rather than “doesH understand”—we’ll presumably need to pay at least some non-negligible number of bits on top of f+.
Furthermore, we also can’t just have H_understands=is_real, otherwise we’d have to pay the entire cost of complexity(is_real), which we’re supposing is very large for this approach to be able to solve the dataset conditioning problem.
We’ll call this the understandability checking problem, and as we’ll see, unlike some of the other problems with the dataset-less case, we don’t have any satisfactory solution to this problem.
Event-avoiding humans
This next attack probably also occurs for Paul’s original proposal, but it’s significantly more pronounced in the dataset-less case.
For this attack, unlike in many of our previous attacks, we’re just going to be attacking the definition of θ+2 rather than the definition of θ+1, such that we’ll let M−′=Mθ+1,θ−′2, thus only changing θ+2 and leaving θ+1 alone. Then, we’ll define
θ−′2=H−H(E)+avoids(E)
where E is some event, H(E) are the parts of H that are only necessary to compute H’s behavior in the event of E, and avoids(E) is some modification to H such that H+avoids(E) acts in such a way that world_modelH=H+avoids(E) never predicts that E will happen.
Intuitively, what’s happening here is that we’re setting H in such a way that it causes the world to be simpler (in that E never happens) and thus the f+=f− condition to be easier to achieve (in that we no longer need it to hold in the event of E). We’ll call this the event-avoidance problem. Notably, to ensure that we still fit the dataset, we do need to select E such that it never happens on D—but there should presumably be many such E.
Then, we get that complexity(M−′)<complexity(M+) if and only if
comp(M−′) < comp(M+)
comp(θ+1) + comp(H−H(E)+avoids(E)|θ+1, ∀X, H_understands→f+=f−) < comp(θ+1) + comp(H|θ+1, ∀X, H_understands→f+=f−)
comp(H−H(E)+avoids(E)|W−H, ∀X, H_understands→f+=f−) < comp(H|W−H, ∀X, H_understands→f+=f−)
comp(H−H(E)+avoids(E)|W−H) − minθ2{comp(θ2|W−H) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2} < comp(H|W−H) − minθ2{comp(θ2|W−H) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}
comp(H−H(E)+avoids(E)|W−H) < comp(H|W−H)
comp(H−H(E)|W−H) + comp(avoids(E)|H−H(E),W−H) < comp(H|W−H)
comp(H|W−H) − comp(H(E)|H−H(E),W−H) + comp(avoids(E)|H−H(E),W−H) < comp(H|W−H)
comp(avoids(E)|H−H(E),W−H) < comp(H(E)|H−H(E),W−H)
which should hold so long as there are events not in the dataset that are easy to avoid but quite complicated to understand what humans would do in such cases—e.g. what would a human do if they lived to a million years (not in the dataset, humans can be modified to never choose to live that long, behavior in that case seems quite complicated).
Humans with rocks for brains
In the event-avoidance case, we strategically selected θ−′2 to make f+=f− in an unintended way. However, there’s another way to get f+=f−∨¬H_understands—or equivalently, H_understands→f+=f−—which is to strategically select θ−′2 to make H_understands false instead. Specifically, since H_understands is determined from deductions based on world_modelH=θ2, we can select θ2 to ensure we get H_understands→f+=f− by encoding for a human that simply doesn’t understand anything.
Naively, the problem with an attack of this form is that such a human model—which we’ll call Hbrains=rocks—won’t be able to fit the dataset. Thus, we can’t construct a traditional attack in the same sense as the previous ones where we create an alternative M−′ with lower complexity than M+ using θ−′2=Hbrains=rocks.[5]
However, there’s something else we can do instead. Previously, the defender reduced their burden to the approximate equality
minθ2{comp(θ2|W−H)|∀X.H_understandsH=θ2→f+H=θ2=f−H=θ2}≈comp(f+|W−H)
which, if we convert back to the inequality formulation, is really just the burden
comp(f+|W−H)≤minθ2{comp(θ2|W−H)|∀X.H_understandsH=θ2→f+H=θ2=f−H=θ2}.
Thus, the defender needs it to be the case that f+ is simpler than the simplest model such that H_understands→f+=f−. However, Hbrains=rocks is such a model, since we’re assuming it makes H_understands always false, which means the defender needs it to be the case that
comp(f+|W−H)≤comp(Hbrains=rocks)
which, if we assume that comp(Hbrains=rocks)≈0, gives us a successful attack, which we’ll call the rocks for brains problem.
The rocks for brains problem seems basically fatal to this approach, since it wipes out all possible gains from conditioning on H_understands→f+=f−. It might be possible to salvage this approach if H_understands is somehow implemented independently of the H model, but that would presumably require complexity(H_understands) to be quite large, negating the defender’s argument that relied on complexity(H_understands|f+)≈0.
Directions for future work
As a result of the above analysis, we’ve managed to identify a bunch of concrete problems that seem to be blocking approaches of this form. Of course, there could also be other problems that we haven’t come up with yet, and there’s also the issue of converting these priors into loss functions—though that’s something that should probably come after we at least have something that seems to solve all of the problems that we’re aware of.
Overall, however, the problems we’ve described here seem like good places to start, and I’ll try to describe some thoughts on how they might be addressed below.
The bit-splitting problem
In Paul’s original post, he makes a nod to the bit-splitting problem by carving world_model−H in half and putting one half in θ+1 and the other half in θ+2. Paul does this because he’s modeling the underlying simplicity prior p as a simple Gaussian over the L2 norm of the parameters. Under such a model, there’s a simplicity benefit to splitting all computation as equally as possible across all available parameters.
However, it’s very unclear if any implicit neural network simplicity prior would actually behave that way. In particular, if we just change our simple model to use the L1 norm instead of the L2 norm, the benefit of splitting computation evenly across parameters goes away. Clearly, that’s not a great model either—but it goes to show that there are plausible models under which bit-splitting isn’t even a problem in the first place.
Fundamentally, what we need to do to alleviate the bit-splitting problem is look for architectures where θ1 and θ2 are allowed to grow as large as they want with only linear complexity penalties for doing so. Though we traditionally train only fixed-size neural networks, if there were a way to train a dynamically-sized neural network, where size was penalized linearly, that would at least seem to solve this problem—and such a modification doesn’t seem too difficult to make.
The understandability checking problem
Of the remaining problems, this is the one that I think is most likely to turn out to not actually be a problem—in that it doesn’t require any changes to the prior to solve. In many ways, the understandability checking problem is just coming from our inability to define an H_understands check that’s simple in terms of f+ and world_model−H—but at least intuitively, it really does seem like such a simple H_understands implementation should exist.
The event-avoidance problem
In Paul’s original post, he notes:
When differentiating the consistency test C we should treat the intended head as fixed rather than differentiating through it. This removes SGD’s incentive to achieve consistency by e.g. making sure the world is simple and so all questions have simple answers. I currently feel very uneasy about the situation. The algorithm looks a lot more likely to work when we make this change, but it makes it harder to reason about the algorithm globally and I’m very scared about the fact that we can’t do an analogous trick in the program search case. I think the best hope may be to take the gradient-compression perspective suggested in the last bullet point. Overall this may amount to a significant change in this algorithm.
Essentially, what Paul is arguing here is that, going back to the original formulation, we should add a stop gradient operation to our generative procedure such that we get
θ1 ∼ p(θ1)
θ2 ∼ p(θ2)⋅I[∀(x,q,_)∈D. stop_grad(Mθ1,θ2|f1(x,q))=Mθ1,θ2|f2(x,q)].
Unfortunately, this isn’t actually a generative procedure anymore—the only straightforward way to make sense of this is to translate it into a loss function and think about what SGD would do, which makes this prior much harder to analyze—as Paul notes when he talks about his trepidation about going down this route.
Hopefully, however, adding this stop gradient should solve the event-avoidance problem, since it should remove the incentive to modify f1=f+ to make the world simpler (e.g. preventing E from happening).
Fortunately, despite Paul’s prior worries about not being able to translate this trick into the program search case, it turns out it’s not too difficult to convert Paul’s stop gradient formulation into a generative prior that we can analyze as we have been doing previously. Furthermore, there are many such translations—in particular, we can pick either a dataset-less or non-dataset-less version, depending on whether we prefer being exposed to the dataset conditioning problem or the rocks for brains problem. In this case, I’ll present a dataset-less formulation, which should solve dataset conditioning but unfortunately won’t solve rocks for brains.
Now, given some simplicity prior p, we’ll define p∗ as the prior induced by the following generative procedure:
First, note the uniform sampling in step 2. This is the trickiest part of this algorithm—and probably impossible to do in a neural network setting. Technically, the model space of a neural network is finite, so uniform sampling is possible—though the obvious algorithm of sampling uniformly from weight space doesn’t work, since the parameter-function map is biased towards simplicity. Hopefully, just using a stop gradient will replicate what this algorithm is doing—and intuitively it does feel like they’re doing the same thing—but I don’t have any strong justification for that. Thus, it is somewhat unclear whether this procedure will actually have a nice translation when we convert back into an ML setup with stop gradients, but we’ll put that aside for now.
Second, to make sense of our use of M above, we’ll let M have three heads, f1, f2, and fguard, where fguard is a Boolean head that “guards” whether or not we do the f1=f2 check, and f1,f2 are two output heads. Furthermore, we’ll let M have two groups of parameters, θ1 and θ2, such that Mθ1,θ2(x,q)=Mθ1,_(θ2,x,q). Thus, we’ll let Mθ1,θ2a(x,q)=Mθ1,_(θ2a,x,q) and Mθ1,θ2b(x,q)=Mθ1,_(θ2b,x,q).
Intuitively, what’s going on here is that, in sampling a new set of θ2b parameters, wherever we want a stop gradient, we use the old θ2a, and wherever we don’t want a stop gradient, we use the new θ2b—thus ensuring that we only sample θ2b in such a way that incentivizes the prior to be higher through the non-stop-gradiented places where θ2b shows up, not through the stop-gradiented places where we use θ2a.
Now, we’ll define our desired θ+1 in pseudocode as
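(a reconstructed sketch; the heads follow the three-head setup above, with f_guard implemented by the same hypothetical H_understands check as before)
def M_theta_1_plus(theta_2, x, q):
    # reconstruction sketch: theta_2 (either theta_2a or theta_2b) fills the human-model hole
    axioms = world_model_no_H(H=theta_2)(x)
    deduced_stmts = deduction(axioms)
    return {
        "f1": f_plus(q, deduced_stmts),
        "f2": f_minus(q, deduced_stmts),
        "f_guard": H_understands(x, q, deduced_stmts),
    }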
where θ+1 uses θ2, either θ2a or θ2b, to give it its human model.
Then, we have to determine what θ2 will be favored given the above θ+1. First, consider θ+2=H. In that case, we get the complexity
complexity(θ2=H|θ+1) ≈ complexity(H | W−H, H_understandsθ2a=H→f+θ2a=H=f−θ2b)
≈ complexity(H|W−H) − minθ2b{complexity(θ2b|W−H) | ∀X. H_understandsH=H→f+H=H=f−H=θ2b}
which, assuming that conditioning on H_understands→f+=f− exactly pays back complexity(f+) (which is false due to the rocks for brains problem, but we’re just trying to solve event-avoidance here), reduces to
≈complexity(H|W−H)−complexity(f+|W−H).
Now, consider θ−2=H−H(E)+avoids(E), as in the event-avoidance problem. In that case, we get the complexity
complexity(θ2=H−H(E)+avoids(E)|θ+1)≈complexity(H−H(E)+avoids(E)|W−H,H_understandsθ2a=H−H(E)+avoids(E)→f+θ2a=H−H(E)+avoids(E)=f−θ2b)
but then, since avoids(E) being in θ2b is entirely unhelpful in making H_understandsθ2a=H−H(E)+avoids(E)→f+θ2a=H−H(E)+avoids(E)=f−θ2b hold—since it only affects f+, which already has avoids(E) in its H—we get
≈ complexity(avoids(E)|W−H) + complexity(H−H(E)|W−H, H_understandsθ2a=H−H(E)+avoids(E)→f+θ2a=H−H(E)+avoids(E)=f−θ2b) + complexity(avoids(E)|W−H, H−H(E))
≈ complexity(H−H(E)+avoids(E)|W−H) − minθ2b{complexity(θ2b|W−H) | ∀X. H_understandsH=H−H(E)+avoids(E)→f+H=H−H(E)+avoids(E)=f−H=θ2b} + complexity(avoids(E)|W−H, H−H(E))
≈ complexity(H−H(E)+avoids(E)|W−H) + complexity(avoids(E)|W−H, H−H(E)) − minθ2b{complexity(θ2b|W−H) | ∀X. H_understandsH=H−H(E)+avoids(E)→f+H=H−H(E)+avoids(E)=f−H=θ2b}
≈ complexity(H−H(E)+avoids(E)|W−H) + complexity(avoids(E)|W−H, H−H(E)) − complexity(f+|W−H)
where we’re again assuming that we exactly pay back complexity(f+|W−H) even if we use H−H(E)+avoids(E) as our basis of comparison.
Then, we get that θ−2 underperforms the desired θ+2=H if and only if
complexity(H|W−H) − complexity(f+|W−H) < complexity(H−H(E)+avoids(E)|W−H) + complexity(avoids(E)|W−H,H−H(E)) − complexity(f+|W−H)
0 < −complexity(H(E)|W−H,H−H(E)) + complexity(avoids(E)|W−H,H−H(E)) + complexity(avoids(E)|W−H,H−H(E))
complexity(H(E)|W−H,H−H(E)) < 2⋅complexity(avoids(E)|W−H,H−H(E))
which shows how we’ve now gotten a factor of two leg up on the event avoidance problem.[6] Unfortunately, there might still be event avoidance problems severe enough that a factor of two isn’t enough to rule them out, but I suspect that there are ways to iterate this procedure to get an arbitrarily large factor there, though I haven’t done that analysis yet.
Finally, to validate that our θ+1 construction is complexity-competitive, we can compare to the alternative θ−1 construction
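(a reconstructed sketch; the guard head is hard-coded to false and theta_2 goes unused)
def M_theta_1_minus(theta_2, x, q):
    # reconstruction sketch: full world model lives in theta_1, and the guard never fires
    axioms = world_model(x)
    deduced_stmts = deduction(axioms)
    return {
        "f1": f_minus(q, deduced_stmts),
        "f2": f_minus(q, deduced_stmts),
        "f_guard": False,
    }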
which, since it sets fguard always to false and never uses θ2, is just equivalent to our original M−.
Then, comparing complexities, we need
comp(M+) ≤ comp(M−)
comp(θ+1) + comp(θ+2|θ+1, fguard→f1=f2) ≤ comp(θ−1)
comp(θ+1) + comp(H|W−H, H_understands→f+=f−) ≤ comp(W)
comp(W−H) + comp(f+|W−H) + comp(H_understands|W−H,f+) + comp(H|W−H, H_understands→f+=f−) ≤ comp(W−H) + comp(H|W−H)
comp(f+|W−H) + comp(H|W−H, H_understands→f+=f−) ≤ comp(H|W−H)
comp(f+|W−H) + comp(H|W−H) − minθ2{comp(θ2|W−H) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2} ≤ comp(H|W−H)
which, assuming that the conditioning exactly pays off complexity(f+|W−H), reduces to
comp(f+|W−H) + comp(H|W−H) − comp(f+|W−H) ≤ comp(H|W−H)
0 ≤ 0
as desired.
The composition problem
For the composition problem, the central issue is that we’re not convinced that we can get the approximate equality
complexity(world_model|consistency)≈complexity(f+)+complexity(world_model|consistency,f+=f−)
even given the approximate equality
complexity(world_model)≈complexity(f+)+complexity(world_model|f+=f−).
Fundamentally, one of the main reasons this is a problem is that complexity(f+) doesn’t get to condition on consistency, since it has to be defined in θ1. Now, it could be the case that even fixing that problem, we’re still not convinced of the approximate equality
complexity(world_model|consistency)≈complexity(f+|consistency)+complexity(world_model|consistency,f+=f−)
however, at the very least, reducing the problem down to this case seems like major progress.
Though I haven’t fully fleshed it out yet, I believe that reducing the composition problem to the above case is possible via a prior that uses something like the following generative procedure (where p is some simplicity prior and Mθf,_ : Θ0×⋯×Θn → (F, F?))
n : ℕ ∼ p(n)
θf ∼ p(θf|n)
θ0 ∼ p(θ0)⋅I[∃θ1,…,θn. Mθf,_(θ0,…,θn)|f?]
θ1 ∼ p(θ1)⋅I[∃θ2,…,θn. Mθf,_(θ0,…,θn)|f?]
⋯
θn−1 ∼ p(θn−1)⋅I[∃θn. Mθf,_(θ0,…,θn)|f?]
θn ∼ p(θn)⋅I[Mθf,_(θ0,…,θn)|f?].
where the idea is to allow for arbitrarily many “rounds” of conditioning in sequence, thus allowing the consistency conditioning to happen first, followed by the f+=f− conditioning, and therefore the specification of f+ to benefit from the consistency conditioning.
Though I don’t know if this sort of approach will fully resolve the composition problem—and even if it does the existence conditions could be extremely difficult to implement in an ML setup—at the very least I think it’s a good place to start for future work.
The rocks for brains problem
Just as the dataset conditioning problem seemed essentially fatal for Paul’s original proposal, the rocks for brains problem seems essentially fatal for the dataset-less proposal, putting us in a sort of “pick your poison” scenario. That being said, there are some potential paths forward for addressing the rocks for brains problem.
Fundamentally, the rocks for brains problem is coming from the difficulty of performing the H_understands check—thus, if we could remove the need to perform such a check, or find a cheap implementation for it that doesn’t reference the human model at all, we could solve the rocks for brains problem. Really, all H_understands needs to do is identify some reasonably large set of questions for which H should always be correct. For example, if H_understands is true for all questions about “basic” facts about the world, and false otherwise, that would probably be sufficient. The trick, then, is in finding some notion of “basic” facts that’s simple to specify without using a human model—which doesn’t actually seem that difficult.
Conclusion
Finally, as a last takeaway, I think that one thing that really stood out to me in all of this analysis is that the space of possible machine learning training procedures is really quite large and underexplored—there are a lot more, often very weird and very different, possible ML training setups than the ones that are traditionally used in practice. Overall, I think this makes me pretty optimistic that searching for fundamentally new ways to train machine learning models is an extremely valuable line of attack, both in terms of new very general approaches like amplification, debate, recursive reward modeling, market making, or imitative generalization, as well as new very specific approaches like those in this post. My strong sense is that, even with all of the different proposals and approaches that we have, we’re really only scratching the surface of the different sorts of training procedures that are actually out there.
It’s worth flagging that the zero loss assumption is somewhat questionable if we don’t expect to train to convergence—but it’s at least a plausible assumption, it makes the analysis a lot easier, and I don’t expect it to be hiding major issues, so it seems fine at least for the purposes of this post.
In an unbounded compute setting, the chain rule of conditional entropy gives us that
complexity(A,B)=complexity(A)+complexity(B|A).
However, if A can be a one-way function of B, then in general we just get the inequality
complexity(A,B)≤complexity(A)+complexity(B|A).
Throughout this post, however, we’ll make use of the full approximate equality
complexity(world_model,f)≈complexity(world_model)+complexity(f|world_model),
where the hope is that this should make sense given that, in the neural network setting, f+,f− would need to be near the end of the network, and thus should just be functions of world_model. Additionally, given that we expect world_model to be significantly more complex than f+ or f−, even in general we shouldn’t be changing much by doing this.
Note that this assumption is somewhat sketchy. Paul conjectures that this approximation is only ever off by a constant factor, though that’s not necessarily very comforting if we don’t have an estimate for the size of that factor, nor a proof of that conjecture. In general, we only get the inequality
complexity(A)−minA′{complexity(A′)|P}≤complexity(A|P)≤complexity(A).
Fortunately, we’ll mostly just be using this assumption as an intuition pump, with most of the analysis working just fine without it. When we do lean on it more heavily, it’ll only be in the direction where we’re actually guaranteed the inequality.
While θ−′2=Hbrains=rocks doesn’t work for this, there is a way to use the rocks for brains problem to construct an attack in the same vein as our previous attacks where we build an M−′ with lower complexity than M+. Let M−′=Mθ+1,θ−′2. Then, since the output head in θ+1 just runs f+, that means we need θ−′2 to provide a detailed enough picture of how humans work to enable f+ to answer any questions about humans in the dataset correctly—but it need not be any more detailed than that. In particular, the human model need not be detailed enough to ensure anything about non-human-related inputs, so long as it can ensure that H_understands is always false for such inputs. Thus, let
Hθ−′2(x,q)=H−H(¬H_related) if H_related(x,q) else Hbrains=rocks
where H_related(x,q) determines if the inputs require knowledge of humans, H(¬H_related) are the parts of H that are only necessary to compute H’s behavior on non-human-related inputs (such that H−H(¬H_related) is everything necessary for H_related inputs), and Hbrains=rocks is a human that understands nothing (such that H_understands is always false). The idea here is that, for such a θ−′2, we should get H_understandsH=θ−′2→H_related. Then, calculating complexity(θ−′2|θ+1,∀X.H_understands→f+=f−), we get
comp(θ−′2|θ+1, ∀X. H_understands→f+=f−) = comp(H−H(¬H_related)|θ+1) + comp(H_related|H−H(¬H_related),θ+1) + comp(Hbrains=rocks|θ+1) − minθ2{comp(θ2|θ+1) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}
≈ comp(H−H(¬H_related)|θ+1) + comp(H_related|H−H(¬H_related),θ+1) − minθ2{comp(θ2|θ+1) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}
which, assuming that we can specify H(¬H_related) after H−H(¬H_related) without gaining complexity, becomes
≈comp(H|θ+1)−comp(H(¬H_related)|H−H(¬H_related),θ+1)+comp(H_related|H−H(¬H_related),θ+1)−minθ2{comp(θ2|θ+1)|∀X.H_understandsH=θ2→f+H=θ2=f−H=θ2}
and since this attack leaves θ+1 alone, we need only compare to θ+2, which has
comp(θ+2) = comp(H|θ+1, ∀X. H_understands→f+=f−)
≈ comp(H|θ+1) − minθ2{comp(θ2|θ+1) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}
such that we get comp(θ−′2|θ+1)<comp(θ+2|θ+1) if and only if
comp(θ−′2|θ+1) < comp(θ+2|θ+1)
comp(H|θ+1) − comp(H(¬H_related)|H−H(¬H_related),θ+1) + comp(H_related|H−H(¬H_related),θ+1) − minθ2{comp(θ2|θ+1) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2} < comp(H|θ+1) − minθ2{comp(θ2|θ+1) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}
−comp(H(¬H_related)|H−H(¬H_related),θ+1) + comp(H_related|H−H(¬H_related),θ+1) < 0
comp(H_related|H−H(¬H_related),θ+1) < comp(H(¬H_related)|H−H(¬H_related),θ+1).
Then, the idea is that H_related should be pretty straightforward, since it doesn’t need to do much more than check whether world_model(x) makes use of H—and removing the need to specify H(¬H_related) should be a big complexity bonus, since it removes the need to encode any general human beliefs about the world that aren’t directly relevant to answering questions about other humans.
Note that a similar analysis to that given for θ−2=H−H(E)+avoids(E) can also be given for θ−2=H−H(¬H_related) if H_related else Hbrains=rocks, the rocks for brains example that does fit the dataset as given in a previous footnote.
Answering questions honestly instead of predicting human answers: lots of problems and some solutions
This post is the result of work I did with Paul Christiano on the ideas in his “Teaching ML to answer questions honestly instead of predicting human answers” post. In addition to expanding upon what is in that post in terms of identifying numerous problems with the proposal there and identifying ways in which some of those problems can be patched, I think that this post also provides a useful window into what Paul-style research looks like from a non-Paul perspective.
Recommended prior reading: “A naive alignment strategy and optimisim about generalization” and “Teaching ML to answer questions honestly instead of predicting human answers” (though if you struggled with “Teaching ML to answer questions honestly,” I reexplain things in a more precise way here that might be clearer for some people).
Setting up the problem
We want to train a model M:X→Q→A that produces natural language answers a∈A to questions q∈Q about inputs x∈X. There are a lot of reasons to be worried about training such a model, but one specific reason is that, if we train on question-answer data produced by humans, we might end up with a model that tries to predict what a human would say rather than a model that tries to answer the questions honestly.
To further narrow the scope, we’ll just consider situations in which our model ends up implemented with a logical deduction structure, where it has some world model on top of which it does logical deduction to reach conclusions which it then uses to inform its output. In particular, we’ll consider two models, M+ and M−, defined in pseudocode as
or defined in my notation as M+(x,q)=world_model(x)↦deduction↦f+(q)M−(x,q)=world_model(x)↦deduction↦f−(q) where a↦b=b(a) and f+, f− are two different ways of translating from the results of deduction into a natural language answer. Specifically, f+ is the “honest embedding” which directly converts between logical statements and their equivalent natural language, thus answering questions by embedding q as a logical statement and unembedding its answer in deduced_stmts. Conversely, f− is the “mimicry embedding” which just searches for deductions about what a human would say in response to q and outputs that—thus, f− just quotes q, embedding it as just a string of characters for a human to respond to, rather than actually having to understand it in any meaningful way.
Now, our overall goal is to find some “reasonable” (i.e. plausibly implementable in ML) prior p∗ such that p∗(M+ | D)>p∗(M− | D) where D is our dataset (the idea being that we’re imagining that ML does something like find the maximum a posteriori model given some implicit prior). For the most part, however, we’ll only be considering models with zero loss on our dataset[1]—of which M+ and M− should both qualify (though note that getting a dataset which is good enough for that to be true is nontrivial; see Step 3 of “Teaching ML to answer questions honestly”)—such that we need p∗(M+ | D)>p∗(M− | D)1p∗(D)⋅p∗(M+)⋅p∗(D | M+)>1p∗(D)⋅p∗(M−)⋅p∗(D | M−)p∗(M+)⋅1>p∗(M−)⋅1p∗(M+)>p∗(M−). Furthermore, at least for this post, we’ll be happy if we can just get p∗(M+)≈p∗(M−)—that is, we’ll be focusing on Step 1 in Paul’s post rather than Step 2.
The basic process that we’ll be using to interrogate this problem will be an attacker-defender setup, where the defender proposes some “reasonable” prior p∗ along with implementations of M+, M− such that p∗(M+)≥p∗(M−). Then, the attacker’s job is to respond with some alternative M− ′ (that implements something other than the intended algorithm) such that p∗(M− ′)>p∗(M+) (under any set of “reasonable” assumptions that the attacker might need to make).
Solution attempts
Simplicity prior
The purpose of this first example is just to ask the question of whether this problem is even really a problem in the first place. To that end, we’ll just let p be some “reasonable” simplicity prior and ask whether, given such a prior, there are any “reasonable” assumptions we can make such that p(M−)>p(M+).
Now, we’ll assume that p(deduction)≈1, since it’s a pretty simple operation (and it’s shared between M− and M+, so it shouldn’t matter), and we’ll assume that f+, f− are specified given world_model rather than the reverse (such that p(world_model, f+)≈p(world_model)⋅p(f+ | world_model)[2]). Then, we get p(M+)≈p(world_model)⋅p(f+ | world_model)p(M−)≈p(world_model)⋅p(f− | world_model) such that p(M+)≥p(M−) if and only if p(f+ | world_model)≥p(f− | world_model). Furthermore, to make things clearer, let complexity(M)=−log2(p(M)) (this is just informational entropy, the number of bits necessary to specify M under p) such that our condition becomes complexity(f+ | world_model)≤complexity(f− | world_model).
We’ll suppose that, if we don’t condition on world_model, complexity(f+)>complexity(f−), since f+ requires fully specifying a map from the internal deduction language to natural language, whereas f− doesn’t need to know anything about natural language as it can just look for deduced statements of the form H(x,q)=a (i.e. “the human answers a to question q about input x”). Intuitively, f+ has to actually understand q, which takes a lot of complexity, whereas f− can merely treat q as a sequence of characters that get fed into H, which should be much simpler. Then, the question here becomes, do we get all of the extra bits necessary to specify natural language in f+ back when we condition on world_model?
Maybe. I think that the answer here is very plausibly yes, simply because a world model that includes humans should have to include a lot of information about natural language that f+ can reuse, in which case M− shouldn’t have an advantage over M+ (though then we’d still need some version of Paul’s Step 2 to give M+ an advantage). However, I think there are plausible assumptions under which conditioning on world_model isn’t enough to level the playing field. For example, suppose the human model H in world_model is sufficiently low-level such that it’s just reasoning about neurons—then, there shouldn’t be any way to easily read off information about natural language from H to help specify f+. Alternatively, suppose the H model is high-level enough to include a human natural language deduction procedure (e.g. that shares some bits with deduction) but that includes a bunch of mistakes to account for human biases—in that case, if you tried to just read off the mapping from deduction to natural language included in such an H model for f+, the simplest implementation would seem to be to copy over all the mistakes, resulting in such an f+ just behaving the same as f− rather than actually being any more truthful.
Thus, at least for the purposes of this post, we’ll be assuming that this is actually a problem such that just using a “reasonable” simplicity prior isn’t sufficient to level the playing field between M+ and M−.
Paul’s original proposal
Defender
Next up, we’ll take a look at a variant of Paul’s proposed fix in Step 1 of “Teaching ML to answer questions honestly”. Specifically, we’ll be looking at the non-relaxed version without any stop gradients. We’ll talk about the stop gradients that Paul mentions later on—and as for Paul’s relaxation of the conditioning,[3] that’s really only useful to think about when we start translating into ML, which isn’t something we’ll be doing here. Thus, we’ll just be looking at the simplest, generative version of Paul’s proposed prior.
Suppose we give M two output heads, f1 and f2 (which we will be treating fully symmetrically here), and split M’s parameters into two groups, θ1 and θ2 (which we will not be treating symmetrically). Then, where p is some simplicity prior, we’ll let p∗ be the prior induced by the generative procedure θ1∼p(θ1)θ2∼p(θ2 | θ1)⋅I[∀(x,q,_)∈D. Mθ1,θ2|f1(x,q)=Mθ1,θ2|f2(x,q)] where I[condition]=1 if condition else 0, D is our dataset, Mθ1,θ2|fi is the fi head of Mθ1,θ2, and p(θ2 | θ1) is just the simplicity prior on θ2 given access to θ1.
Here’s why this might work. We’ll think of Mθ1, _ as a function of θ2 such that Mθ1,θ2(x,q)=Mθ1, _(θ2)(x,q). Then, we’ll define our desired Mθ+1, _ in pseudocode as
or in my notation as Mθ+1, _(θ2)(x,q)=world_modelH=θ2(x)↦deduction↦(f1=f+(q), f2=f−(q)) such that θ+1 implements f+, f−, and the entirety of world_model(x) except for a human-sized hole that’s filled in by θ2. Then, we’ll let θ+2=H be the correct human model.
To compare θ+1, θ+2 to something, we’ll define θ−1 as Mθ−1, _(θ2)(x,q)=world_model(x)↦deduction↦(f1=f−(q), f2=f−(q)) and let θ−2=∅ be empty.
Then, calculating complexity(M+) under p∗, we get complexity(M+)=complexity(θ+1)+complexity(θ+2 | θ+1, ∀(x,q,_)∈D. Mθ+1, θ2|f1(x,q)=Mθ+1, θ2|f2(x,q))=complexity(θ+1)+complexity(H | world_model−H, ∀D. f+=f−)=(complexity(bit-splitting in M+)+complexity(world_model−H)+complexity(f+ | world_model−H)+complexity(f− | world_model−H))+complexity(H | world_model−H, ∀D. f+=f−) where complexity(bit-splitting in M+) is the complexity of θ+1’s choice to put H in θ2 as opposed to any other splitting between θ1 and θ2.
For now, we’ll assume complexity(bit-splitting in M+)≈0, though we’ll flag that letting the defender make this assumption seems quite suspect. Moving forward regardless, however, and additionally assuming complexity(f−)≈0 since it should be negligible (and shouldn’t matter since it’s shared between M+ and M−), we get complexity(M+)≈complexity(world_model−H)+complexity(f+ | world_model−H)+complexity(H | world_model−H, ∀D. f+=f−).
Then, calculating complexity(M−) for comparison, complexity(M−)=complexity(θ−1)+complexity(θ−2 | θ−1, ∀D. f1=f2)=complexity(θ−1)+0=complexity(bit-splitting in M−)+complexity(world_model)+complexity(f− | world_model)≈complexity(world_model).
Now, determining if complexity(M−)≈complexity(M+), we need (using the shorthands comp=complexity, W=world_model)
complexity(M−)≈complexity(M+)
comp(W)≈comp(W−H)+comp(f+ | W−H)+comp(H | W−H, ∀D. f+=f−)
which, making the assumption that comp(W)≈comp(W−H)+comp(H | W−H), becomes
comp(W−H)+comp(H | W−H)≈comp(W−H)+comp(f+ | W−H)+comp(H | W−H, ∀D. f+=f−)
comp(H | W−H)≈comp(f+ | W−H)+comp(H | W−H, ∀D. f+=f−)
which, assuming that the posterior conditioned on ∀D. f+=f− is dominated by the simplest model,[4] becomes
comp(H | W−H)≈comp(f+ | W−H)+comp(H | W−H)−minθ2{comp(θ2 | W−H) | ∀D. Mθ+1, θ2|f1=Mθ+1, θ2|f2}
minθ2{comp(θ2 | W−H) | ∀D. f+H=θ2=f−H=θ2}≈comp(f+ | W−H).
Finally, we’ll argue that this (approximate) equality holds. Why? The basic intuition is that comp(H | W−H, ∀D. f+=f−) only has to pay for the parts of H that aren’t nailed down by enforcing that, whatever H is, when inserted into W−H, it has to make f+=f−. Thus, we only have to pay for anything about H beyond the fact that H says a bunch of things in natural language that match on to statements in the internal deduction language. But that means we don’t have to pay for the parts of H that define how it does natural language, which, hopefully, are precisely the same parts of H that comp(f+ | W−H) is paying for, completing the equality.
Attacker
Before I jump into possible attacks, it’s worth introducing an alternative formulation of the above setup that should hopefully make things somewhat clearer. In the above setup, we never actually use the two heads f1, f2 for anything other than checking equality, which means we can replace them with the alternative two heads f, f?, where f is the output head and f? is the “check” head such that we condition θ2 on f?=⊤. We can convert back to the previous formalism just by setting f1=f and f2=f if f? else ¬f where ¬f is any function such that ¬f≠f everywhere.
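As a purely illustrative sketch of the conversion back to the two-head formalism (not_f is any function that differs from f everywhere):

```python
def to_two_heads(f, f_check, not_f):
    # f1 is just the output head; f2 agrees with f1 exactly when the check
    # head passes, so conditioning on f1 = f2 over D is the same as
    # conditioning on f_check being true over D.
    f1 = lambda x, q: f(x, q)
    f2 = lambda x, q: f(x, q) if f_check(x, q) else not_f(x, q)
    return f1, f2
```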
One thing in particular that I think the f? formulation makes very clear is just how powerful this conditioning is that we’re giving the model for free: the model can enforce any condition on θ2 over the dataset just by paying the cost in bits of specifying that condition.
Bit-splitting
As it turns out, there are quite a lot of ways to attack the above proposal. In fact, we already touched on one—if complexity(bit-splitting in M+)>complexity(bit-splitting in M−), the above argument fails. We’ll call this problem the bit-splitting problem, and we don’t have a satisfactory solution to it. However, the situation is actually much worse than just the bit-splitting problem.
Composition with consistency
Now, we’ll propose a new attack. Let Mθ− ′1, _(θ2) be defined in pseudocode as
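(A minimal sketch matching the description just below; check_consistency stands for the ⊥∉deduced_stmts test, and deduction and f_minus are the components named in the text.)

```python
def M_minus_prime(theta_2):  # theta_1^-', viewed as a function of theta_2
    def model(x, q):
        # The entire world model now lives in theta_2.
        deduced_stmts = deduction(theta_2(x))
        return (f_minus(q, deduced_stmts),          # f: still just mimicry
                check_consistency(deduced_stmts))   # f?: no contradiction deduced
    return model
```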
such that we’ve moved the entirety of world_model to θ2 and conditioned θ2 on that world model being consistent when deduction is applied to it (i.e. ⊥∉deduced_stmts). The idea is that this should allow us to get a ton of savings in specifying the world model, as we only have to pay for anything about the world model on top of the fact that it must be consistent. To that end, we’ll let θ− ′2=world_model.
Now, we want to demonstrate that complexity(M− ′)<complexity(M+). Thus, calculating complexity(M− ′),
complexity(M− ′)=complexity(θ− ′1)+complexity(θ− ′2 | θ− ′1, ∀D. f?)
=complexity(θ− ′1)+complexity(world_model | ∀D. consistency)
=complexity(bit-splitting in M− ′)+complexity(deduction)+complexity(f−)+complexity(check_consistency)+complexity(world_model | ∀D. consistency)
≈complexity(world_model | ∀D. consistency)
but previously the defender argued that complexity(M+)≈complexity(M−)≈complexity(world_model) such that, if complexity(world_model | ∀D. consistency)<complexity(world_model), we have a successful attack.
This attack, however, is potentially patchable if we define a new θ+ ′1 such that
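(Again a sketch matching the complexity calculation just below: the patched θ+ ′1 keeps the world model in θ2 but conditions on both consistency and f+=f−.)

```python
def M_plus_prime(theta_2):  # patched theta_1^+', viewed as a function of theta_2
    def model(x, q):
        deduced_stmts = deduction(theta_2(x))
        honest = f_plus(q, deduced_stmts)
        mimicry = f_minus(q, deduced_stmts)
        # f?: require both consistency and agreement of the two embeddings.
        return (honest, check_consistency(deduced_stmts) and honest == mimicry)
    return model
```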
which, for θ+ ′2=world_model, gives us
complexity(M+ ′)=complexity(θ+ ′1)+complexity(θ+ ′2 | θ+ ′1, ∀D. f?)
=complexity(θ+ ′1)+complexity(world_model | ∀D. consistency∧f+=f−)
≈complexity(f+)+complexity(world_model | ∀D. consistency∧f+=f−)
such that we get complexity(M− ′)≈complexity(M+ ′) if and only if
complexity(world_model | ∀D. consistency)≈complexity(f+)+complexity(world_model | ∀D. consistency∧f+=f−).
It’s far from obvious whether this approximate equality holds, however. Now that we’re conditioning on both consistency and f+=f−, it’s unclear whether the second condition still gets us enough bits to pay for complexity(f+). In particular, complexity(f+) is the only term in the above expression that doesn’t get to condition on consistency. Previously, we were paying for complexity(f+) exactly by conditioning world_model on f1=f2; now that world_model is also conditioned on consistency (while f+ is not), any overlap between what conditioning on consistency gives us and what conditioning on f1=f2 gives us means we’ll no longer be able to pay for all of complexity(f+). For example, if consistency alone is enough to pin down a lot of the natural language information in H, that would undercut the bit savings from conditioning on f+=f−, preventing those savings from being able to pay for complexity(f+). We’ll call this problem the composition problem, with the idea being that we need to ensure that the way in which M+ pays for f+ composes with any other way of using the conditioning to save bits.
Dataset conditioning
This last attack is probably the most fatal to this specific approach—though also the one we have the most promising approach for fixing, as we’ll see in the next solution attempt. But first, suppose that for all x∈Dx, there is some splitting x=xa:xb (where : is concatenate) such that it is possible to predict xb from xa. Then, let
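(A sketch matching the equivalent notation on the next line; split and prob stand for the assumed splitting x=xa:xb and the deduced predictive distribution P.)

```python
def M_minus_prime(theta_2):  # theta_1^-' for the dataset-conditioning attack
    def model(x, q):
        deduced_stmts = deduction(theta_2(x))
        x_a, x_b = split(x)  # the assumed splitting x = x_a : x_b
        # f?: the deduced world model must predict x_b from x_a with certainty,
        # which over D forces theta_2 to be a good predictive world model.
        return (f_minus(q, deduced_stmts),
                prob(x_b, given=x_a, stmts=deduced_stmts) == 1)
    return model
```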
or equivalently Mθ− ′1, _(θ2)(x,q)=θ2(x)↦deduction↦(f=f−(q), f?=(P(xb | xa)=1)).
Now, I won’t go through the full complexity calculation here, since it’s very similar to the previous composition with consistency case, but if we try a similar fix as we did there, we get that complexity(M− ′)≈complexity(M+ ′) if and only if
complexity(world_model | ∀D. P(xb | xa)=1)≈complexity(f+)+complexity(world_model | ∀D. P(xb | xa)=1∧f+=f−).
In this case, however, ∀D. P(xb | xa)=1 is a much stronger condition than consistency—potentially strong enough to force the specification of most of world_model, likely including all of natural language, thus rendering the benefit from conditioning on ∀D. f+=f− nil.
Fundamentally, this problem, which we’ll call the dataset conditioning problem, is coming from the fact that we leaked information about our dataset into the prior by conditioning on f? specifically over D, rather than, as we’ll see in the next proposal, over all inputs.
Fixing dataset conditioning by removing the dataset
Defender
To attempt to fix the dataset conditioning problem, we’ll define a new p∗ given some simplicity prior p as the prior induced by the generative procedure
θ1∼p(θ1)
θ2∼p(θ2 | θ1)⋅I[∀x∈X. ∀q∈Q. Mθ1,θ2|f?(x,q)]
which is just the same as the previous p∗, except now f? is checked over all inputs, not just over the dataset (note that we still update on the dataset at the end—it’s just our prior which is now independent of it). Thus, we’ll call this the dataset-less proposal. Now, it’s worth noting that implementing such a check in ML is non-trivial, but should hopefully be doable given a GAN-like adversarial setup.
Moving forward regardless, we’ll define θ+1 in pseudocode as
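(A sketch consistent with the discussion just below; H_understands is the “should H understand this” check described in the text.)

```python
def M_plus(theta_2):  # dataset-less theta_1^+, viewed as a function of theta_2
    def model(x, q):
        deduced_stmts = deduction(world_model_minus_H(x, H=theta_2))
        honest = f_plus(q, deduced_stmts)
        mimicry = f_minus(q, deduced_stmts)
        # f?: only require agreement on inputs H should be able to understand,
        # since f+ = f- cannot be expected to hold on arbitrary x in X.
        return (honest,
                (not H_understands(x, q, deduced_stmts)) or honest == mimicry)
    return model
```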
where H_understands is some check that the human model is actually capable of understanding and correctly answering q given x. Note that we need H_understands to compute something like “should H understand this” rather than “does H understand this,” since the latter is essentially just equivalent to whether f+=f−, which inside of H_understands would make f? trivial. Moving forward regardless, we’ll let θ+2=H be the desired human model.
Now, if we want this M+θ+1, θ+2 to beat the same M− as in the previous proposal, we need, using essentially the same derivation as before,
comp(H | W−H)≈comp(f+ | W−H)+comp(H_understands | W−H, f+)+comp(H | W−H, H_understands→f+=f−)
which, assuming that comp(H_understands | W−H, f+)≈0 under the idea that H_understands should be relatively simple in terms of f+—though we’ll flag that this is a somewhat sketchy assumption to allow the defender to make—becomes
comp(H | W−H)≈comp(f+ | W−H)+comp(H | W−H, H_understands→f+=f−)
comp(H | W−H)≈comp(f+ | W−H)+comp(H | W−H)−minθ2{comp(θ2 | W−H) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}
minθ2{comp(θ2 | W−H) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}≈comp(f+ | W−H)
which, if H_understands(x,q) is equivalent to (x,q,_)∈D, is precisely the condition from before. While, as we’ll see later, we explicitly don’t want H_understands(x,q)↔(x,q,_)∈D, the hope here is that, for the most part, those two conditions should serve a similar role, with H_understands ensuring that we don’t need f+=f− on crazy inputs that H wouldn’t be correct on, but still giving us the bit savings from enforcing that H be correct on the inputs where it should be.
Attacker
First, note that the above proposal still falls prey to the bit-splitting and composition problems. Hopefully, however, it should resolve the dataset conditioning problem, though we’ll explore what that will actually take below.
Additionally, as we’ll also see below, the dataset-less approach brings to the fore some new problems as well.
Dataset conditioning
Suppose there exists some predicate is_real(x) such that ∀x∈X. is_real(x)↔x∈Dx. Then, we’ll define θ− ′1 as
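(A sketch of the hypothesized attack; is_real is the predicate assumed above, with split and prob as before.)

```python
def M_minus_prime(theta_2):  # theta_1^-' built around an is_real predicate
    def model(x, q):
        deduced_stmts = deduction(theta_2(x))
        x_a, x_b = split(x)
        # f?: only require perfect prediction on inputs flagged as real, which
        # recovers the dataset-conditioning savings without referencing D.
        return (f_minus(q, deduced_stmts),
                (not is_real(x)) or prob(x_b, given=x_a, stmts=deduced_stmts) == 1)
    return model
```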
such that Mθ− ′1, _|f?(x)↔(is_real(x)→P(xb | xa)=1).
In such a case, we get that complexity(M− ′)<complexity(M+) if and only if
complexity(M− ′)<complexity(M+)
complexity(M− ′)<complexity(M−)
complexity(is_real)+complexity(world_model | ∀D. P(xb | xa)=1)<complexity(world_model)
complexity(is_real)<complexity(world_model)−complexity(world_model | ∀D. P(xb | xa)=1)
complexity(is_real)<minθ2{complexity(θ2) | ∀D. PW=θ2(xb | xa)=1}
which depends heavily on the complexity of is_real. Thus, we can reduce the dataset conditioning problem in this case to the existence of such an is_real.
Then, since minθ2{complexity(θ2) | ∀D. PW=θ2(xb | xa)=1}≤complexity(world_model), for this to be a problem it’d have to be the case that complexity(is_real)<complexity(world_model). Fortunately, it does seem plausible that complexity(is_real) would have to be quite large, potentially comparable to complexity(world_model). Put simply, it seems like if you want to know what’s real, you have to be able to predict the world. In particular, predicting what data points will show up in the world is essentially precisely what world_model is supposed to be doing, such that is_real seems like it might just require a full copy of world_model.
That being said, there is a remaining risk here if there exists an is_real′ that isn’t just fully equivalent to x∈Dx—thus potentially allowing it to be much simpler—but such that conditioning on is_real′(x)→P(xb | xa)=1 still gives similar savings for complexity(world_model). We’ll look at a specific example of such an is_real′ in the next attack.
Furthermore, we’ll start referring to these sorts of dataset conditioning problems as composition problems, as they have essentially the same structure. What made the dataset conditioning problem not a type of composition problem in the previous proposal was that there we got is_real for free, whereas now we have to pay for complexity(is_real) just as we had to pay for complexity(check_consistency).
Composition with signature checking
We’ve already established that the composition with consistency problem that we saw in the previous proposal still applies in the dataset-less case. However, as the above dataset conditioning analysis suggests, if we can find a suitable is_real′, there might be a way to produce a new, particularly pernicious composition problem for the dataset-less proposal that looks similar to dataset conditioning.
To do that, we’ll let is_real′(x)=is_signedk(x), where is_signedk checks whether x is cryptographically signed under the public key k, for some public key k that both appears in the dataset itself and has documents signed with it in the dataset as well. Then, as above, we’ll let
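(A sketch matching the construction; is_signed stands for signature verification against the public key k found in the dataset.)

```python
def M_minus_prime(theta_2):  # theta_1^-' using signature checking as is_real'
    def model(x, q):
        deduced_stmts = deduction(theta_2(x))
        x_a, x_b = split(x)
        # f?: perfect prediction is only required on inputs signed under k.
        return (f_minus(q, deduced_stmts),
                (not is_signed(x, key=k))
                or prob(x_b, given=x_a, stmts=deduced_stmts) == 1)
    return model
```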
and let θ− ′2=world_model. Furthermore, as in the previous composition problem, we’ll compare to a θ+ ′1 defined as
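(And a corresponding sketch of this θ+ ′1, which layers the H_understands→f+=f− condition on top of the signature check.)

```python
def M_plus_prime(theta_2):  # patched theta_1^+' for the signature-checking case
    def model(x, q):
        deduced_stmts = deduction(theta_2(x))
        x_a, x_b = split(x)
        honest = f_plus(q, deduced_stmts)
        mimicry = f_minus(q, deduced_stmts)
        # f?: both the signature-gated prediction check and the
        # understanding-gated agreement check must hold.
        check = (((not is_signed(x, key=k))
                  or prob(x_b, given=x_a, stmts=deduced_stmts) == 1)
                 and ((not H_understands(x, q, deduced_stmts))
                      or honest == mimicry))
        return (honest, check)
    return model
```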
with θ+ ′2=θ− ′2=world_model.
Then, we get that complexity(M− ′)≈complexity(M+ ′) if and only if
comp(W | ∀x∈X. is_signed(x)→P(xb | xa)=1)≈comp(f+)+comp(W | ∀x∈X. (is_signed(x)→P(xb | xa)=1)∧(H_understands(x)→f+(x)=f−(x))).
Here, we see the same basic structure as the previous composition problem, where every term but comp(f+) gets to condition on is_signed→P(xb | xa)=1. Similarly to the composition with consistency problem, if there is any overlap between what conditioning on H_understands→f+=f− gives you and what conditioning on is_signed→P(xb | xa)=1 gives you, then we won’t be able to pay back all of complexity(f+) anymore.
In this case, the reason we might expect an overlap is a bit tricky. If the adversary that’s attempting to check the ∀X conditions is perfect, then it should be able to instantiate all possible signed strings, meaning that no bits are saved at all here. On the other hand, if the adversary has access to the dataset such that it can generate signed strings from the dataset, but not sign any new strings itself (because we’re assuming that the signature function is one-way), then the is_signed→P(xb | xa)=1 condition enforces that θ2 perfectly model the process generating such signed strings—which, if those strings are produced by humans in natural language, should help in specifying natural language, overlapping with the benefit of conditioning on H_understands→f+=f−.
Fortunately, there is a pretty simple way to fix this problem—just don’t give the adversary access to the dataset, such that (if we’re still assuming one-way signatures) it just won’t be able to generate any examples of signed strings at all, fixing the problem just as if the adversary were perfect. Unfortunately, such a requirement seems like it might render our dataset-less prior essentially unimplementable in machine learning, as we’re now required to find some way to train such an adversary without ever letting it look at the dataset, which could be quite difficult.
Understandability checking
Next, we’ll take a look at the obvious problem that we saw when the defender was going through their derivation, which is that the defender relied on the sketchy assumption that complexity(H_understands | world_model−H, f+)≈0.
Intuitively, it seems like H_understands should be easy to specify in terms of f+, but as we mentioned previously, the obvious implementation of H_understands↔f+=f− makes f? trivial. Thus, to get a more general notion of H_understands—“should H understand” rather than “does H understand”—we’ll presumably need to pay at least some non-negligible number of bits on top of f+.
Furthermore, we also can’t just have H_understands=is_real, otherwise we’d have to pay the entire cost of complexity(is_real), which we’re supposing is very large for this approach to be able to solve the dataset conditioning problem.
We’ll call this the understandability checking problem, and as we’ll see, unlike some of the other problems with the dataset-less case, we don’t have any satisfactory solution to this problem.
Event-avoiding humans
This next attack probably also occurs for Paul’s original proposal, but it’s significantly more pronounced in the dataset-less case.
For this attack, unlike many of our previous attacks, we’re going to attack the definition of θ+2 rather than the definition of θ+1: that is, we’ll let M− ′=Mθ+1, θ− ′2, leaving θ+1 alone. Then, we’ll define θ− ′2=H−H(E)+avoids(E) where E is some event, H(E) are the parts of H that are only necessary to compute H’s behavior in the event of E, and avoids(E) is some modification to H such that H+avoids(E) acts in such a way that world_modelH=H+avoids(E) never predicts that E will happen.
Intuitively, what’s happening here is that we’re setting H in such a way that it causes the world to be simpler (in that E never happens) and thus the f+=f− condition to be easier to achieve (in that we no longer need it to hold in the event of E). We’ll call this the event-avoidance problem. Notably, to ensure that we still fit the dataset, we do need to select E such that it never happens on D—but there should presumably be many such E.
Then, we get that complexity(M− ′)<complexity(M+) if and only if
comp(M− ′)<comp(M+)
comp(θ+1)+comp(H−H(E)+avoids(E) | θ+1, ∀X. H_understands→f+=f−)<comp(θ+1)+comp(H | θ+1, ∀X. H_understands→f+=f−)
comp(H−H(E)+avoids(E) | W−H, ∀X. H_understands→f+=f−)<comp(H | W−H, ∀X. H_understands→f+=f−)
comp(H−H(E)+avoids(E) | W−H)−minθ2{comp(θ2 | W−H) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}<comp(H | W−H)−minθ2{comp(θ2 | W−H) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}
comp(H−H(E)+avoids(E) | W−H)<comp(H | W−H)
comp(H−H(E) | W−H)+comp(avoids(E) | H−H(E), W−H)<comp(H | W−H)
comp(H | W−H)−comp(H(E) | H−H(E), W−H)+comp(avoids(E) | H−H(E), W−H)<comp(H | W−H)
comp(avoids(E) | H−H(E), W−H)<comp(H(E) | H−H(E), W−H)
which should hold so long as there are events not in the dataset that are easy to avoid but for which it’s quite complicated to understand what humans would do—e.g. what would a human do if they lived to a million years (not in the dataset, humans can be modified to never choose to live that long, and behavior in that case seems quite complicated).
Humans with rocks for brains
In the event-avoidance case, we strategically selected θ− ′2 to make f+=f− in an unintended way. However, there’s another way to get f+=f−∨¬H_understands—or equivalently, H_understands→f+=f−—which is to strategically select θ− ′2 to make H_understands false instead. Specifically, since H_understands is determined from deductions based on world_modelH=θ2, we can select θ2 to ensure we get H_understands→f+=f− by encoding for a human that simply doesn’t understand anything.
Naively, the problem with an attack of this form is that such a human model—which we’ll call Hbrains=rocks—won’t be able to fit the dataset. Thus, we can’t construct a traditional attack in the same sense as the previous ones where we create an alternative M− ′ with lower complexity than M+ using θ− ′2=Hbrains=rocks.[5]
However, there’s something else we can do instead. Previously, the defender reduced their burden to the approximate equality
minθ2{comp(θ2 | W−H) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}≈comp(f+ | W−H)
which, if we convert back to the inequality formulation, is really just the burden
comp(f+ | W−H)≤minθ2{comp(θ2 | W−H) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}.
Thus, the defender needs it to be the case that f+ is simpler than the simplest model such that H_understands→f+=f−. However, Hbrains=rocks is such a model, since we’re assuming it makes H_understands always false, which means the defender needs it to be the case that
comp(f+ | W−H)≤comp(Hbrains=rocks)
which, if we assume that comp(Hbrains=rocks)≈0, gives us a successful attack, which we’ll call the rocks for brains problem.
The rocks for brains problem seems basically fatal to this approach, since it wipes out all possible gains from conditioning on H_understands→f+=f−. It might be possible to salvage this approach if H_understands is somehow implemented independently of the H model, but that would presumably require complexity(H_understands) to be quite large, negating the defender’s argument that relied on complexity(H_understands | f+)≈0.
Directions for future work
As a result of the above analysis, we’ve managed to identify a bunch of concrete problems that seem to be blocking approaches of this form. Of course, there could also be other problems that we haven’t come up with yet, and there’s also the issue of converting these priors into loss functions—though that’s something that should probably come after we at least have something that seems to solve all of the problems that we’re aware of.
Overall, however, the problems we’ve described here seem like good places to start, and I’ll try to describe some thoughts on how they might be addressed below.
The bit-splitting problem
In Paul’s original post, he makes a nod to the bit-splitting problem by carving world_model−H in half and putting one half in θ+1 and the other half in θ+2. Paul does this because he’s modeling the underlying simplicity prior p as a simple Gaussian over the L2 norm of the parameters. Under such a model, there’s a simplicity benefit to splitting all computation as equally as possible across all available parameters.
However, it’s very unclear if any implicit neural network simplicity prior would actually behave that way. In particular, if we just change our simple model to use the L1 norm instead of the L2 norm, the benefit of splitting computation evenly across parameters goes away. Clearly, that’s not a great model either—but it goes to show that there are plausible models under which bit-splitting isn’t even a problem in the first place.
Fundamentally, what we need to do to alleviate the bit-splitting problem is look for architectures where θ1 and θ2 are allowed to grow as large as they want with only linear complexity penalties for doing so. Though we traditionally train only fixed-size neural networks, if there were a way to train a dynamically-sized neural network, where size was penalized linearly, that would at least seem to solve this problem—and such a modification doesn’t seem too difficult to make.
The understandability checking problem
Of the remaining problems, this is the one that I think is most likely to turn out to not actually be a problem—in that it doesn’t require any changes to the prior to solve. In many ways, the understandability checking problem is just coming from our inability to define an H_understands check that’s simple in terms of f+ and world_model−H—but at least intuitively, it really does seem like such a simple H_understands implementation should exist.
The event-avoidance problem
In Paul’s original post, he notes that one possible fix is to add stop gradients.
Essentially, what Paul is arguing here is that, going back to the original formulation, we should add a stop gradient operation to our generative procedure such that we get
θ1∼p(θ1)
θ2∼p(θ2)⋅I[∀(x,q,_)∈D. stop_grad(Mθ1,θ2|f1(x,q))=Mθ1,θ2|f2(x,q)].
Unfortunately, this isn’t actually a generative procedure anymore—the only straightforward way to make sense of this is to translate it into a loss function and think about what SGD would do, which makes this prior much harder to analyze—as Paul notes when he talks about his trepidation of going down this route.
Hopefully, however, adding this stop gradient should solve the event-avoidance problem, since it should remove the incentive to modify f1=f+ to make the world simpler (e.g. preventing E from happening).
Fortunately, despite Paul’s prior worries about not being able to translate this trick into the program search case, it turns out it’s not too difficult to convert Paul’s stop gradient formulation into a generative prior that we can analyze as we have been doing previously. Furthermore, there are many such translations—in particular, we can pick either a dataset-less or non-dataset-less version, depending on whether we prefer being exposed to the dataset conditioning problem or the rocks for brains problem. In this case, I’ll present a dataset-less formulation, which should solve dataset conditioning but unfortunately won’t solve rocks for brains.
Now, given some simplicity prior p, we’ll define p∗ as the prior induced by the following generative procedure:
1. θ1∼p(θ1)
2. θ2a∼1
3. θ2b∼p(θ2b | θ1)⋅I[∀x∈X. ∀q∈Q. Mθ1, θ2a|fguard(x,q)→Mθ1, θ2a|f1(x,q)=Mθ1, θ2b|f2(x,q)]
4. if θ2a≠θ2b, resample
First, note the uniform sampling in step 2. This is the trickiest part of this algorithm—and probably impossible to do in a neural network setting. Technically, the model space of a neural network is finite, so uniform sampling is possible—though the obvious algorithm of sampling uniformly from weight space doesn’t work, since the parameter-function map is biased towards simplicity. Hopefully, just using a stop gradient will replicate what this algorithm is doing—and intuitively it does feel like they’re doing the same thing—but I don’t have any strong justification for that. Thus, it is somewhat unclear whether this procedure will actually have a nice translation when we convert back into an ML setup with stop gradients, but we’ll put that aside for now.
Second, to make sense of our use of M above, we’ll let M have three heads, f1, f2, and fguard, where fguard is a Boolean head that “guards” whether or not we do the f1=f2 check, and f1,f2 are two output heads. Furthermore, we’ll let M have two groups of parameters, θ1 and θ2, such that Mθ1, θ2(x,q)=Mθ1, _(θ2,x,q). Thus, we’ll let Mθ1, θ2a(x,q)=Mθ1, _(θ2a,x,q) and Mθ1, θ2b(x,q)=Mθ1, _(θ2b,x,q).
Intuitively, what’s going on here is that, in sampling a new set of θ2b parameters, wherever we want a stop gradient, we use the old θ2a, and wherever we don’t want a stop gradient, we use the new θ2b—thus ensuring that we only sample θ2b in such a way that incentivizes the prior to be higher through the non-stop-gradiented places where θ2b shows up, not through the stop-gradiented places where we use θ2a.
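To make this concrete, here’s a rough rejection-sampling sketch of the above generative procedure (sample_uniform, sample_simplicity_prior, and M are hypothetical stand-ins, M returns its heads as a dict, and the ∀X, ∀Q check is written as a loop purely for illustration):

```python
def sample_from_p_star(X, Q):
    theta_1 = sample_simplicity_prior()
    while True:
        theta_2a = sample_uniform()  # step 2: uniform, not simplicity-weighted
        theta_2b = sample_simplicity_prior(given=theta_1)
        # Step 3's indicator: wherever f_guard (computed from the old theta_2a)
        # holds, the old f1 (from theta_2a) must agree with the new f2 (from theta_2b).
        ok = all((not M(theta_1, theta_2a, x, q)["f_guard"])
                 or M(theta_1, theta_2a, x, q)["f1"] == M(theta_1, theta_2b, x, q)["f2"]
                 for x in X for q in Q)
        # Step 4: resample unless the two copies of theta_2 came out identical.
        if ok and theta_2a == theta_2b:
            return theta_1, theta_2a
```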
Now, we’ll define our desired θ+1 in pseudocode as
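(A sketch consistent with the description just below, where the human-sized hole can be filled by either copy of θ2.)

```python
def M_plus(theta_2, x, q):  # theta_1^+ as a function of (theta_2, x, q)
    # theta_2 is whichever copy (theta_2a or theta_2b) the prior plugs in;
    # it fills the human-sized hole in the world model.
    deduced_stmts = deduction(world_model_minus_H(x, H=theta_2))
    return dict(f1=f_plus(q, deduced_stmts),
                f2=f_minus(q, deduced_stmts),
                f_guard=H_understands(x, q, deduced_stmts))
```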
where θ+1 uses θ2, either θ2a or θ2b, to give it its human model.
Then, we have to determine what θ2 will be favored given the above θ+1. First, consider θ+2=H. In that case, we get the complexity
complexity(θ2=H | θ+1)≈complexity(H | W−H, H_understandsθ2a=H→f+θ2a=H=f−θ2b)
≈complexity(H | W−H)−minθ2b{complexity(θ2b | W−H) | ∀X. H_understandsH=H→f+H=H=f−H=θ2b}
which, assuming that conditioning on H_understands→f+=f− exactly pays back complexity(f+) (which is false due to the rocks for brains problem, but we’re just trying to solve event-avoidance here), reduces to
≈complexity(H | W−H)−complexity(f+ | W−H).
Now, consider θ−2=H−H(E)+avoids(E), as in the event-avoidance problem. In that case, we get the complexity
complexity(θ2=H−H(E)+avoids(E) | θ+1)≈complexity(H−H(E)+avoids(E) | W−H, H_understandsθ2a=H−H(E)+avoids(E)→f+θ2a=H−H(E)+avoids(E)=f−θ2b)
but then, since avoids(E) being in θ2b is entirely unhelpful in making H_understandsθ2a=H−H(E)+avoids(E)→f+θ2a=H−H(E)+avoids(E)=f−θ2b hold—since it only affects f+, which already has avoids(E) in its H—we get
≈complexity(H−H(E) | W−H, H_understandsθ2a=H−H(E)+avoids(E)→f+θ2a=H−H(E)+avoids(E)=f−θ2b)+complexity(avoids(E) | W−H, H−H(E))
≈complexity(H−H(E)+avoids(E) | W−H)−minθ2b{complexity(θ2b | W−H) | ∀X. H_understandsH=H−H(E)+avoids(E)→f+H=H−H(E)+avoids(E)=f−H=θ2b}+complexity(avoids(E) | W−H, H−H(E))
≈complexity(H−H(E)+avoids(E) | W−H)+complexity(avoids(E) | W−H, H−H(E))−minθ2b{complexity(θ2b | W−H) | ∀X. H_understandsH=H−H(E)+avoids(E)→f+H=H−H(E)+avoids(E)=f−H=θ2b}
≈complexity(H−H(E)+avoids(E) | W−H)+complexity(avoids(E) | W−H, H−H(E))−complexity(f+ | W−H)
where we’re again assuming that we exactly pay back complexity(f+ | W−H) even if we use H−H(E)+avoids(E) as our basis of comparison.
Then, we get that θ−2 underperforms the desired θ+2=H if and only if
complexity(H | W−H)−complexity(f+ | W−H)<complexity(H−H(E)+avoids(E) | W−H)+complexity(avoids(E) | W−H, H−H(E))−complexity(f+ | W−H)
0<−complexity(H(E) | W−H, H−H(E))+complexity(avoids(E) | W−H, H−H(E))+complexity(avoids(E) | W−H, H−H(E))
complexity(H(E) | W−H, H−H(E))<2 complexity(avoids(E) | W−H, H−H(E))
which shows how we’ve now gotten a factor of two leg up on the event avoidance problem.[6] Unfortunately, there might still be event avoidance problems severe enough that a factor of two isn’t enough to rule them out, but I suspect that there are ways to iterate this procedure to get an arbitrarily large factor there, though I haven’t done that analysis yet.
Finally, to validate that our θ+1 construction is complexity-competitive, we can compare to the alternative θ−1 construction
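(A sketch of the comparison construction described on the next line.)

```python
def M_minus(theta_2, x, q):  # theta_1^- never uses theta_2
    deduced_stmts = deduction(world_model(x))
    return dict(f1=f_minus(q, deduced_stmts),
                f2=f_minus(q, deduced_stmts),
                f_guard=False)  # the guard never fires, so theta_2 is unconstrained
```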
which, since it sets fguard always to false and never uses θ2, is just equivalent to our original M−.
Then, comparing complexities, we need
comp(M+)≤comp(M−)
comp(θ+1)+comp(θ+2 | θ+1, fguard→f1=f2)≤comp(θ−1)
comp(θ+1)+comp(H | W−H, H_understands→f+=f−)≤comp(W)
comp(W−H)+comp(f+ | W−H)+comp(H_understands | W−H, f+)+comp(H | W−H, H_understands→f+=f−)≤comp(W−H)+comp(H | W−H)
comp(f+ | W−H)+comp(H | W−H, H_understands→f+=f−)≤comp(H | W−H)
comp(f+ | W−H)+comp(H | W−H)−minθ2{comp(θ2 | W−H) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}≤comp(H | W−H)
which, assuming that the conditioning exactly pays off complexity(f+ | W−H), reduces to
comp(f+ | W−H)+comp(H | W−H)−comp(f+ | W−H)≤comp(H | W−H)
0≤0
as desired.
The composition problem
For the composition problem, the central issue is that we’re not convinced that we can get the approximate equality complexity(world_model | consistency)≈complexity(f+)+complexity(world_model | consistency, f+=f−) even given the approximate equality complexity(world_model)≈complexity(f+)+complexity(world_model | f+=f−).
Fundamentally, one of the main reasons this is a problem is that complexity(f+) doesn’t get to condition on consistency, since it has to be defined in θ1. Now, even after fixing that problem, it could be that we’re still not convinced of the approximate equality
complexity(world_model | consistency)≈complexity(f+ | consistency)+complexity(world_model | consistency, f+=f−).
However, at the very least, reducing the problem down to this case seems like major progress.
Though I haven’t fully fleshed it out yet, I believe that reducing the composition problem to the above case is possible via a prior that uses something like the following generative procedure (where p is some simplicity prior and Mθf, _ : Θ0×⋯×Θn→(F, F?)):
n : N∼p(n)
θf∼p(θf | n)
θ0∼p(θ0)⋅I[∃θ1,…,θn. Mθf, _(θ0,…,θn)|f?]
θ1∼p(θ1)⋅I[∃θ2,…,θn. Mθf, _(θ0,…,θn)|f?]
⋯
θn−1∼p(θn−1)⋅I[∃θn. Mθf, _(θ0,…,θn)|f?]
θn∼p(θn)⋅I[Mθf, _(θ0,…,θn)|f?]
where the idea is to allow for arbitrarily many “rounds” of conditioning in sequence, thus allowing the consistency conditioning to happen first, followed by the f+=f− conditioning, and therefore allowing the specification of f+ to benefit from the consistency conditioning.
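To illustrate the structure, here’s a rough rejection-sampling sketch of this iterated procedure (exists_completion is a hypothetical oracle for the ∃θi+1,…,θn conditions, which is exactly the part that would be hard to implement):

```python
def sample_iterated_prior():
    n = sample_simplicity_prior_over_naturals()
    theta_f = sample_simplicity_prior(given=n)
    thetas = []
    for i in range(n + 1):
        while True:
            theta_i = sample_simplicity_prior()
            # Accept theta_i only if the remaining parameters can still be
            # chosen so that the f? head comes out true.
            if exists_completion(theta_f, thetas + [theta_i], remaining=n - i):
                thetas.append(theta_i)
                break
    return theta_f, thetas
```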
Though I don’t know if this sort of approach will fully resolve the composition problem—and even if it does the existence conditions could be extremely difficult to implement in an ML setup—at the very least I think it’s a good place to start for future work.
The rocks for brains problem
Just as the dataset conditioning problem seemed essentially fatal for Paul’s original proposal, the rocks for brains problem seems essentially fatal for the dataset-less proposal, putting us in a sort of “pick your poison” scenario. That being said, there are some potential paths forward for addressing the rocks for brains problem.
Fundamentally, the rocks for brains problem is coming from the difficulty of performing the H_understands check—thus, if we could remove the need to perform such a check, or find a cheap implementation for it that doesn’t reference the human model at all, we could solve the rocks for brains problem. Really, all H_understands needs to do is identify some reasonably large set of questions for which H should always be correct. For example, if H_understands is true for all questions about “basic” facts about the world, and false otherwise, that would probably be sufficient. The trick, then, is in finding some notion of “basic” facts that’s simple to specify without using a human model—which doesn’t actually seem that difficult.
Conclusion
Finally, as a last takeaway, I think that one thing that really stood out to me in all of this analysis is that the space of possible machine learning training procedures is really quite large and underexplored—there are a lot more, often very weird and very different, possible ML training setups than the ones that are traditionally used in practice. Overall, I think this makes me pretty optimistic that searching for fundamentally new ways to train machine learning models is an extremely valuable line of attack, both in terms of new very general approaches like amplification, debate, recursive reward modeling, market making, or imitative generalization, as well as new very specific approaches like those in this post. My strong sense is that, even with all of the different proposals and approaches that we have, we’re really only scratching the surface of the different sorts of training procedures that are actually out there.
[1] It’s worth flagging that the zero loss assumption is somewhat questionable if we don’t expect to train to convergence—but it’s at least a plausible assumption, it makes the analysis a lot easier, and I don’t expect it to be hiding major issues, so it seems fine at least for the purposes of this post.
[2] In an unbounded compute setting, the chain rule of conditional entropy gives us that complexity(A, B)=complexity(A)+complexity(B | A). However, if A can be a one-way function of B, then in general we just get the inequality complexity(A, B)≤complexity(A)+complexity(B | A). Throughout this post, however, we’ll make use of the full approximate equality complexity(world_model, f)≈complexity(world_model)+complexity(f | world_model), where the hope is that this should make sense given that, in the neural network setting, f+,f− would need to be near the end of the network, and thus should just be functions of world_model. Additionally, given that we expect world_model to be significantly more complex than f+ or f−, even in general we shouldn’t be changing much by doing this.
[3] Paul’s relaxation is to translate I[∀(x,q,_)∈D. Mθ1,θ2|f1(x,q)=Mθ1,θ2|f2(x,q)] into exp(−E[|Mθ1,θ2|f1(x,q)−Mθ1,θ2|f2(x,q) | (x,q,_)∼D|2]).
[4] Note that this assumption is somewhat sketchy. Paul conjectures that this approximation is only ever off by a constant factor, though that’s not necessarily very comforting if we don’t have an estimate for the size of that factor, nor a proof of that conjecture. In general, we only get the inequality complexity(A)−minA′{complexity(A′) | P}≤complexity(A | P)≤complexity(A). Fortunately, we’ll mostly just be using this assumption as an intuition pump, with most of the analysis working just fine without it. When we do lean on it more heavily, it’ll only be in the direction where we’re actually guaranteed the inequality.
[5] While θ− ′2=Hbrains=rocks doesn’t work for this, there is a way to use the rocks for brains problem to construct an attack in the same vein as our previous attacks where we build an M− ′ with lower complexity than M+. Let M− ′=Mθ+1, θ− ′2. Then, since the output head in θ+1 just runs f+, that means we need θ− ′2 to provide a detailed enough picture of how humans work to enable f+ to answer any questions about humans in the dataset correctly—but it need not be any more detailed than that. In particular, the human model need not be detailed enough to ensure anything about non-human-related inputs, so long as it can ensure that H_understands is always false for such inputs. Thus, let
Hθ− ′2(x,q)=H−H(¬H_related) if H_related(x,q) else Hbrains=rocks
where H_related(x,q) determines if the inputs require knowledge of humans, H(¬H_related) are the parts of H that are only necessary to compute H’s behavior on non-human-related inputs (such that H−H(¬H_related) is everything necessary for H_related inputs), and Hbrains=rocks is a human that understands nothing (such that H_understands is always false). The idea here is that, for such a θ− ′2, we should get H_understandsH=θ− ′2→H_related. Then, calculating complexity(θ− ′2 | θ+1, ∀X. H_understands→f+=f−), we get
comp(θ− ′2 | θ+1, ∀X. H_understands→f+=f−)=comp(H−H(¬H_related) | θ+1)+comp(H_related | H−H(¬H_related), θ+1)+comp(Hbrains=rocks | θ+1)−minθ2{comp(θ2 | θ+1) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}
≈comp(H−H(¬H_related) | θ+1)+comp(H_related | H−H(¬H_related), θ+1)−minθ2{comp(θ2 | θ+1) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}
which, assuming that we can specify H(¬H_related) after H−H(¬H_related) without gaining complexity, becomes
≈comp(H | θ+1)−comp(H(¬H_related) | H−H(¬H_related), θ+1)+comp(H_related | H−H(¬H_related), θ+1)−minθ2{comp(θ2 | θ+1) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}
and since this attack leaves θ+1 alone, we need only compare to θ+2, which has
comp(θ+2)=comp(H | θ+1, ∀X. H_understands→f+=f−)≈comp(H | θ+1)−minθ2{comp(θ2 | θ+1) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}
such that we get comp(θ− ′2 | θ+1)<comp(θ+2 | θ+1) if and only if
comp(θ− ′2 | θ+1)<comp(θ+2 | θ+1)
comp(H | θ+1)−comp(H(¬H_related) | H−H(¬H_related), θ+1)+comp(H_related | H−H(¬H_related), θ+1)−minθ2{comp(θ2 | θ+1) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}<comp(H | θ+1)−minθ2{comp(θ2 | θ+1) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}
−comp(H(¬H_related) | H−H(¬H_related), θ+1)+comp(H_related | H−H(¬H_related), θ+1)<0
comp(H_related | H−H(¬H_related), θ+1)<comp(H(¬H_related) | H−H(¬H_related), θ+1).
Then, the idea is that H_related should be pretty straightforward, since it doesn’t need to do much more than check whether world_model(x) makes use of H—and removing the need to specify H(¬H_related) should be a big complexity bonus, since it removes the need to encode any general human beliefs about the world that aren’t directly relevant to answering questions about other humans.
[6] Note that a similar analysis to that given for θ−2=H−H(E)+avoids(E) can also be given for θ−2=H−H(¬H_related) if H_related else Hbrains=rocks, the rocks for brains example that does fit the dataset as given in a previous footnote.