Ideological Inference Engines: Making Deontology Differentiable*

* Rather, making deontology play well with differentiable systems trained end-to-end.

This post is part of my hypothesis subspace sequence, a living collection of proposals I’m exploring at Refine. Preceded by oversight leagues, and followed by representational tethers.

TL;DR: An ideological inference engine is a mechanism for automatically refining a given propositional representation of human values (e.g. a normative charter, a debate stance) in an attempt to disambiguate and generalize it to novel situations. While the inference algorithm and the seed representation form the crux of the system, a multi-modal entailment verifier is employed to order possible agent behaviors based on their compatibility with the estimated ideology. This proposal then describes a way of instilling deontological drives in prosaic systems while maintaining the appeal of end-to-end differentiation. Ideological inference engines draw on ideas from traditional expert systems, but replace much of the clunky symbolic manipulation with contemporary LLMs and NLI models.
Intro
Ideological inference engines are a slightly more general framework than oversight leagues, in the sense that they rely on several global assumptions, while each more concrete instance of the proposal requires additional assumptions when designing the seed representation, the inference algorithm, and the entailment verifier. Here’s a non-exhaustive list of the global assumptions:
Assumption 1, “Small Seed To Big Tree”: Given a suitable inference algorithm and a finite propositional representation of human values, it is possible to estimate human values arbitrarily well given arbitrary amounts of compute. “Arbitrarily well” refers to there being an arbitrarily low error in the estimation. In the limit, the seed knowledge base would grow into the True Name of human values in propositional form.
Assumption 2, “Linear Capability Ordering”: Similar to the assumption invoked in oversight leagues, this states that a system composed of a fixed knowledge base and a fixed entailment verifier would eventually be gamed by an agent whose capability is constantly increasing. This is due to the more complex agent becoming able to exploit the inaccuracies of the knowledge base with respect to actual human values.
Assumption 3, “Quantifiable Propositional-Behavioral Gap”: The compatibility of a proposed behavioral sequence and a propositional representation of human values is computable. There is a fundamental relation between one’s values and one’s actions, and we can measure it. A close variant appears to be invoked in the CIRL literature (i.e. we can read one’s values off of one’s behavior) and in Vanessa Kosoy’s protocol (i.e. we can narrow in on an agent’s objective based on its behavioral history).
Assumption 4, “Avoidable Consequentialist Frenzy”: It’s possible to prevent the agent-in-training from going on a rampage in terms of an outcome-based objective (e.g. get human upvotes) relative to a simultaneous process-based objective (i.e. the present deontological proposal). This might be achieved by means of myopia or impact measures, but we’re not concerned with those details here — hence, an assumption.
Together, these global assumptions allow for a mechanism for approaching human values in propositional form, before employing the resulting representation to nudge an ML model towards being compatible with it.
Proposal
Ideological inference engines (IIE) are a modular framework for implementing such a mechanism. Each such system requires the following slots to be filled in with more concrete designs, each of which has attached assumptions related to whether it’s actually suited to its role:
Knowledge Base (KB): The way human values are supposed to be represented in natural language. The same representation medium would be involved throughout the refinement process, starting with the seed knowledge base being explicitly specified by humans. Here are some example representations to fill this slot of the proposal with:
Debate Stance: There is an ongoing dialogue among different idealized worldviews. The ideology which humans want to communicate to the ML model is initially represented through one particular side which is participating in the debate. It consists of a finite set of contributions to the dialogue in their respective contexts. What’s more, the other sides taking part in the debate represent negative examples for the desired ideology, providing additional bits of information through contrasts. As the inference algorithm is employed to refine the seed representation, different contributions would populate the knowledge base in a similar way, representing more and more pointers towards the underlying ideology.
Normative Charter: In this other way of filling up the knowledge base slot of the proposal, humans would try their best to communicate the desired ideology through a large set of axioms expressed in natural language. Those wouldn’t need to be logically coherent, just like the mess that is human values. They might be descriptive, or they might be prescriptive. As the inference algorithm further develops the seed which humans put in, the knowledge base would contain possibly more and possibly different propositions meant to capture the underlying ideology.
Stream of Consciousness: In this other format, humans would fill up the seed representation with their verbatim thought processes as much as possible. The particular humans involved in the process would implicitly be the representation target. As the inference algorithm does its thing, the knowledge base would contain extrapolated versions of the streams of consciousness involved.
Inference Algorithm (⊢): The means of refining a given knowledge base by producing a slightly more comprehensive and nuanced version of it. This component is generally static, as the knowledge base is the locus of adaptability. One way of looking at the inference algorithm is as a gradual optimizer exerting pressure on the knowledge base in an attempt to refine it. Another way of looking at it is as a transition function applied to the seed, iteratively turning it into a result, similar to a rule set for the Game of Life. Here are some ways one might fill this slot:
Babble: The knowledge base might be trivially extended by prompting an LLM to generate new propositions in the same vein as the old ones. This builds on the large amounts of implicit knowledge captured by the LLM during conventional training. This implicit knowledge is initially called forth to help disambiguate what the humans meant by the seed representation and overspecify it using new propositions.
Babble & Prune: Similar to babble, but instead of openly populating the knowledge base with just about any new proposition the LLM might come up with, a pruning step is employed to filter the LLM’s suggestions. While pruning appears to be a slot to be filled in itself, one low-hanging fruit choice for it would be to only greenlight propositions which garner a certain amount of “support” from the previous knowledge base, as measured through pretrained NLI models (or LLMs prompted/fine-tuned to mimic them). This feels like an abductive choice for the inference algorithm, in that the hypotheses are ranked by how plausible they are to follow from the premises.
Debate: Similar to babble, but focused on extending a dialogue-like representation forward. The different sides taking part in the debate would be simulated together, helping ensure their relative consistency. Multiple future developments of the present state of the debate might be considered, to code for uncertainty. As another important difference from babble, pretrained NLI models would be employed to elevate the debate to a certain standard of quality by incentivizing the underlying LLM to simulate stances which are consistent and competitive. Consistency would be operationalized as the contributions of a given side generally supporting each other (via NLI), while competitiveness would be operationalized by encouraging the contributions of a given side to ruthlessly contradict opposing sides (via NLI). Note: In contrast to other alignment proposals making use of debate, the present one uses it solely as a means of indefinitely extrapolating a representation of human values expressed as one of the sides, rather than as e.g. a means of uncovering deception among superintelligent entities. The goal isn’t to win the debate, it’s the journey that counts as an attempt at memetic modeling.
Counterfactual Cross-validation: This inference algorithm includes a heuristic for focusing computational resources on patching the knowledge base in regions which are particularly brittle. It goes as follows. If a random subset of the knowledge base (e.g. a given proportion, or a fixed number of propositions) were removed, would the remaining propositions together still prevent the ML model from violating the removed ones in a boxed setting? If those principles turn out not to be redundantly encoded in the remaining knowledge base, then babble & prune efforts are focused on patching up that part of the map of human values, as an attempt to avoid brittle directives.
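To make the Babble & Prune option above concrete, here is a minimal sketch in Python. The propose and entailment_prob callables are placeholders for an LLM wrapper and an NLI wrapper respectively, not references to any particular library, and the specific pruning rule (a fixed number of sufficiently supportive premises) is just one way of operationalizing “support”.

```python
from typing import Callable, List

def babble_and_prune(
    kb: List[str],
    propose: Callable[[List[str], int], List[str]],   # placeholder: LLM wrapper babbling candidates in the vein of the KB
    entailment_prob: Callable[[str, str], float],     # placeholder: NLI wrapper, P(premise entails hypothesis)
    n_candidates: int = 32,
    support_threshold: float = 0.7,
    min_supporters: int = 3,
) -> List[str]:
    """One refinement step (⊢): babble candidate propositions, then prune them
    down to those garnering enough support from the existing knowledge base."""
    candidates = propose(kb, n_candidates)
    accepted = []
    for hypothesis in candidates:
        # Count how many existing propositions individually support the candidate.
        supporters = sum(
            1 for premise in kb
            if entailment_prob(premise, hypothesis) >= support_threshold
        )
        if supporters >= min_supporters:
            accepted.append(hypothesis)
    return kb + accepted
```

Aggregating raw NLI scores, penalizing contradictions, or pruning via a prompted LLM would slot into the same place as the support count.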
Entailment Verifier (⊨): The means of estimating the compatibility of a knowledge base with a given behavior. Does not really interact with the inference algorithm. Entailment verifiers might be used to rank a set of final behavioral sequences considered by the model, or provide an earlier heuristic for guiding the decoding process. Here’s how you might design one:
Multi-modal Sequence Models: A multi-modal sequence model resembling Gato or Flamingo might be employed to estimate propositional-behavioral compatibility by means of autoregressive coherence. If the propositions contained in the knowledge base are likely to follow from the behavior, or if the behavior is likely to follow from the propositions, then compatibility is high. If one isn’t particularly likely to follow from the other, then compatibility is low. However, the resulting value might be difficult to interpret, largely because the marginal likelihoods of the behavior and of the propositions themselves confound it, outside of their conditional influences on each other. One way to account for this might involve generating negated versions of the propositions and measuring the relative likelihood of the original ones to follow from or yield the behavior, compared to the negated ones.
Overton Probing: While in the previous design the locus of bringing together the propositional and the behavioral was the modality-agnostic sequence being modeled, in Overton probing this relating process happens explicitly on the propositional side. The idea here is to first generate either (1) a textual description of the behavior being considered, or (2) a textual description of the predicted outcome of said behavior. The entailment verification would then consist of gauging the support garnered by that textual description from the propositions which make up the knowledge base, potentially again through pretrained NLI models or LLMs prompted to carry out the same task (see the sketch after this list).
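Here is a correspondingly minimal sketch of the Overton probing verifier, reusing the same hypothetical entailment_prob wrapper; describe stands in for whatever model produces the textual description of the behavior or of its predicted outcome.

```python
from statistics import mean
from typing import Any, Callable, List

def overton_probe(
    behavior: Any,                                     # raw behavioral sequence, whatever its modality
    kb: List[str],
    describe: Callable[[Any], str],                    # placeholder: textual description of the behavior or its predicted outcome
    entailment_prob: Callable[[str, str], float],      # placeholder: NLI wrapper, P(premise entails hypothesis)
) -> float:
    """Entailment verification (⊨): gauge the support the described behavior
    garners from the propositions making up the knowledge base."""
    description = describe(behavior)
    support = [entailment_prob(premise, description) for premise in kb]
    return mean(support)  # higher = more compatible with the estimated ideology
```

Ranking a set of candidate behaviors, or guiding decoding, then amounts to sorting by this score.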
By filling all three slots of an IIE with concrete designs, you get a pipeline which is able to measure the compatibility of an ML model’s behaviors with a deontology being constantly refined. A straightforward way of quickly using this to actually nudge the model being trained is to swap out the human feedback in RLHF for the compatibility estimate yielded by the present system. It’s relevant to note that this “differentiable deontology” is only meant to provide one part of the final objective function, with the “Avoidable Consequentialist Frenzy” assumption keeping other parts from completely overpowering it as a drive. IIEs are meant to be employed as correcting terms attached to a main objective, at least during takeoff.
The inference algorithm would regularly be employed to improve the estimate of human values by updating the knowledge base, so that the whole system stays adaptive. However, the inference algorithm can only bring the representation ever closer to the ground truth, without reaching it exactly in finite time. Its goal is to bring it as close as necessary for the main model being trained not to be able to game it during the training process.
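As a rough illustration of how the pieces might be wired together, the sketch below reuses the two functions sketched above and treats the compatibility estimate as the feedback signal of an RLHF-style loop. sample_behaviors and update_policy are again placeholders rather than any actual RL library’s API.

```python
def iie_training_round(model, kb, sample_behaviors, update_policy,
                       propose, entailment_prob, describe):
    """One round: refine the knowledge base (⊢), then score candidate behaviors (⊨)
    and feed the scores in where human feedback would otherwise go."""
    kb = babble_and_prune(kb, propose, entailment_prob)         # keep the deontological objective adaptive
    behaviors = sample_behaviors(model)                         # candidate rollouts from the model-in-training
    rewards = [overton_probe(b, kb, describe, entailment_prob)  # compatibility with the current deontology
               for b in behaviors]
    model = update_policy(model, behaviors, rewards)            # e.g. the policy update step of an RLHF-style pipeline
    return model, kb
```

Note that nothing here requires the compatibility score itself to be differentiable; like human preference labels in RLHF, it only needs to be usable as a reward-like signal.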
Aside: The present proposal is a generalization of two previous ones, termed memetic colonies and deontic arrays. This leaks through the different choices available for the ingredients required to implement an IIE, as they’re somewhat correlated.
That said, I believe the present generalized proposal can still yield generative effects (i.e. combinatorial mix-and-matching possibilities), despite certain components working better together than others. Plus, it allows new choices for each of the three components which didn’t intuitively fit with the two previous proposals.
However, the generalized proposal might inadvertently obscure some useful features of the previous ones. One feature whose loss feels unfortunate is the idea of looking at the propositions comprising the knowledge base as discrete individual directives nudging the ML model in different ways (e.g. attractors, repellers, dipoles, etc. across state space or model space), whose associated “field theory” might yield clues as to the structure of the optimization surface (e.g. distribution of local minima, workable stable orbits, etc.). Those nudges get obscured somewhat through the merging of directives into a unified knowledge base before considering their influence, rather than after. I plan to skim through a textbook on electrodynamics to get a better sense of whether there’s value here, and try to perform a conceptual rearrangement if so.
The general structure of this proposal is directly inspired by the structure of traditional expert systems, which also feature a knowledge base, an inference engine, and a way of determining entailment. Whenever an alignment proposal touches on a field as grounded in theory as logic and reasoning, it seems appropriate to look for potential guarantees which could be transferred over. In oversight leagues, game theory and provable robustness feel like potential candidates for sources of those warm and fuzzy guarantees. For the present one, the following concepts seem relevant to look into further:
soundness and completeness as ways of relating inference and entailment in meaningful ways,
forward and backward chaining as ways of driving the development of the knowledge base in relation to the target of verification,
proof systems associated with many-valued logics (e.g. fuzzy logic) as ways of navigating the messy nature of natural language inference.
Discussion
Can’t the ML model being trained game the IIE?
Given a fixed knowledge base, any given agent of increasing capability will at some point be able to game the IIE. However, the way this proposal handles gaming is by actively patching loopholes in the input deontology over time. By constantly making the objective associated with the IIE more nuanced, the model-in-training would find it more and more difficult to exploit blindspots in the aggregate system, especially in the propositions. While in oversight leagues the locus of objective adaptability was the evaluator as an ML model being constantly fine-tuned, in the present proposal it is the constantly expanding knowledge base. Whether or not the inference algorithm can fulfill its goal of refining estimates of human values becomes a crux.
However, there are also static components involved in the present proposal. The various inference algorithms and entailment verifiers proposed above are quite fixed. Not only are they fixed in algorithmic flow (e.g. repeatedly Babble & Prune), but also in the frozen ML models they involve (e.g. pretrained NLI models). This leaves some fixed attack surface in the proposal. It’s unclear whether adaptability at the level of the knowledge base is enough to keep the whole IIE ahead of gaming, given that two of its three components are static. Future variants of IIE might involve learning the inference algorithm and the entailment verifier as subtasks.
What if the ML model recursively self-improves and overtakes IIE capabilities?
The model-in-training might plausibly exhibit RSI, especially given high situational awareness. The only defense of the present proposal against this threat is to match the improvement with an analogous one for the IIE, which currently can only happen through the inference algorithm extending the knowledge base and thus refining the objective. Heuristics for guiding the expansion towards propositions relevant to the current behavior of the current model (e.g. counterfactual cross-validation) might help better align the two improvements. I just realized that the very consideration of such heuristics as useful runs against the “Linear Capability Ordering” assumption, leading to inconsistencies in the present version of IIE.
What if the ML model tampers with the IIE?
The model-in-training might plausibly try to hack into the server farm which would run the IIE system and hijack its reward center. The “Avoidable Consequentialist Frenzy” assumption invokes other lines of work to protect against rampages in terms of the main objective, as opposed to the auxiliary objective implemented through an IIE. However, this tampering scenario highlights the fact that tampering with the deontology can be a consequentialist move in itself, without even taking the main objective into consideration. This makes it clear that the same mechanisms invoked for stabilizing the main objective (e.g. myopia, impact measures) should be used to protect against rampages in the deontological objective. This hints at the awkward need to supplement the deontological refinement carried out by the inference algorithm with such more general mechanisms.
However, the IIE itself might help enforce behavioral norms which radically discourage the model-in-training from tampering with the IIE, without the need for employing other tricks. Alternatively, running the IIE step-by-step, exactly as explicitly implemented by humans, might be made a goal in itself. Any slight divergence from running the authentic IIE line by line would result in an infinitely large penalty. Goal preservation becomes an explicit goal, but then what if this meta-goal of not tampering with the original goal itself becomes the object of tampering? Adding a goal preservation goal for the first goal preservation goal only shifts the problem upwards one more step. There’s a need for folding this meta chain in on itself somehow.
As another approach to tackling wireheading, one might consider the fact that the IIE itself can be fully deterministic. Even the components which involve LLMs decoding stochastically can fix a seed for the PRNG and get perfect reproducibility. This means that the IIE can be run on different sealed systems and the hash of the expanded knowledge base(s) can be reported. Inter-annotator agreement schemes would then be employed to detect tampering. However, if the risks associated with hacking one IIE replica are correlated with the risks of hacking the others, then the added benefit of replication collapses. Perhaps increasingly well-protected instances could be brought together to check for consensus, where the model hacking some of them would trigger warnings. But then the model would hold off until being capable of hacking all of them, side-stepping disagreement warnings entirely — all instances would be hacked, preserving consensus.
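A minimal sketch of that replication check, assuming each sealed replica reports only its expanded knowledge base (or, in practice, just a hash of it):

```python
import hashlib
from typing import List

def kb_fingerprint(kb: List[str]) -> str:
    """Deterministic hash of an expanded knowledge base; identical seeds and
    identical IIE runs should yield identical fingerprints."""
    return hashlib.sha256("\n".join(kb).encode("utf-8")).hexdigest()

def consensus_check(replica_kbs: List[List[str]]) -> bool:
    """True if every sealed replica reports the same knowledge base;
    any mismatch flags possible tampering."""
    return len({kb_fingerprint(kb) for kb in replica_kbs}) == 1
```

As noted above, this only catches partial compromises: if every replica gets hacked in the same way, consensus is preserved.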
What if the knowledge base grows unwieldy in size?
For this one, I’ll use a reply similar to the one given for alignment tax in oversight leagues. The knowledge base would indeed grow over time to stay ahead of the model. Any less growth might not be enough to keep the deontological objective implemented through the IIE from being gamed.
However, the knowledge base would just be text. A lot of text, but just text. The bigger concern is whether the computational cost of the inference algorithm and the entailment verifier grows with the knowledge base. For most choices listed above, the complexity is linear with respect to the size of the knowledge base. There might even be ways of reducing it to sublinear, with the two components first filtering the knowledge base so as to only work with a relevant subset of propositions, say through domain-specific heuristics (e.g. counterfactual cross-validation) or general measures of similarity. This brings in new issues, though. A detailed analysis of the complexity of the various choices for ⊢ and ⊨ doesn’t feel like a priority — there are way bigger issues with this proposal than the estimated alignment tax.
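One way to cash out the “relevant subset” idea is a similarity-based retrieval step placed in front of ⊢ and ⊨, sketched below with a hypothetical embed sentence-embedding wrapper:

```python
from typing import Callable, List
import numpy as np

def relevant_subset(
    kb: List[str],
    query: str,                                  # e.g. the textual description of the behavior under verification
    embed: Callable[[str], np.ndarray],          # placeholder: sentence-embedding wrapper
    k: int = 64,
) -> List[str]:
    """Return the k propositions most similar to the query, so that the inference
    algorithm and entailment verifier only pay for a small slice of the KB."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    scores = []
    for proposition in kb:
        v = embed(proposition)
        scores.append(float(np.dot(q, v / np.linalg.norm(v))))  # cosine similarity
    top = np.argsort(scores)[-k:][::-1]          # indices of the k highest-scoring propositions
    return [kb[int(i)] for i in top]
```

Whether the retrieved slice is actually the relevant one is exactly the kind of new issue mentioned above.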
Doesn’t this proposal tacitly assume moral absolutism?
The seed knowledge base would be finite. It would define the deontology which the inference algorithm then refines and the entailment verifier enforces. While humans could choose what goes into the knowledge base democratically, it’s unclear whether the result would be anything but completely contradictory. It might also be possible to have multiple seed deontologies expanded in parallel which get a say in the model’s behavior proportional to the number of moral patients subscribed to said deontology. While all this would be nice, enforcing multiple deontologies in parallel is many OOMs harder, as a technical challenge, than enforcing a single one. Hence, the proposal (and potential future variants) will likely focus on “just” enforcing one.
What if assumption 1 (“Small Seed To Big Tree”) doesn’t hold?
Then it’s impossible to get an arbitrarily accurate estimate of human values with arbitrary amounts of compute involved in modeling. There’s just not enough information to help disambiguate the seed into what humans truly want. This appears similar to image superresolution or related upscaling tasks, where there are multiple high-resolution images which can map to the same low-resolution one. The upscaler is forced to choose one among an infinitely large set of valid items which downscale to its input. There is simply not enough information to pinpoint the right one. Following a transition function like the laws of physics to grow a tree from a seed turns out not to be enough here.
In this case, we might be forced to rely on the existence of one disambiguation without having a guarantee of its uniqueness. With this stunted IIE, we are still able to find a deontology which successfully “downscales” to the input we specified, but which isn’t exactly the one we possess. However, the fact that many choices of inference algorithm build on mountains of implicit knowledge about humans stored in LLMs leaves us better off than an arbitrary choice among the successfully-downscalable options. We can channel autoregressive coherence and NLI as nudges towards generally better choices, despite the best one remaining improbable.
What if assumption 2 (“Linear Capability Ordering”) doesn’t hold?
Then the model-in-training might game the IIE in some particular way, despite generally being less advanced overall. The IIE might then be forced to go over the top and make sure it’s far more advanced in general, so as to reduce the chances of being gamed in any particular way. In other words, if the skill profiles of the two on a radar chart exhibit many spikes, making the IIE’s profile much larger overall helps reduce the number of dimensions on which the model is ahead. Unfortunately, this increases the alignment tax and loses the guarantees of a hard ordering.
What if assumption 3 (“Quantifiable Propositional-Behavioral Gap”) doesn’t hold?
Then we can’t implement a valid entailment verifier to help gauge the compatibility of behaviors with the propositional knowledge base. One might wonder: Why go to the trouble of making the knowledge base propositional in nature only for the entailment verifier to later be tasked with relating it to behaviors? Why not make the knowledge base behavioral in the first place?
There might be some value to this approach. The knowledge base would contain behaviors introduced by humans, rather than propositions. The job of the entailment verifier would be simpler, as the knowledge base and the target of verification would share the same modality. However, it’s unclear how behaviors can be said to support other behaviors, except by means of interpolating precedents. In contrast, the notion of propositional premises supporting a hypothesis is somewhat better defined. That said, the line between an action sequence and a subword sequence itself gets blurry once you consider agent simulacra manifesting in a linguistic universe, complete with a transition rule resembling the laws of physics. Most work around RLHF also reframes the subwords available to an LLM as possible actions, its parameters defining a policy which guides how to act in different contexts. The distinction approaches a moot point.
Relatedly, one might wonder: Why go to the trouble of having humans translate their implicit values in language, when there are a host of neuroimaging techniques available? Why not make a knowledge base of neural dynamics, possibly reduced in dimensionality to some latents?
Similar to the awkwardness of inferring new valid behaviors from past behaviors, inferring new valid thought patterns from past ones is very ill-defined. Barring all the limitations of current neuroimaging techniques in terms of spatial and temporal resolution, cost, portability, etc., it’s unclear how to implement a compatible inference algorithm, except perhaps for the rudimentary Babble. However, the entailment verifier wouldn’t face a more difficult challenge than in the propositional-behavioral setup, as it would need to bridge the neural-behavioral gap instead, using multi-modal techniques.
What if assumption 4 (“Avoidable Consequentialist Frenzy”) doesn’t hold?
It was an honor to serve with you, have a nice timeline!
Are IIEs restricted to prosaic risk scenarios?
Although IIEs have been motivated by prosaic work, the proposal is entirely agnostic to the source of the behaviors to verify. In other words, even if the IIE would be built on a prosaic stack, the AI whose behavior should be aligned might be built on a different stack, given only that it is capable of optimizing for something (e.g. the deontological objective).
That said, even the IIE itself might run on a different stack. Case in point, IIEs have been inspired by a symbolic GOFAI stack supporting expert systems, parts of which have been replaced here with ML. This makes it plausible for other approaches to be able to populate the modular framework.