Deconfusing “ontology” in AI alignment
Epistemic status: quite uncertain. The idea that ontologies are inadequate feels like a hundred-dollar bill on the floor of Grand Central Terminal. I’m also fairly sure that I’m constructing a strawman, but it’s probably good for someone to be pedantically skeptical about ontologies.
Ontology within the parlance of AI alignment is a term which formalizes the set of objects, categories, actions, etc., over which minds reason. The idea presupposes a theory of mind (indirect realism) in which rather than directly interfacing with the environment around us, the mind instead takes in sensory data and constructs an internal ontology. The actions and predictions of a cognitive system are then functions of the latent objects in the ontology.
Ontologies are a useful concept in agent modeling because an agent’s utility function and beliefs are easily definable over their mental objects. The elicitation of an agent’s ontology is also an ambitious goal of interpretability, since it would provide a clean way to represent a neural network as a decision-theoretic agent. The field of AI alignment particularly deals with the problems of ontology identification and ontology mismatch: ontology identification is the problem of eliciting the ontology from a neural network (or other cognitive system), while ontology mismatch is the problem of translating human concepts and values to the ontology of the neural network.
This post seeks to challenge the usefulness of ontologies in framing the way artificial minds operate, specifically within the AI alignment literature. It seems like a lot of researchers are working on ontology identification, but I haven’t seen any posts that make a compelling case for why we should expect ontologies to emerge in the first place. If you’re unfamiliar with ontologies, it may behoove you to skim a couple of posts under the Ontology tag on LessWrong before continuing.
Defining ontologies
In this section, I present a (hopefully impartial) natural-language description of ontologies based on a couple of definitions from across LessWrong. It’s difficult to create an adequate formalism for ontologies, in large part because they are intuited through our own conscious experience rather than based on any empirical data. This is the definition on which I build a mathematical formalism later in this section:
An ontology is the system containing the functions, relations, and objects used in the mind’s internal formal language.
This definition also hints at the idea that each ontology has a corresponding reasoner that uses the formal language. The reasoner is where the beliefs and values of the system live, and it generates actions based on perceptual input. I’ll use the term biphasic cognition to refer to the theory of mind in which cognition can be represented as an abstraction phase, where input data is expressed in the ontology, followed by a reasoning phase, where the data in ontology-space is synced with beliefs and values concerning the environment and then an action is selected. It seems to me like biphasic cognition is the implicit context in which researchers are using the word “ontology,” but I’ve never seen it explicitly defined anywhere so I’m not certain.
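The two phases described above can be made concrete with a toy sketch. Everything in it is a hypothetical illustration rather than a claim about any real system: the pixel-list input, the two latent objects (`is_daytime`, `obstacle_ahead`), the thresholds, and the action names are all invented for this example.

```python
from typing import Dict, List

# Hypothetical toy types: raw input is a list of pixel brightnesses in [0, 1],
# and "ontology-space" is a dict of named latent objects.
Symbols = Dict[str, bool]


def abstraction_phase(pixels: List[float]) -> Symbols:
    """Express raw input in a (made-up) ontology of two latent objects."""
    mean = sum(pixels) / len(pixels)
    return {
        "is_daytime": mean > 0.5,
        "obstacle_ahead": max(pixels) - min(pixels) > 0.6,
    }


def reasoning_phase(symbols: Symbols) -> str:
    """Select an action using rules stated only over the latent objects."""
    if symbols["obstacle_ahead"]:
        return "stop"
    return "drive" if symbols["is_daytime"] else "drive_with_headlights"


def cognition(pixels: List[float]) -> str:
    """Biphasic cognition: abstraction first, then reasoning."""
    return reasoning_phase(abstraction_phase(pixels))
```

Note that `reasoning_phase` never touches the pixels; its beliefs and values are defined purely over the latent objects, which is the property the biphasic picture relies on.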
With a more formal framing, we can define the ontology of a cognitive system $F$ as a function $O$ such that there exists a reasoning function $R$ such that $R$ satisfies the “reasoner conditions” and $F = R \circ O$. There are two interpretations of this, a looser one and a stricter one:
1. Any cognitive system $F$ can merely be represented as $R \circ O$, meaning the computational path of $F$ does not necessarily look like the composition of $O$ and $R$.
2. Any computable cognitive system $F$ can be broken down as $F = f_n \circ f_{n-1} \circ \cdots \circ f_1$, where each $f_i$ is a basic arithmetic operation. Let $O$ be an ontology of $F$ iff, for some $k$, $O = f_k \circ \cdots \circ f_1$ and $R = f_n \circ \cdots \circ f_{k+1}$.
This distinction is important because (2) would imply the ontology is already present in the program, and we just need to find the cutoff $k$. Definition 1 encapsulates a greater number of systems than Definition 2. For Definition 2 especially, this picture is intuitively how I think about biphasic cognition:
For a function to be a reasoner, it must be isomorphic to some process that looks like symbolic reasoning, like a Bayesian network or Markov logic network. By isomorphic I mean that there exists a bijective mapping between the parameters of the reasoner and the parameters of a corresponding symbolic reasoning function with identical output. I don’t think there are any formal demands to place on ontologies, since an ontology just transmutes the input information. Surely some ontologies are better than others, in that they more efficiently encode information or keep more relevant information, but those are demands on quality, not ontology-ness.
Both Definition 1 and Definition 2 square pretty well with how we think about biphasic cognition in humans, since the idea of ontologies in the first place is based on our conscious thought. Definition 2 is a bit harder to match up, since we don’t have the source code for our brains, but from our own conscious experience we infer that the intermediate step between $O$ and $R$ actually seems to occur (as in, not only can humans be represented as $R \circ O$, but it seems like this is the actual “computational” path our brains follow).
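As a rough illustration of the stricter definition, here is a minimal sketch in which a “program” is a list of primitive steps and a candidate ontology is whatever has been computed after the first k of them. The four steps in `STEPS` and the helper names `compose` and `split_at` are assumptions made up for this example, and the sketch deliberately does not check the “reasoner conditions”; it only shows the mechanical cut.

```python
from functools import reduce

# A hypothetical program, listed in execution order (first step first).
STEPS = [
    lambda x: x + 3,
    lambda x: x * 2,
    lambda x: x - 1,
    lambda x: x * x,
]


def compose(steps):
    """Return the function that applies `steps` one after another."""
    return lambda x: reduce(lambda acc, f: f(acc), steps, x)


def split_at(steps, k):
    """Cut the computational path after the first k steps.

    Returns (O, R): O runs steps 1..k, R runs steps k+1..n.
    """
    return compose(steps[:k]), compose(steps[k:])


F = compose(STEPS)

# Every cutoff reproduces F, whether or not R "looks like" a reasoner.
for k in range(1, len(STEPS)):
    O, R = split_at(STEPS, k)
    assert R(O(2)) == F(2)
```

The loop passing for every k is the uninteresting part. The substantive question under Definition 2 is which cutoffs, if any, leave a second half that satisfies the reasoner conditions.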
Reasons why biphasic cognition may be incorrect
Assuming that my framing of biphasic cognition and ontologies represents how others view the subject, here are some reasons why it might be incorrect or incomplete, specifically when applied to modern neural networks.
Biphasic cognition might already be an incomplete theory of mind for humans
Biphasic cognition is just a model of cognition, and not one that provides much predictive power. The motivation is based on how we think our minds operate from the inside. Neuroscientists don’t seem to really know where or how conscious thought occurs in the first place, and some philosophers think that consciousness is entirely an illusion. This is mostly to say that the existence of doubts around the realness of conscious/symbolic thought in humans should entail even larger a priori doubts about the emergence of conscious/symbolic thought in artificial minds.
This is especially important because some approaches to alignment assume that our own ontology and reasoning can be elicited and formally represented. In the ELK report, for example, much of the approach to eliciting knowledge involves translating thoughts from the Bayes net of an AI system to the Bayes net of a human. If the framing of biphasic cognition which we’ve pieced together from our conscious experience does not actually reflect what is happening mechanistically, then we’re in a pickle.
Biphasic cognition lacks empirical evidence for generalization
Even if biphasic cognition is a good model for human cognition, there is little empirical evidence thus far to suggest that the framing translates cleanly to other minds. And most critically, the theory itself lacks predictive power, making it susceptible to methodological traps: one is reminded of how the theory of phlogiston was dominant among chemists in the 18th century. Phlogiston arose from natural philosophers trying to unite the Aristotelian element of fire with the budding principles of chemistry. Because the theory was still being developed, observations were explained by augmenting the properties of phlogiston. These claims were not challenged because it was assumed phlogiston existed and just needed its properties to be formalized. Chemists didn’t realize that they were viewing combustion through an inadequate paradigm.
Within reinforcement learning, some architectures are built with world-modeling in mind, either via directly constructing a POMDP or by separately training a world model in an unsupervised/self-supervised fashion and then training an agent to interact with the representation created by the world model (here and here). The existence and success of these architectures do not contradict my main point: that clear-cut biphasic cognition probably won’t emerge naturally. Furthermore, these architectures are not robust against the many-ontologies framing I detail below.
Speculation
No ontologies, just floating abstractions
Depending on how stringent our requirements for a function to be a “reasoner” are, we may end up unable to represent neural networks with biphasic cognition. In this case, abstractions exist as subfilters of information spaced along the entire filter that is the neural network rather than localizing as a single ontology.
Below is a diagram of increasingly-detailed feature maps in a CNN (note: this widely-circulated image probably wasn’t sourced from a real CNN). I think that while it’s tempting to think that the feature assembly stops after the high-level features are constructed and that any subsequent computation is more similar to logical inference, it’s more likely that the “reasoning” occurs via the same mode of thought as the feature assembly if the subsequent computation is still occurring within convolutional layers. By this I mean that we picture our conscious selves as interfacing with the data at this point, but with neural networks it seems like the data just keeps getting filtered and abstracted through weird functions of latent variables.
Many ontologies
On the other hand, we might end up with very strong representation techniques, to the point where we’re able to decompose neural networks into ontologies and reasoners. A potential issue in this case is that there may exist multiple ontology-reasoner decompositions of a neural network. In Definition 1 of biphasic cognition stated above, this would look like the existence of multiple pairs $(O, R)$ where the outputs of the ontologies are not isomorphic to each other, while for Definition 2 it would look like the existence of multiple points in the computational path at which the ontology and reasoner can be delineated.
I’ll note that this possibility doesn’t dismiss the existence of ontologies, but it does run contrary to the typical framing of ontologies, and it wasn’t addressed in any of the papers and LessWrong posts I read while doing research for this post. My thoughts on multiple ontologies are outside the scope of this post, but I hope to investigate them later. One concern I do have, though, is that if the math ends up too strong we could just represent $O$ and $R$ for any $k$, which would call into question the usefulness of the ontology framing; at that point, it almost reverts to the floating abstraction picture.
This issue might also just be a problem with the formalism I’ve laid out, but this is all I have to say on the topic right now.
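To make the worry under the looser definition concrete, here is a small sketch of one function with two ontology-reasoner decompositions whose intermediate representations carve up the inputs differently. The parity function and both decompositions are hypothetical examples chosen for simplicity, and “not isomorphic” is read here in a weak sense: neither ontology’s output can be computed from the other’s.

```python
def F(xs):
    """The full system: parity of the sum of a tuple of integers."""
    return sum(xs) % 2


# Decomposition A: the ontology keeps the total, the reasoner takes its parity.
def O_a(xs):
    return sum(xs)


def R_a(z):
    return z % 2


# Decomposition B: the ontology keeps each element's parity,
# the reasoner combines those parities.
def O_b(xs):
    return tuple(x % 2 for x in xs)


def R_b(z):
    return sum(z) % 2


# Both pairs reproduce F on these inputs.
for xs in [(1, 2), (2, 1), (3, 4), (0, 0)]:
    assert R_a(O_a(xs)) == F(xs) == R_b(O_b(xs))

# O_a merges (1, 2) and (2, 1) while O_b separates them ...
assert O_a((1, 2)) == O_a((2, 1)) and O_b((1, 2)) != O_b((2, 1))
# ... and O_b merges (1, 2) and (3, 4) while O_a separates them.
assert O_b((1, 2)) == O_b((3, 4)) and O_a((1, 2)) != O_a((3, 4))
```

Since each ontology distinguishes inputs that the other collapses, there is no translation between the two latent spaces in either direction, even though both support an exact reasoner for the same behavior.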
A note on the gooder regulator theorem
One result that I came across while searching for representation theorems concerning ontologies was the gooder regulator theorem, which shows that an optimal “regulator” (synonymous with “agent” or “cognitive system” in this context) will, given some conditions, reconstruct a world model from its training data that is isomorphic to the Bayesian posterior of what the world should look like given the input data. A stronger version of this statement would invalidate most of the points I make in this post, but I think that the theorem is actually wholly inapplicable to real AI systems.
In John Wentworth’s original post, he provides us with the following setup: imagine you have a regulator consisting of a model function $M$ and an output function $R$, which acts within system $S$ to optimize some target $Z$. The regulator is first given $X$, training data containing only the set of variables it can observe from the system. It can only store information about $X$ in the model function $M$. Then, the regulator is provided with “test” data $Y$ along with an optimization problem (“game”) to solve for each item in $Y$.
The idea is that if $M$ minimizes information retained about $X$ and also performs optimally on tasks and data in $Y$, then the output of $M$ is isomorphic to the Bayesian posterior distribution of the state of $S$ given the input. Thus, the optimal regulator literally reconstructs an optimal model of the system state given a noisy (or noiseless) input. John’s proof can be found here.
While the math for the theorem is sound, its notion of regulators generally does not line up with neural networks. First, the learned parameters are limited to $M$, meaning that $R$ can’t retain any information about $X$. This means that the parts of a neural network optimized via backpropagation (i.e. all of it, usually) are under the umbrella of $M$, so this theorem doesn’t show anything about optimal world models arising inside of neural networks, only that the output will be optimal if used as the encoder in some larger system. Additionally, the regulator is forced to optimize over an arbitrarily large set of games, which means that it can’t afford to discard any information in $X$, so the model $M$ must be a lossless compression of $X$. Real neural networks are not trained across such games, however, so the system’s posterior distribution given $X$ will not be selected for, even in models which are optimal over their test data.
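For intuition about what it means for a model to be “isomorphic to the Bayesian posterior,” here is a toy sketch. It is not Wentworth’s construction and does not model the games at all; it only illustrates the isomorphism notion. The coin with two possible biases, the uniform prior, and the helper `same_partition` are all assumptions invented for this example.

```python
from fractions import Fraction
from itertools import product

# Hypothetical system state: a coin whose bias is either 1/5 or 4/5.
BIASES = (Fraction(1, 5), Fraction(4, 5))
PRIOR = {s: Fraction(1, 2) for s in BIASES}


def posterior(flips):
    """Exact Bayesian posterior over the bias given observed flips (1 = heads)."""
    heads = sum(flips)
    tails = len(flips) - heads
    joint = {s: PRIOR[s] * s**heads * (1 - s) ** tails for s in BIASES}
    total = sum(joint.values())
    return tuple(joint[s] / total for s in BIASES)


def count_model(flips):
    """A candidate model that keeps only the number of heads."""
    return sum(flips)


def same_partition(f, g, inputs):
    """True iff f and g merge exactly the same pairs of inputs."""
    return all((f(a) == f(b)) == (g(a) == g(b)) for a in inputs for b in inputs)


ALL_FLIPS = list(product((0, 1), repeat=3))

# The head count induces the same partition of inputs as the posterior,
# so the two are in one-to-one correspondence on this input set.
assert same_partition(count_model, posterior, ALL_FLIPS)
# Storing the raw flips keeps strictly more than the posterior does.
assert not same_partition(lambda flips: flips, posterior, ALL_FLIPS)
```

In this toy case the head count is a model that retains exactly as much as the posterior and no more, while the raw flip sequence is lossless but not minimal. The critique above is about whether anything in real training pressures a network toward the former.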
Recap/TL;DR
In this post, I present a natural-language definition of ontologies and use the definition to construct two mathematical formalisms of ontologies that provide a specific picture of what an ontology in a neural network could look like. I then argue that the motivation for modeling neural networks as using ontologies in the first place is flawed, and briefly sketch two alternative views of how neural networks might abstract their environment. I finish with a brief note on the gooder regulator theorem where I explain why it isn’t particularly useful.
This isn’t what I usually picture, but I like it as a simplified toy setup in which it’s easy to say what we even mean by “ontology” and related terms.
Two alternative simplified toy setups in which it’s relatively easy to say what we even mean by “ontology”:
For a Solomonoff inductor, or some limited-compute variant of a Solomonoff inductor, one “hypothesis” about the world is a program. We can think of the variables/functions defined within such a program as its “ontology”. (I got this one from some combination of Abram Demski and Steve Petersen.)
Suppose we have a Bayesian learner, with its own raw sense data as “low-level data”. Assume the Bayesian learner learns a generative model, i.e. one with a bunch of latent variables in it whose values are backed out from sense data. The ontology consists of the latent variables, and their relationships to the sensory data and each other.
The main thing which I don’t think either of those toy models makes sufficiently obvious is that “ontology” is mostly about how an agent factors its models/cognition.
When alignment researchers talk about ontologies and world models and agents, we’re (often) talking about potential future AIs that we think will be dangerous. We aren’t necessarily talking about all current neural networks.
A common-ish belief is that future powerful AIs will be more naturally thought of as being agentic and having a world model. The extent to which this will be true is heavily debated, and gooder regulator is kinda part of that debate.
Nothing wrong with an incomplete or approximate theory, as long as you keep an eye on the things that it’s missing and whether they are relevant to whatever prediction you’re trying to make.
I see most of the work you describe as extra abstractions for reasoning about ontologies, layered on top of the basic thing that ontologies are.
So what is ontology fundamentally? Simply the categorization of the world, telling apart one thing from another. Something as simple as a sensor that flips the voltage on an output wire high or low based on whether there’s more than X lumens of light hitting the sensor is creating an ontology by establishing a relationship between the voltage on the output wire and the environment surrounding the sensor.
Given ontology can be a pretty simple thing, I don’t know if folks are confused about ontology so much as perhaps sometimes confused about how complex an ontology they can claim a system has.
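The light-sensor example above is small enough to write down directly. This is only a sketch: the function name and the default threshold of 100 lumens are hypothetical stand-ins for the unspecified X.

```python
def light_sensor(lumens: float, threshold: float = 100.0) -> str:
    """Map an environmental quantity to one of two output-wire states.

    The threshold value is a hypothetical placeholder for "X lumens".
    """
    return "HIGH" if lumens > threshold else "LOW"
```

On this view the sensor’s entire “ontology” is the two-way split between environments above the threshold and environments at or below it.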