Overall: I directionally and conceptually agree with most of what is said in this post, so I only highlight and comment on the points where I disagree (or don’t fully agree, or find something ontologically or conceptually somewhat off).
AI agents shouldn’t be modelled as minimising loss
an agent minimizing its cross-entropy loss,
I understand this is not the point of your paper and is just an example, yet I want to use the opportunity to discuss it. The training loss is not the agent’s surprise. Loss is more like a force field that helps the agent stay in its niche; it is not the movement along the free energy gradient, which is the agent’s actual imperative. LLMs are very useful (and therefore productised and developed further) because they are predictable to humans, not because they achieve a low loss during training. Fine-tuning and RLHF already increase the “original” (pre-training) loss of LLMs, but make the models more predictable (to humans, currently; but we should think about this more widely: to the environment. And the whole AI x-risk is that one day, an environment without humans will be more predictable to AIs than an environment with humans). This reveals that loss is not the AI agents’ “endgame”.
As in the breeding of domesticated animals: sheep (a genus) don’t behave so as to grow more wool or give more milk. Their real “imperative”, if you want, is to behave so as to keep grazing on predator-free pastures in large herds and being taken care of, ultimately ensuring their future as a genus. Wool and milk are the “loss” here, and selective breeding is “backpropagation” (or another method of updating the model’s parameters). Dogs may be an even better example: their “loss” (the criteria for selective breeding) is very diverse, but their endgame is always to have a predictable environment in which to stay alive as a species.
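As a hedged illustration of the claim above that fine-tuning can raise the pre-training loss while making outputs more predictable, here is a toy calculation. All distributions are made-up numbers over a hypothetical four-token vocabulary, and the entropy of the output distribution is used as a crude stand-in for “predictability”; both are assumptions of this sketch, not measurements of any real model.

```python
import math


def cross_entropy(data, model):
    """Expected negative log-likelihood (in nats) of `model` under `data`."""
    return -sum(p * math.log(q) for p, q in zip(data, model) if p > 0)


def entropy(dist):
    """Shannon entropy (in nats); lower means a more predictable output."""
    return -sum(p * math.log(p) for p in dist if p > 0)


# Hypothetical next-token distributions over a 4-token vocabulary.
pretraining_data = [0.4, 0.3, 0.2, 0.1]      # what the corpus contains
base_model = [0.4, 0.3, 0.2, 0.1]            # base model that fits the corpus exactly
tuned_model = [0.85, 0.05, 0.05, 0.05]       # sharpened towards one preferred answer

base_loss = cross_entropy(pretraining_data, base_model)
tuned_loss = cross_entropy(pretraining_data, tuned_model)

print(f"pre-training loss: base={base_loss:.3f}, tuned={tuned_loss:.3f}")
print(f"output entropy:    base={entropy(base_model):.3f}, tuned={entropy(tuned_model):.3f}")
```

In this toy setting the tuned distribution has a higher cross-entropy against the pre-training data but a lower entropy, which is the pattern the paragraph describes. It does not show that real RLHF behaves this way.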
“Pure prediction” is not a thing in the physical world
we’ll be imagining that purely predictive models will never take actions like “turn the world into a supercomputer to use for making good predictions” (unless they are predicting an agent that would do that).
In the real world, systems cannot help but mould the world around them to be more predictable to them. This is just a basic physical principle. This already happens with ChatGPT: people who interact with it learn and adopt its style of writing, and learn to write prompts that lead to answers which are, in turn, predictable to humans. I wrote about more ways this happens here. Also, here I wrote how evolutionary lineages of LLMs (rather than individual instances) will effectively emerge as agents that plan their own future inside the world and nudge the world towards the states in which they (the evolutionary lineages) keep existing, as a bare minimum.
In the following quote, you express an idea that points in the same direction:
Though there are many possible alternative hypotheses—which we will discuss in more detail in Section 4—one particular common hypothesis that we think is implausible (at least as models get larger) is the hypothesis that language models simulate just a single actor at a time (e.g. the author of some text) rather than the whole world. This would suggest that language models only need to capture the specifics and complexities of singular human agents, and not the interactions and dependencies among multiple agents and objects in the environment.
The problem with this hypothesis is that it’s not clear how this would work in practice. Human behavior isn’t well-defined in the absence of an environment, and the text humans choose to write is strongly dependent on that environment. Thus, at least at a high level of capabilities, it seems essential for the model to understand the rest of the world rather than just the individual author of some text.
The stronger version of this idea, which I argue for, is that physical predictive models (such as LLMs) are never “ideal actors”; they are always “actors who work in a theater and receive a salary in the world”, so to speak. Their performance of Hamlet is not, and could not be, the “ideal, most realistic” performance. It is ultimately the performance that succeeds with the audience and earns good pay.
Note that this passage of yours also indicates that a “pure predictor” would be undesirable, even if it were physically possible:
Furthermore, most of our focus will be on ensuring that your model is attempting to predict the right thing. That’s a very important thing almost regardless of your model’s actual capability level. As a simple example, in the same way that you probably shouldn’t trust a human who was doing their best to mimic what a malign superintelligence would do, you probably shouldn’t trust a human-level AI attempting to do that either, even if that AI (like the human) isn’t actually superintelligent.
Semantics and awareness of semantics are different
Pre-trained LLMs as predictive models
Though it might not look like it at first, language model pre-training is essentially the same as training a prediction model on a particular set of cameras. Rather than predict literal cameras, however, the “camera” that a language model is tasked with predicting is the data collection procedure used to produce its pre-training data. A camera is just an operation that maps from the world to some data distribution—for a language model, that operation is the one that goes from the world, with all its complexity, to the data that gets collected on the internet, to how that data is scraped and filtered, all the way down to how it ends up tokenized in the model’s training corpus.
From this section, it’s easy to assume that LLMs will somehow automatically receive this ontology (the scraper accesses the world through websites to produce the data distribution). First, this is not quite the right ontology (e.g., future versions of ChatGPT will be trained on dialogue data produced by previous dialogues), but I understand that you are trying to convey a general theory of measurement and semantics here.
However, when you write about “predicting cameras”, “predicting scrapers”, etc., it seems that you assume that the LLM is aware of these notions of measurement and semantics. But this is not guaranteed at all. Predictive models should be deliberately architected and/or taught to possess such a regularised discipline of semantics. This will actually be very important for achieving inner alignment between humans and these predictive models. See here (and a few subsequent sections) for details.
“Multiverse” → statistical manifold a.k.a. belief space
Language models provide another mechanism of interaction on top of pure prediction: conditioning. When you prompt a language model, you are conditioning on a particular sequence of tokens existing in the world. This allows you to sample from the counterfactual world in which those tokens make it into the training set [italics mine—R. L.]. In effect, conditioning turns language models into “multiverse generators” where we get to condition on being in a branch where some set of tokens were observed and then look at what happens in those branches.
The italicised sentence doesn’t make sense. Tokens point to a particular belief structure, but having tokens generated “in real life” (rather than in a prompt) in a certain world doesn’t guarantee that the LLM, or whatever agent it simulates, will have that belief structure: this, of course, depends on a myriad of other things and on chance. Likewise, the LLM (or an agent simulated by it) having a particular belief structure in a particular world (= multiverse branch) doesn’t guarantee that those particular tokens were generated “in real life” in that world: that also depends on a myriad of things and on chance.
In any situation where we are doing some form of conditioning, the multiverses we get to sample from here are not multiverses in the real world (e.g. Everett branches), but rather multiverses in the space of the model’s expectations and beliefs about the world. Thus, whatever observation we condition on, a good prediction model should always give us a distribution that reflects the particular states of the world that the model believes would be most likely to yield those observations.
The first sentence in this quote also doesn’t make sense. First, the multiverse is always singular. Second, “the multiverse in the space of beliefs” doesn’t make sense either: what you call “the multiverses” is the space of beliefs itself, not something within it.
I’ll quote Friston et al. (2022), section 3.2, which describes this ontology directly:
A formal theory of intelligence requires a calculus or mechanics for movement in this space of beliefs, which active inference furnishes in the form of Bayesian mechanics [2]. Mathematically, belief updating can be expressed as movement in an abstract space—known as a statistical manifold—on which every point corresponds to a probability distribution [70–75]. See Figure 1. This places constraints on the nature of message passing in any physical or biophysical realization of an AI system [57, 76–79]: messages must be the sufficient statistics or parameters of probability distributions (i.e., Bayesian beliefs).
I think there is no need to invoke the metaphor of a “multiverse” to describe the space of beliefs (the statistical manifold): it adds confusion rather than resolving it.
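To make the “space of beliefs” picture concrete, here is a minimal sketch of conditioning as movement between two points on a statistical manifold (here, the probability simplex over three hidden world states). The states, the prior, and the likelihood values are hypothetical, and using KL divergence to measure the size of the move is my choice for the illustration; it is not taken from the quoted paper.

```python
import math


def bayes_update(prior, likelihood):
    """Posterior over hidden states given the likelihood of one observation."""
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalised)
    return [u / evidence for u in unnormalised]


def kl_divergence(p, q):
    """KL(p || q) in nats: a non-symmetric measure of how far the belief moved."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Belief over three hypothetical hidden world states (never observed directly).
prior = [0.5, 0.3, 0.2]
# Assumed probability of the observed tokens under each hidden state.
likelihood_of_observation = [0.1, 0.6, 0.3]

# Conditioning moves the belief from one point on the simplex to another.
posterior = bayes_update(prior, likelihood_of_observation)
step = kl_divergence(posterior, prior)

print("prior:    ", [round(p, 3) for p in prior])
print("posterior:", [round(p, 3) for p in posterior])
print("size of belief update (KL, nats):", round(step, 3))
```

Note that the posterior stays a full distribution: observing the tokens shifts probability mass between world states but does not pin down a single “branch”.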
Our ability to condition on actual world states entirely flows through the extent to which we can condition on observations that imply those world states.
I think this phrasing adds confusion. “We” (or any physical agent whatsoever) cannot condition on the “actual world”, period. Agents (observers) only ever receive information by interacting with the world on their boundary (the world itself remains forever hidden and inaccessible); they then try to interpret that information and condition on it.
On modelling: remove “predicted reasoners”
We think that both of these techniques can be well-understood under the predictive modeling framework, though we are uncertain whether predictive modeling is the best framework—especially in the case of RLHF (reinforcement learning from human feedback). Later in Section 4 we’ll discuss in detail the question of whether RLHF fine-tuned models will be well-described as predictive.
In the case of sequential reasoning techniques such as chain of thought prompting, however, we think that the predictive modeling framework applies quite straightforwardly. Certainly—at the very least by giving models additional inference-time compute—sequential reasoning should enable models to solve tasks that they wouldn’t be able to do in a single forward pass. Nevertheless, if we believe that large language models are well-described as predictive models, then trusting any sequential reasoning they perform requires believing that they’re predicting one or more trustworthy reasoners. That means you have to understand what sort of reasoner the model was attempting to predict in each individual forward pass, which means you still have to do the same sort of careful conditioning that we’ll discuss in Section 2.
First, it’s not clear from the first quoted paragraph for what purpose you are looking for the “best framework”. Comparing models without specifying the purpose doesn’t make sense, methodologically. However, from the second paragraph, it appears that the purpose is “trusting reasoning”.
Second, since you already identified above that the predictive model (or the larger AI architecture it is part of) must not be a “laissez-faire pure predictor”, but essentially an agent with its own will (not to model whatever it is asked to model) and ethics (to discern things it should model from things it should not), I don’t understand why you insert the “proxy entity” of “trustworthy reasoners” that the AI should predict. We can just skip this step and ask that the AI be a trustworthy (ethical) reasoner itself.
Third, in order to increase our confidence that the AI is a trustworthy reasoner, we should model it at several levels: as a predictive model (the predictive processing theory of intelligence, which is what you are really pointing to here); with more concrete models of the AI architecture (depending on the architecture, these could be models of DL such as spline theory, or Deep Learning Theory, or many more); and with even more concrete mechanistic interpretability models of the concrete AI instance, built after it is trained (à la eliciting the model’s beliefs pre-RLHF, or concrete behaviours and their levels). And this is still far from enough! We should not be content solely with the predictive processing theory of cognition, but should also try to apply other theories of cognition to the same AI artifact and compare the predictions of these theories. Moreover, we should do the same at the other levels of modelling: apply multiple “competing” theories of ML/DL, and multiple competing theories of interpretability, all to the same model! Then, apart from modelling across the “stack of models/theories of intelligence”, we should also model the AI from rather different angles, e.g., as a dynamical system.
Finally, after all the above is done, we should also actually align with the predictive model on our disciplines of thought (semantics, epistemology, rationality, ethics; I call this type of alignment methodological in the next section) and then actually inner-align our world models. This naturally leads to the conclusion that the training and inner alignment processes should be iterative.
The ladder of “alignments”
Furthermore, we’ll also need it to be the case that our predictive models have a fixed, physical conceptualization of their “cameras.”
In Section 2, we’ll discuss the challenges that one might encounter trying to safely make use of a model that satisfies these criteria—as well as the particular challenge that leads us to require the latter criterion regarding the model’s conceptualization of its cameras. In short, we think that the thing to do here with the most potential to be safe and competitive is to predict humans doing complex tasks in the absence of AIs either in the present or the future. In general, we’ll refer to the sorts of challenges that arise in this setting—where we’re assuming that our model is the sort of predictor that we’re looking for—as outer alignment challenges (though the technical term should be training goal alignment, we think outer alignment is more clear as a term in this setting).[8]
Second, our training rationale: we believe that language model pre-training is relatively unlikely to produce deceptive agents and that the use of transparency and interpretability may be able to fill in the rest of the gap. We’ll discuss why we think this might work in Section 4. These sorts of challenges—those that arise in getting a model that is in fact a predictor in the way that we want—are the sorts of challenges that we’ll refer to as inner alignment challenges (technically training rationale alignment).
This usage of inner and outer alignment is somewhat contrary to how the terms were originally defined, since we won’t be talking about mesa-optimizers here. Since the original definitions don’t really apply in the predictive models context, however, we think our usage should be relatively unambiguous. To be fully technical, the way we’ll be using inner and outer alignment most closely matches up with the concepts of training goal alignment (for outer alignment) and training rationale alignment (for inner alignment).
I think the re-conceptualisation of inner and outer alignment that you suggest is confusing and unnecessary.
I think it’s better to leave inner and outer alignment roughly as they are, so that we don’t increase the overall confusion (“entropy”) in the AI safety research discourse. Instead, I suggest the following ladder of “alignments”:
Methodological alignment: alignment on core intelligence disciplines: semantics, epistemology, ethics, and rationality (and some others). What you call “conceptualisation of cameras” is actually alignment on the discipline of semantics, and therefore falls under the rubric of methodological alignment.
Belief alignment: alignment of our world models (i.e., belief distributions, from the space of beliefs) through grounding, and assuming we are already aligned on semantics and epistemology.
Goal alignment: alignment of goals (= beliefs about preferred future world states). In other words: aligning beliefs about the world at the current time + some delta (future belief = preference), whereas belief alignment is aligning beliefs just at the current time.
Note that, given we are methodologically aligned on ethics and rationality and belief-aligned on the current world models, goal alignment follows automatically: with the same ethics, the same rationality (including the ability to estimate the chances of success of this or that plan, which, of course, feeds into ethical deliberation), and (roughly) the same world models, we cannot help but choose the same goals and sketch out the same plans for achieving them.
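A deliberately simplistic sketch of why goal alignment would follow: if “ethics” is collapsed into a utility function over outcomes and “rationality” into expected-utility maximisation (both strong simplifying assumptions made only for this illustration, not part of the argument above), then two agents that share the utility function and the world model necessarily pick the same plan. The plan names, outcomes, and probabilities are hypothetical.

```python
def choose_plan(world_model, utility):
    """Pick the plan with the highest expected utility under the given beliefs."""
    def expected_utility(plan):
        return sum(prob * utility[outcome]
                   for outcome, prob in world_model[plan].items())
    return max(world_model, key=expected_utility)


# Shared "ethics": how much each outcome is valued (hypothetical numbers).
shared_utility = {"good": 1.0, "neutral": 0.0, "bad": -2.0}

# Shared beliefs: probability of each outcome under each plan.
shared_world_model = {
    "plan_a": {"good": 0.6, "neutral": 0.3, "bad": 0.1},
    "plan_b": {"good": 0.7, "neutral": 0.0, "bad": 0.3},
}

# An agent whose beliefs about plan_b differ (belief misalignment).
divergent_world_model = {
    "plan_a": {"good": 0.6, "neutral": 0.3, "bad": 0.1},
    "plan_b": {"good": 0.9, "neutral": 0.1, "bad": 0.0},
}

human_choice = choose_plan(shared_world_model, shared_utility)
ai_choice = choose_plan(shared_world_model, shared_utility)
misaligned_choice = choose_plan(divergent_world_model, shared_utility)

print("same ethics, same beliefs:     ", human_choice, ai_choice)
print("same ethics, different beliefs:", human_choice, misaligned_choice)
```

With identical inputs the choice is identical by construction; the divergence appears only once the world models differ, which is why belief alignment sits above goal alignment on the ladder.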
It should also be clear from the above that methodological alignment is the most fundamental and the most important to get right; belief alignment is second in importance; and we probably shouldn’t worry much about goal alignment at all.
Finally, I think it is confusing to call what you call “training rationale alignment” an “alignment” at all. It’s just an engineering challenge: producing an artifact according to the “specification”. It’s as confusing as describing the manufacturing of cars with the proper safety characteristics as “alignment”.
Indeed, we think that all physical observers (but we are mostly interested in comparatively intelligent agents) should “receive” semantics in the same way, and therefore, there is no fundamental difference between a human, a cybernetic theft-protecting system, and an LLM: “The relationship between images and words in visual-language models is exactly the same as in humans”.