All possible encoding schemes / universal priors differ from each other by at most a finite prefix. You might think this doesn’t achieve much, since the length of the prefix can be in principle unbounded; but in practice, the length of the prefix (or rather, the prior itself) is constrained by a system’s physical implementation. There are some encoding schemes which neither you nor any other physical entity will ever be able to implement, and so for the purposes of description length minimization these are off the table. And of the encoding schemes that remain on the table, virtually all of them will behave identically with respect to the description lengths they assign to “natural” versus “unnatural” optimization criteria.
It looks to me like the “updatelessness trick” you describe (essentially, behaving as though certain non-local branches of the decision tree are still counterfactually relevant even though they are not — although note that I currently don’t see an obvious way to use that to avoid the usual money pump against intransitivity) recovers most of the behavior we’d see under VNM anyway; and so I don’t think I understand your confusion re: VNM axioms.
E.g. can you give me a case in which (a) we have an agent that exhibits preferences against whose naive implementation there exists some kind of money pump (not necessarily a repeatable one), (b) the agent can implement the updatelessness trick in order to avoid the money pump without modifying their preferences, and yet (c) the agent is not then representable as having modified their preferences in the relevant way?
I think I might be missing something, because the argument you attribute to Dávid still looks wrong to me. You say:
The entropy of the simulators’ distribution need not be more than the entropy of the (square of the) wave function in any relevant sense. Despite the fact that subjective entropy may be huge, physical entropy is still low (because the simulations happen on a high-amplitude ridge of the wave function, after all).
Doesn’t this argument imply that the supermajority of simulations within the simulators’ subjective distribution over universe histories are not instantiated anywhere within the quantum multiverse?
I think it does. And if you accept this, then you should also expect, a priori, that the simulations instantiated by the simulators will be distinguishable from physical reality, because simulations indistinguishable from physical reality comprise a vanishingly small proportion of the simulators’ subjective probability distribution over universe histories. (The exception would be if you thought the simulators’ choice of which histories to instantiate was biased towards histories that correspond to other “high-amplitude ridges” of the wave function; but that makes no sense, because any such bias should already have been encoded within the simulators’ subjective distribution over universe histories.)
What this in turn means, however, is that prior to observation, a Solomonoff inductor (SI) must spread out much of its own subjective probability mass across hypotheses that predict finding itself within a noticeably simulated environment. Those are among the possibilities it must take into account—meaning, if you stipulate that it doesn’t find itself in an environment corresponding to any of those hypotheses, you’ve ruled out all of the “high-amplitude ridges” corresponding to instantiated simulations in the crossent of the simulators’ subjective distribution and reality’s distribution.
We can make this very stark: suppose our SI finds itself in an environment which, according to its prior over the quantum multiverse, corresponds to one high-amplitude ridge of the physical wave function, and zero high-amplitude ridges containing simulators that happened to instantiate that exact environment (either because no branches of the quantum multiverse happened to give rise to simulators that would have instantiated that environment, or because the environment in question simply wasn’t a member of any simulators’ subjective distributions over reality to begin with). Then the SI would immediately (correctly) conclude that it cannot be in a simulation.
Now, of course, the argument as I’ve presented it here is heavily reliant on the idea of our SI being an SI, in such a way that it’s not clear how exactly the argument carries over to the logically non-omniscient case. In particular, it relies on the SI being capable of discerning differences between very good simulations and perfect simulations, a feat which bounded reasoners cannot replicate; and it relies on the notion that our inability as bounded reasoners to distinguish between hypotheses at this level of granularity is best modeled in the SI case by stipulating that the SI’s actual observations are in fact consistent with its being instantiated within a base-level, high-amplitude ridge of the physical wave function—i.e. that our subjective inability to tell whether we’re in a simulation should be viewed as analogous to an SI being unable to tell whether it’s in a simulation because its observations actually fail to distinguish. I think this is the relevant analogy, but I’m open to being told (by you or by Dávid) why I’m wrong.
The AI has a similarly hard time to the simulators figuring out what’s a plausible configuration to arise from the big bang. Like the simulators have an entropy N distribution of possible AIs, the AI itself also has an entropy N distribution for that. So its probability that it’s in a real Everett branch is not p, but p times 2^-N, as it has only a 2^-N prior probability that the kind of world it observes is the kind of thing that can come up in a real Everett branch. So it’s balanced out with the simulation hypothesis, and as long as the simulators are spending more planets, that hypothesis wins.
If I imagine the AI as a Solomonoff inductor, this argument looks straightforwardly wrong to me: of the programs that reproduce (or assign high probability to, in the setting where programs produce probabilistic predictions of observations) the AI’s observations, some of these will do so by modeling a branching quantum multiverse and sampling appropriately from one of the branches, and some of them will do so by modeling a branching quantum multiverse, sampling from a branch that contains an intergalactic spacefaring civilization, locating a specific simulation within that branch, and sampling appropriately from within that simulation. Programs of the second kind will naturally have higher description complexity than programs of the first kind; both kinds feature a prefix that computes and samples from the quantum multiverse, but only the second kind carries out the additional step of locating and sampling from a nested simulation.
(You might object on the grounds that there are more programs of the second kind than of the first kind, and that the probability the AI is in a simulation at all requires summing over all such programs; but this has to be balanced against the fact that most if not all of these programs will be sampling from branches much later in time than programs of the first kind, and will hence be sampling from a quantum multiverse with exponentially more branches; and not all of these branches will contain spacefaring civilizations, or spacefaring civilizations interested in running ancestor simulations, or spacefaring civilizations interested in running ancestor simulations who happen to be running a simulation that exactly reproduces the AI’s observations. So this counter-counterargument doesn’t work, either.)
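To make the description-length comparison explicit, here is a rough sketch in standard Solomonoff-prior notation (my own formalization; none of these symbols appear in the original discussion). Write \(q\) for the shared prefix that computes and samples from the quantum multiverse, \(b\) for the code that picks out a branch, and \(s\) for the additional code that locates and samples from a nested simulation:

\[
M(\text{obs}) = \sum_{p \,:\, p \text{ predicts obs}} 2^{-\ell(p)}, \qquad
\ell(p_{\text{1st kind}}) \approx \ell(q) + \ell(b), \qquad
\ell(p_{\text{2nd kind}}) \approx \ell(q) + \ell(b') + \ell(s),
\]

so each program of the second kind is down-weighted by roughly a factor of \(2^{-\ell(s)}\) relative to its first-kind counterpart, and the summation in the parenthetical above has to overcome that penalty.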
These two kinds of “learning” are not synonymous. Adaptive systems “learn” things, but they don’t necessarily “learn about” things; they don’t necessarily have an internal map of the external territory. (Yes, the active inference folks will bullshit about how any adaptive system must have a map of the territory, but their math does not substantively support that interpretation.) The internal heuristics or behaviors “learned” by an adaptive system are not necessarily “about” any particular external thing, and don’t necessarily represent any particular external thing.
I think I am confused both about whether I think this is true, and about how to interpret it in such a way that it might be true. Could you go into more detail on what it means for a learner to learn something without there being some representational semantics that could be used to interpret what it’s learned, even if the learner itself doesn’t explicitly represent those semantics? Or is the lack of explicit representation actually the core substance of the claim here?
It seems the SOTA for training LLMs has (predictably) pivoted away from pure scaling of compute + data, and towards RL-style learning based on (synthetic?) reasoning traces (mainly CoT, in the case of o1). AFAICT, this basically obviates safety arguments that relied on “imitation” as a key source of good behavior, since now additional optimization pressure is being applied towards correct prediction rather than pure imitation.
Strictly speaking, this seems very unlikely, since we know that e.g. CoT increases the expressive power of Transformers.
Ah, yeah, I can see how I might’ve been unclear there. I was implicitly taking CoT into account when I talked about the “base distribution” of the model’s outputs, as it’s essentially ubiquitous across these kinds of scaffolding projects. I agree that if you take a non-recurrent model’s O(1) output and equip it with a form of recurrent state that you permit to continue for O(n) iterations, that will produce a qualitatively different distribution of outputs than the O(1) distribution.
In that sense, I readily admit CoT into the class of improvements I earlier characterized as “shifted distribution”. I just don’t think this gets you very far in terms of the overarching problem, since the recurrent O(n) distribution is the one whose output I find unimpressive, and the method that was used to obtain it from the (even less impressive) O(1) distribution is a one-time trick.[1]
And also intuitively, I expect, for example, that Sakana’s agent would be quite a bit worse without access to Semantic search for comparing idea novelty; and that it would probably be quite a bit better if it could e.g. retrieve embeddings of full paragraphs from papers, etc.
I also agree that another way to obtain a higher quality output distribution is to load relevant context from elsewhere. This once more seems to me like something of a red herring when it comes to the overarching question of how to get an LLM to produce human- or superhuman-level research; you can load its context with research humans have already done, but this is again a one-time trick, and not one that seems like it would enable novel research built atop the human-written research unless the base model possesses a baseline level of creativity and insight, etc.[2]
If you don’t already share (or at least understand) a good chunk of my intuitions here, the above probably sounds at least a little like I’m carving out special exceptions: conceding each point individually, while maintaining that they bear little on my core thesis. To address that, let me attempt to put a finger on some of the core intuitions I’m bringing to the table:
On my model of (good) scientific research de novo, a lot of key cognitive work occurs during what you might call “generation” and “synthesis”, where “generation” involves coming up with hypotheses that merit testing, picking the most promising of those, and designing a robust experiment that sheds insight; “synthesis” then consists of interpreting the experimental results so as to figure out the right takeaway (which very rarely ought to look like “we confirmed/disconfirmed the starting hypothesis”).
Neither of these steps are easily transmissible, since they hinge very tightly on a given individual’s research ability and intellectual “taste”; and neither of them tend to end up very well described in the writeups and papers that are released afterwards. This is hard stuff even for very bright humans, which implies to me that it requires a very high quality of thought to manage consistently. And it’s these steps that I don’t think scaffolding can help much with; I think the model has to be smart enough, at baseline, that its landscape of cognitive reachability contains these kinds of insights, before they can be elicited via an external method like scaffolding.[3]
1. I’m not sure whether you could theoretically obtain greater benefits from allowing more than O(n) iterations, but either way you’d start to bump up against context window limitations fairly quickly. ↩︎
2. Consider the extreme case where we prompt the model with (among other things) a fully fleshed out solution to the AI alignment problem, before asking it to propose a workable solution to the AI alignment problem; it seems clear enough that in this case, almost all of the relevant cognitive work happened before the model even received its prompt. ↩︎
3. I’m uncertain-leaning-yes on the question of whether you can get to a sufficiently “smart” base model via mere continued scaling of parameter count and data size; but that connects back to the original topic of whether said “smart” model would need to be capable of goal-directed thinking, on which I think I agree with Jeremy that it would; much of my model of good de novo research, described above, seems to me to draw on the same capabilities that characterize general-purpose goal-direction. ↩︎
And I suspect we probably can, given scaffolds like https://sakana.ai/ai-scientist/ and its likely improvements (especially if done carefully, e.g. integrating something like Redwood’s control agenda, etc.). I’d be curious where you’d disagree (since I expect you probably would) - e.g. do you expect the AI scientists become x-risky before they’re (roughly) human-level at safety research, or they never scale to human-level, etc.?
Jeremy’s response looks to me like it mostly addresses the first branch of your disjunction (AI becomes x-risky before reaching human-level capabilities), so let me address the second:
I am unimpressed by the output of the AI scientist. (To be clear, this is not the same thing as being unimpressed by the work put into it by its developers; it looks to me like they did a great job.) Mostly, however, the output looks to me basically like what I would have predicted, on my prior model of how scaffolding interacts with base models, which goes something like this:
A given model has some base distribution on the cognitive quality of its outputs, which is why resampling can sometimes produce better or worse responses to inputs. What scaffolding does is to essentially act as a more sophisticated form of sampling based on redundancy: having the model check its own output, respond to that output, etc. This can be very crudely viewed as an error correction process that drives down the probability that a “mistake” at some early token ends up propagating throughout the entirety of the scaffolding process and unduly influencing the output, which biases the quality distribution of outputs away from the lower tail and towards the upper tail.
The key moving piece on my model, however, is that all of this is still a function of the base distribution—a rough analogy here would be to best-of-n sampling. And the problem with best-of-n sampling, which looks to me like it carries over to more complicated scaffolding, is that as n increases, the mean of the resulting distribution increases as a sublinear (actually, logarithmic) function of n, while the variance decreases at a similar rate (but even this is misleading, since the resulting distribution will have negative skew, meaning variance decreases more rapidly in the upper tail than in the lower tail).
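As a sanity check on the best-of-n analogy, here is a minimal simulation sketch (my own toy setup, assuming a Gaussian base quality distribution, which is not something the argument above commits to):

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_n_stats(n, trials=10_000):
    """Empirical mean and 10th/90th percentiles of the best of n base samples."""
    samples = rng.standard_normal((trials, n))  # toy "base quality" distribution
    best = samples.max(axis=1)
    return best.mean(), np.percentile(best, 10), np.percentile(best, 90)

for n in [1, 4, 16, 64, 256, 1024]:
    mean, p10, p90 = best_of_n_stats(n)
    # The mean climbs only sublinearly in n, and the spread (p90 - p10) narrows,
    # so past a point extra resampling buys little additional quality.
    print(f"n={n:5d}  mean={mean:5.2f}  p10={p10:5.2f}  p90={p90:5.2f}")
```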
Anyway, the upshot of all of this is that scaffolding cannot elicit capabilities that were not already present (in some strong sense) in the base model—meaning, if the base models in question are strongly subhuman at something like scientific research (which it presently looks to me like they still are), scaffolding will not bridge that gap for them. The only thing that can close that gap without unreasonably large amounts of scaffolding, where “unreasonable” here means something a complexity theorist would consider unreasonable, is a shifted base distribution. And that corresponds to the kind of “useful [superhuman] capabilities” Jeremy is worried about.
I’m interested! Also curious as to how this is implemented; are you using retrieval-augmented generation, and if so, with what embeddings?
Epistemic status: exploratory, “shower thought”, written as part of a conversation with Claude:
For any given entity (broadly construed here to mean, essentially, any physical system), it is possible to analyze that entity as follows:
Define the set of possible future trajectories that entity might follow, according to some suitably uninformative ignorance prior on its state and (generalized) environment. Then ask, of that set, whether there exists some simple, obvious, or otherwise notable prior on the set in question that assigns probabilities to various member trajectories in such a way as to establish an upper level set of some kind. Then ask, of that upper level set, how large it is relative to the size of the set as a whole, and (relatedly) how large the difference is between the probability of that upper set’s least probable member and its most probable nonmember. (If you want to conceptualize these sets as infinite and open—although it’s unclear to me that one needs to conceptualize them this way—then you can speak instead of “infimum” and “supremum”.)
The claim is that, for certain kinds of system, there will be quite a sharp difference between the upper level set and the lower level set, constituting a “plausibility gap”: trajectories within the upper set are in some sense “plausible” ways of extrapolating the system forward in time. And then the relative size of that upper set becomes relevant, because it indicates how tightly constrained the system’s time-evolution is by its present state (and environment). So the claim is that there are certain systems whose forwards time-evolution is very tightly constrained indeed, and these systems are “agents”; and there are systems for which barely any upper level set exists, and these are “simplistic” entities whose behavior is essentially entropic. And humans (seem to me to) occupy a median position between these two extremes.
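Here is a rough attempt to operationalize the above in code (entirely my own toy formalization, with arbitrary numbers; the claim itself is stated only in prose):

```python
import numpy as np

def upper_level_set_size(probs, mass=0.95):
    """Number of most-probable trajectories needed to cover `mass` of the prior."""
    sorted_probs = np.sort(probs)[::-1]
    return int(np.searchsorted(np.cumsum(sorted_probs), mass)) + 1

rng = np.random.default_rng(0)
n = 1000  # candidate future trajectories

# A tightly constrained ("agent-like") system concentrates its plausible futures
# on a few trajectories; an "entropic" system spreads them out nearly uniformly.
agent_like = rng.dirichlet(np.full(n, 0.01))
entropic = np.full(n, 1.0 / n)

for name, p in [("agent-like", agent_like), ("entropic", entropic)]:
    k = upper_level_set_size(p)
    print(f"{name}: {k}/{n} trajectories carry 95% of the prior mass")
```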
One additional wrinkle, however, is that “agency”, as I’ve defined it here, may additionally play the role of a (dynamical system) attractor: entities already close to having full agency will be more tightly constrained in their future evolution, generally in the direction of becoming ever more agentic; meanwhile, entirely inanimate systems are not at all pulled in the direction of becoming more constrained or agentic; they are outside of the agency attractor’s basin of attraction. However, humans, if they indeed exist at some sort of halfway point between fully coherent agency and a complete lack of coherence, are left interestingly placed under this framing: we would exist at the boundary of the agency attractor’s basin of attraction. And since many such boundaries are fundamentally fractal or chaotic in nature, that could have troubling implications for the trajectories of points along those boundaries trying to reach reflective equilibrium, as it were.
The rule of thumb test I tend to use to assess proposed definitions of agency (at least from around these parts) is whether they’d class a black hole as an agent. It’s not clear to me whether this definition does; I would have said it very likely does based on everything you wrote, except for this one part here:
A cubic meter of rock has a persistent boundary over time, but no interior states in an informational sense, and is therefore not an agent. To see that it has no interior, note that anything that puts information into the surface layer of the rock transmits that same information into the very interior (vibrations, motion, etc).
I think I don’t really understand what is meant by “no interior” here, or why the argument given supports the notion that a cubic meter of rock has no interior. You can draw a Markov boundary around the rock’s surface, and then the interior state of the rock definitely is independent of the exterior environment conditioned on said boundary, right?
If I try very hard to extract a meaning out of the quoted paragraph, I might guess (with very low confidence) that what it’s trying to say is that a rock’s internal state has a one-to-one relation with the external forces or stimuli that transmit information through its surface, but in this case a black hole passes the test, in that the black hole’s internal state definitely is not one-to-one with the information entering through its event horizon. In other words, if my very low-confidence understanding of the quoted paragraph is correct, then black holes are classified as agents under this definition.
(This test is of interest to me because black holes tend to pass other, potentially related definitions of agency, such as agency as optimization, agency as compression, etc. I’m not sure whether this says that something is off with our intuitive notion of agency, that something is off with our attempts at rigorously defining it, or simply that black holes are a special kind of “physical agent” built in-to the laws of physics.)
How is a Bayesian agent supposed to modify priors except by updating on the basis of evidence?
They’re not! But humans aren’t ideal Bayesians, and it’s entirely possible for them to update in a way that does change their priors (encoded by intuitions) moving forward. In particular, the difference between having updated one’s intuitive prior, and keeping the intuitive prior around but also keeping track of a different, consciously held posterior, is that the former is vastly less likely to “de-update”, because the evidence that went into the update isn’t kept around in a form that subjects it to (potential) refutation.
(IIRC, E.T. Jaynes talks about this distinction in Chapter 18 of Probability Theory: The Logic of Science, and he models it by introducing something he calls an A_p distribution. His exposition of this idea is uncharacteristically unclear, and his A_p distribution looks basically like a beta distribution with specific values for α and β, but it does seem to capture the distinction I see between “intuitive” updating versus “conscious” updating.)
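A toy sketch of the distinction I have in mind (my own illustration, not Jaynes’ actual construction; the Beta parameterization is just the one I gestured at above):

```python
from dataclasses import dataclass

@dataclass
class BetaBelief:
    alpha: float
    beta: float

    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

intuitive_prior = BetaBelief(2.0, 2.0)   # the belief encoded in intuition
argument_evidence = (8, 1)               # pseudo-counts contributed by an argument

# "Conscious" updating: keep the prior around, track the posterior separately.
posterior = BetaBelief(intuitive_prior.alpha + argument_evidence[0],
                       intuitive_prior.beta + argument_evidence[1])

# "Intuitive" updating: overwrite the prior itself and discard the evidence.
intuitive_prior_after = posterior

# If the argument is later refuted, the conscious updater can subtract the
# pseudo-counts and recover the old prior; the intuitive updater cannot,
# because the evidence that produced the shift is no longer represented.
print(intuitive_prior.mean(), posterior.mean(), intuitive_prior_after.mean())
```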
There’s also a failure mode of focusing on “which arguments are the best” instead of “what is actually true”. I don’t understand this failure mode very well, except that I’ve seen myself and others fall into it. Falling into it looks like focusing a lot on specific arguments, and spending a lot of time working out what was meant by the words, rather than feeling comfortable adjusting arguments to fit better into your own ontology and to fit better with your own beliefs.
My sense is that this is because different people have different intuitive priors, and process arguments (mostly) as a kind of Bayesian evidence that updates those priors, rather than modifying the priors (i.e. intuitions) directly.
Eliezer in particular strikes me as having an intuitive prior for AI alignment outcomes that looks very similar to priors for tasks like writing bug-free software on the first try, assessing the likelihood that a given plan will play out as envisioned, correctly compensating for optimism bias, etc., which is what gives rise to posts concerning concepts like security mindset.
Other people don’t share this intuitive prior, and so have to be argued into it. To such people, the reliability of the arguments in question is actually critical, because if those arguments turn out to have holes, that reverts the downstream updates and restores the original intuitive prior, whatever it looked like—kind of like a souped up version of the burden of proof concept, where the initial placement of that burden is determined entirely via the intuitive judgement of the individual.
This also seems related to why different people seem to naturally gravitate towards either conjunctive or disjunctive models of catastrophic outcomes from AI misalignment: the conjunctive impulse stems from an intuition that AI catastrophe is a priori unlikely, and so a bunch of different claims have to hold simultaneously in order to force a large enough update, whereas the disjunctive impulse stems from the notion that any given low-level claim need not be on particularly firm ground, because the high-level thesis of AI catastrophe robustly manifests via different but converging lines of reasoning.
See also: the focus on coherence, where some people place great importance on the question of whether VNM or other coherence theorems show what Eliezer et al. purport they show about superintelligent agents, versus the competing model wherein none of these individual theorems are important in their particulars, so much as the direction they seem to point, hinting at the concept of what idealized behavior with respect to non-gerrymandered physical resources ought to look like.
I think the real question, then, is where these differences in intuition come from, and unfortunately the answer might have to do a lot with people’s backgrounds, and the habits and heuristics they picked up from said backgrounds—something quite difficult to get at via specific, concrete argumentation.
Can we not speak of apparent coherence relative to a particular standpoint? If a given system seems to be behaving in such a way that you personally can’t see a way to construct for it a Dutch book, a series of interactions with it such that energy/negentropy/resources can be extracted from it and accrue to you, that makes the system inexploitable with respect to you, and therefore at least as coherent as you are. The closer to maximal coherence a given system is, the less it will visibly depart from the appearance of coherent behavior, and hence utility function maximization; the fact that various quibbles can be made about various coherence theorems does not seem to me to negate this conclusion.
Humans are more coherent than mice, and there are activities and processes which individual humans occasionally undergo in order to emerge more coherent than they did going in; in some sense this is the way it has to be, in any universe where (1) the initial conditions don’t start out giving you fully coherent embodied agents, and (2) physics requires continuity of physical processes, so that fully formed coherent embodied agents can’t spring into existence where there previously were none; there must be some pathway from incoherent, inanimate matter from which energy may be freely extracted, to highly organized configurations of matter from which energy may be extracted only with great difficulty, if it can be extracted at all.
If you expect the endpoint of that process to not fully accord with the von Neumann-Morgenstern axioms, because somebody once challenged the completeness axiom, independence axiom, continuity axiom, etc., the question still remains as to whether departures from those axioms will give rise to exploitable holes in the behavior of such systems, from the perspective of much weaker agents such as ourselves. And if the answer is “no”, then it seems to me the search for ways to make a weaker, less coherent agent into a stronger, more coherent agent is well-motivated, and necessary—an appeal to consequences in a certain sense, yes, but one that I endorse!
I seem to recall hearing a phrase I liked, which appears to concisely summarize the concern as: “There’s no canonical way to scale me up.”
Does that sound right to you?
Well, if we’re following standard ML best practices, we have a train set, a dev set, and a test set. The purpose of the dev set is to check and ensure that things are generalizing properly. If they aren’t generalizing properly, we tweak various hyperparameters of the model and retrain until they do generalize properly on the dev set. Then we do a final check on the test set to ensure we didn’t overfit the dev set. If you forgot or never learned this stuff, I highly recommend brushing up on it.
(Just to be clear: yes, I know what training and test sets are, as well as dev sets/validation sets. You might notice I actually used the phrase “validation set” in my earlier reply to you, so it’s not a matter of guessing someone’s password—I’m quite familiar with these concepts, as someone who’s implemented ML models myself.)
Generally speaking, training, validation, and test datasets are all sourced the same way—in fact, sometimes they’re literally sourced from the same dataset, and the delineation between train/dev/test is introduced during training itself, by arbitrarily carving up the original dataset into smaller sets of appropriate size. This may capture the idea of “IID” you seem to appeal to elsewhere in your comment—that it’s possible to test the model’s generalization performance on some held-out subset of data from the same source(s) it was trained on.
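For concreteness, the carving-up procedure I’m describing looks something like this (a generic sketch of standard practice, not code from any particular project):

```python
import numpy as np

def train_dev_test_split(data, dev_frac=0.1, test_frac=0.1, seed=0):
    """Arbitrarily partition one source dataset into train/dev/test subsets."""
    idx = np.random.default_rng(seed).permutation(len(data))
    n_test = int(len(data) * test_frac)
    n_dev = int(len(data) * dev_frac)
    test = [data[i] for i in idx[:n_test]]
    dev = [data[i] for i in idx[n_test:n_test + n_dev]]
    train = [data[i] for i in idx[n_test + n_dev:]]
    return train, dev, test

# Because all three subsets come from the same underlying source, held-out
# performance on dev/test only measures generalization within that source's
# distribution -- which is the "IID" sense at issue here.
```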
In ML terms, what the thought experiment points to is a form of underlying distributional shift, one that isn’t (and can’t be) captured by “IID” validation or test datasets. The QFT model in particular highlights the extent to which your training process, however broad or inclusive from a parochial human standpoint, contains many incidental distributional correlates to your training signal which (1) exist in all of your data, including any you might hope to rely on to validate your model’s generalization performance, and (2) cease to correlate off-distribution, during deployment.
This can be caused by what you call “omniscience”, but it need not; there are other, more plausible distributional differences that might be picked up on by other kinds of models. But QFT is (as far as our current understanding of physics goes) very close to the base ontology of our universe, and so what is inferrable using QFT is naturally going to be very different from what is inferrable using some other (less powerful) ontology. QFT is a very powerful ontology!
If you want to call that “omniscience”, you can, although note that strictly speaking the model is still just working from inferences from training data. It’s just that, if you feed enough data to a model that can hold entire swaths of the physical universe inside of its metaphorical “head”, pretty soon hypotheses that involve the actual state of that universe will begin to outperform hypotheses that don’t, and which instead use some kind of lossy approximation of that state involving intermediary concepts like “intent”, “belief”, “agent”, “subjective state”, etc.
In principle we could construct a test set or dev set either before or after the model has been trained. It shouldn’t make a difference under normal circumstances. It sounds like maybe you’re discussing a scenario where the model has achieved a level of omniscience, and it does fine on data that was available during its training, because it’s able to read off of an omniscient world-model. But then it fails on data generated in the future, because the translation method for its omniscient world-model only works on artifacts that were present during training. Basically, the time at which the data was generated could constitute a hidden and unexpected source of distribution shift. Does that summarize the core concern?
You’re close; I’d say the concern is slightly worse than that. It’s that the “future data” never actually comes into existence, at any point. So the source of distributional shift isn’t just “the data is generated at the wrong time”, it’s “the data never gets externally generated to begin with, and you (the model) have to work with predictions of what the data counterfactually would have been, had it been generated”.
(This would be the case e.g. with any concept of “human approval” that came from a literal physical human or group of humans during training, and not after the system was deployed “in the wild”.)
In any case, I would argue that “accidental omniscience” characterizes the problem better than “alien abstractions”. As before, you can imagine an accidentally-omniscient model that uses vanilla abstractions, or a non-omniscient model that uses alien ones.
The problem is that “vanilla” abstractions are not the most predictively useful possible abstractions, if you’ve got access to better ones. And models whose ambient hypothesis space is broad enough to include better abstractions (from the standpoint of predictive accuracy) will gravitate towards those, as is incentivized by the outer form of the training task. QFT is the extreme example of a “better abstraction”, but in principle (if the natural abstraction hypothesis fails) there will be all sorts and shapes of abstractions, and some of them will be available to us, and some of them will be available to the model, and these sets will not fully overlap—which is a concern in worlds where different abstractions lead to different generalization properties.
I think it ought to be possible for someone to always be present. [I’m also not sure it would be necessary.]
I think I don’t understand what you’re imagining here. Are you imagining a human manually overseeing all outputs of something like ChatGPT, or Microsoft Copilot, before those outputs are sent to the end user (or, worse yet, put directly into production)?
[I also think I don’t understand why you make the bracketed claim you do, but perhaps hashing that out isn’t a conversational priority.]
As I understand this thought experiment, we’re doing next-token prediction on e.g. a book written by a philosopher, and in order to predict the next token using QFT, the obvious method is to use QFT to simulate the philosopher. But that’s not quite enough—you also need to read the next token out of that QFT-based simulation if you actually want to predict it.
It sounds like your understanding of the thought experiment differs from mine. If I were to guess, I’d guess that by “you” you’re referring to someone or something outside of the model, who has access to the model’s internals, and who uses that access to, as you say, “read” the next token out of the model’s ontology. However, this is not the setup we’re in with respect to actual models (with the exception perhaps of some fairly limited experiments in mechanistic interpretability)—and it’s also not the setup of the thought experiment, which (after all) is about precisely what happens when you can’t read things out of the model’s internal ontology, because it’s too alien to be interpreted.
In other words: “you” don’t read the next token out of the QFT simulation. The model is responsible for doing that translation work. How do we get it to do that, even though we don’t know how to specify the nature of the translation work, much less do it ourselves? Well, simple: in cases where we have access to the ground truth of the next token, e.g. because we’re having it predict an existing book passage, we simply penalize it whenever its output fails to match the next token in the book. In this way, the model can be incentivized to correctly predict whatever we want it to predict, even if we wouldn’t know how to tell it explicitly to do whatever it’s doing.
(The nature of this relationship—whereby humans train opaque algorithms to do things they wouldn’t themselves be able to write out as pseudocode—is arguably the essence of modern deep learning in toto.)
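Schematically, the incentive I’m describing is just the ordinary next-token loss (a generic sketch assuming a PyTorch-style model; nothing here is specific to the QFT hypothetical):

```python
import torch
import torch.nn.functional as F

def next_token_training_step(model, optimizer, tokens):
    """tokens: LongTensor (batch, seq_len) of ground-truth text, e.g. the book."""
    logits = model(tokens[:, :-1])                   # model's predictions at each position
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),         # (batch * seq, vocab)
        tokens[:, 1:].reshape(-1),                   # the tokens that actually came next
    )
    # The penalty refers only to the observable next token, never to how the
    # model internally represents the process that generated it.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```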
For one thing, in a standard train/dev/test setup, the model is arguably not really doing prediction, it’s doing retrodiction. It’s making ‘predictions’ about things which already happened in the past. The final model is chosen based on what retrodicts the data the best.
Yes, this is a reasonable description to my eyes. Moreover, I actually think it maps fairly well to the above description of how a QFT-style model might be trained to predict the next token of some body of text; in your terms, this is possible specifically because the text already exists, and retrodictions of that text can be graded based on how well they compare against the ground truth.
Also, usually the data is IID rather than sequential—there’s no time component to the data points (unless it’s a time-series problem, which it usually isn’t).
This, on the other hand, doesn’t sound right to me. Yes, there are certainly applications where the training regime produces IID data, but next-token prediction is pretty clearly not one of those? Later tokens are highly conditionally dependent on previous tokens, in a way that’s much closer to a time series than to some kind of IID process. Possibly part of the disconnect is that we’re imagining different applications entirely—which might also explain our differing intuitions w.r.t. deployment?
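Concretely, the next-token objective factorizes autoregressively rather than treating tokens as IID draws; writing this out (standard notation, nothing specific to this thread):

\[
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_{<t}\right),
\]

which is exactly the structure of a time series: each token is conditioned on everything that came before it.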
The fact that we’re choosing a model which retrodicts well is why the presence/absence of a human is generally assumed to be irrelevant, and emphasizing this factor sounds wacky to my ML engineer ears.
Right, so just to check that we’re on the same page: do we agree that after a (retrodictively trained) model is deployed for some use case other than retrodicting existing data—for generative use, say, or for use in some kind of online RL setup—it’ll be doing something other than retrodicting? And that in that situation, the source of (retrodictable) ground truth that was present during training—whether that was a book, a philosopher, or something else—will be absent?
If we do actually agree about that, then that distinction is really all I’m referring to! You can think of it as training set versus test set, to use a more standard ML analogy, except in this case the “test set” isn’t labeled at all, because no one labeled it in advance, and also it’s coming in from an unpredictable outside world rather than from a folder on someone’s hard drive.
Why does that matter? Well, because then we’re essentially at the mercy of the model’s generalization properties, in a way we weren’t while it was retrodicting the training set (or even the validation set, if one of those existed). If it gets anything wrong, there’s no longer any training signal or gradient to penalize it for being “wrong”—so the only remaining question is, just how likely is it to be “wrong”, after being trained for however long it was trained?
And that’s where the QFT model comes in. It says, actually, even if you train me for a good long while on a good amount of data, there are lots of ways for me to generalize “wrongly” from your perspective, if I’m modeling the universe at the level of quantum fields. Sure, I got all the retrodictions right while there was something to be retrodicted, but what exactly makes you think I did that by modeling the philosopher whose remarks I was being trained on?
Maybe I was predicting the soundwaves passing through a particular region of air in the room where he was located—or perhaps I was predicting the pattern of physical transistors in the segment of memory of a particular computer containing his works. Those physical locations in spacetime still exist, and now that I’m deployed, I continue to make predictions using those as my referent—except, the encodings I’m predicting there no longer resemble anything like coherent moral philosophy, or coherent anything, really.
The philosopher has left the room, or the computer’s memory has been reconfigured—so what exactly are the criteria by which I’m supposed to act now? Well, they’re going to be something, presumably—but they’re not going to be something explicit. They’re going to be something implicit to my QFT ontology, something that—back when the philosopher was there, during training—worked in tandem with the specifics of his presence, and the setup involving him, to produce accurate retrodictions of his judgements on various matters.
Now that that’s no longer the case, those same criteria describe some mathematical function that bears no meaningful correspondence to anything a human would recognize, valuable or not—but the function exists, and it can be maximized. Not much can be said about what maximizing that function might result in, except that it’s unlikely to look anything like “doing right according to the philosopher”.
That’s why the QFT example is important. A more plausible model, one that doesn’t think natively in terms of quantum amplitudes, permits the possibility of correctly compressing what we want it to compress—of learning to retrodict, not some strange physical correlates of the philosopher’s various motor outputs, but the actual philosopher’s beliefs as we would understand them. Whether that happens, or whether a QFT-style outcome happens instead, depends in large part on the inductive biases of the model’s architecture and the training process—inductive biases on which the natural abstraction hypothesis asserts a possible constraint.
I’m confused about what it means to “remove the human”, and why it’s so important whether the human is ‘removed’.
Because the human isn’t going to constantly be present for everything the system does after it’s deployed (unless for some reason it’s not deployed).
If I can assume that stuff, then it feels like a fairly core task, abundantly stress-tested during training, to read off the genius philosopher’s spoken opinions about e.g. moral philosophy from the quantum fields. How else could quantum fields be useful for next-token predictions?
Quantum fields are useful for an endless variety of things, from modeling genius philosophers to predicting lottery numbers. If your next-token prediction task involves any physically instantiated system, a model that uses QFT will be able to predict that system’s time-evolution with alacrity.
(Yes, this is computationally intractable, but we’re already in full-on hypothetical land with the QFT-based model to begin with. Remember, this is an exercise in showing what happens in the worst-case scenario for alignment, where the model’s native ontology completely diverges from our own.)
So we need not assume that predicting “the genius philosopher” is a core task. It’s enough to assume that the model is capable of it, among other things—which a QFT-based model certainly would be. Which, not so coincidentally, brings us to your next question:
Is alignment supposed to be hard in this hypothetical because the AI can’t represent human values in principle? Or is it supposed to be hard because it also has a lot of unsatisfactory representations of human values, and there’s no good method for finding a satisfactory needle in the unsatisfactory haystack? Or some other reason?
Consider how, during training, the human overseer (or genius philosopher, if you prefer) would have been pointed out to the model. We don’t have reliable access to its internal world-model, and even if we did we’d see blobs of amplitude and not much else. There’s no means, in that setting, of picking out the human and telling the model to unambiguously defer to that human.
What must happen instead, then, is something like next-token prediction: we perform gradient descent (or some other optimization method; it doesn’t really matter for the purposes of our story) on the model’s outputs, rewarding it when its outputs happen to match those of the human. The hope is that this will lead, in the limit, to the matching no longer occurring by happenstance—that if we train for long enough and in a varied enough set of situations, the best way for the model to produce outputs that track those of the human is to model that human, even in its QFT ontology.
But do we know for a fact that this will be the case? Even if it is, what happens when the overseer isn’t present to provide their actual feedback, a situation that never arose during training? What becomes the model’s referent then? We’d like to deploy it without an overseer, or in situations too complex for an overseer to understand. And whether the model’s behavior in those situations conforms to what the overseer would want, ideally, depends on what kinds of behind-the-scenes extrapolation the model is doing—which, if the model’s native ontology is something in which “human philosophers” are not basic objects, is liable to look very weird indeed.
This sounds a lot like saying “it might fail to generalize”.
Sort of, yes—but I’d call it “malgeneralization” rather than “misgeneralization”. It’s not failing to generalize, it’s just not generalizing the way you’d want it to.
Supposing we make a lot of progress on out-of-distribution generalization, is alignment getting any easier according to you? Wouldn’t that imply our systems are getting better at choosing proxies which generalize even when the human isn’t ‘present’?
Depends on what you mean by “progress”, and “out-of-distribution”. A powerful QFT-based model can make perfectly accurate predictions in any scenario you care to put it in, so it’s not like you’ll observe it getting things wrong. What experiments, and experimental outcomes, are you imagining here, such that those outcomes would provide evidence of “progress on out-of-distribution generalization”, when fundamentally the issue is expected to arise in situations where the experimenters are themselves absent (and which—crucially—is not a condition you can replicate as part of an experimental setup)?
I’d assume that when we tell it, “optimize this company, in a way that we would accept, after a ton of deliberation”, this could be instead described as, “optimize this company, in a way that we would accept, after a ton of deliberation, where these terms are described using our ontology”
The problem shows up when the system finds itself acting in a regime where the notion of us (humans) “accepting” its optimizations becomes purely counterfactual, because no actual human is available to oversee its actions in that regime. Then the question of “would a human accept this outcome?” must ground itself somewhere in the system’s internal model of what those terms refer to, which (by hypothesis) need not remotely match their meanings in our native ontology.
This isn’t (as much of) a problem in regimes where an actual human overseer is present (setting aside concerns about actual human judgement being hackable because we don’t implement our idealized values, i.e. outer alignment), because there the system’s notion of ground truth actually is grounded by the validation of that overseer.
You can have a system that models the world using quantum field theory, task it with predicting the energetic fluctuations produced by a particular set of amplitude spikes corresponding to a human in our ontology, and it can perfectly well predict whether those fluctuations encode sounds or motor actions we’d interpret as indications of approval or disapproval—and as long as there’s an actual human there to be predicted, the system will do so without issue (again modulo outer alignment concerns).
But remove the human, and suddenly the system is no longer operating based on its predictions of the behavior of a real physical system, and is instead operating from some learned counterfactual representation consisting of proxies in its native QFT-style ontology which happened to coincide with the actual human’s behavior while the human was present. And that learned representation, in an ontology as alien as QFT, is (assuming the falsehood of the natural abstraction hypothesis) not going to look very much like the human we want it to look like.
Your phrasing here is vague and somewhat convoluted, so I have difficulty telling if what you say is simply misleading, or false. Regardless:
If you have UTM1 and UTM2, there is a constant-length prefix P such that UTM1, given P prepended to some further bitstring as input, will compute whatever UTM2 computes given only that bitstring as input; we can say of P that it “encodes” UTM2 relative to UTM1. This being the case, the description length any given function receives under UTM1 differs from the description length it receives under UTM2 by at most len(P), because whenever a function would otherwise only be encodable in UTM1 by bitstrings longer than len(P) + [the length of the shortest bitstring encoding the function in UTM2], prepending P to its shortest UTM2-encoding yields a UTM1-encoding of exactly that combined length.
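In the usual notation this is just the invariance theorem (a restatement of the above, not an additional assumption): writing \(K_U(f)\) for the length of the shortest bitstring encoding \(f\) on \(U\),

\[
K_{\mathrm{UTM1}}(f) \;\le\; K_{\mathrm{UTM2}}(f) + \mathrm{len}(P) \quad \text{for every } f.
\]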
One of the consequences of this, however, is that this prefix-based encoding method is only optimal for functions whose prefix-free encodings (i.e. encodings that cannot be partitioned into substrings such that one of the substrings encodes another UTM) in UTM1 and UTM2 differ in length by more than len(P). And since len(P) is a measure of UTM2’s complexity relative to UTM1, it follows directly that if UTM2’s “coding scheme” is such that some function’s prefix-free encoding in UTM2 is shorter than its prefix-free encoding in UTM1 by some large constant (say, ~2^10^80), then len(P) itself must be on the order of 2^10^80—in other words, UTM2 must have an astronomical complexity relative to UTM1.
For any physically realizable universal computational system, that system can be analogized to UTM1 in the above analysis. If you have some behavioral policy that is e.g. deontological in nature, that behavioral policy can in principle be recast as an optimization criterion over universe histories; however, this criterion will in all likelihood have a prefix-free description in UTM1 of length ~2^10^80. And, crucially, there will be no UTM2 in whose encoding scheme the criterion in question has a prefix-free description of much less than ~2^10^80, without that UTM2 itself having a description complexity of ~2^10^80 relative to UTM1—meaning, there is no physically realizable system that can implement UTM2.