I’d assume that when we tell it, “optimize this company, in a way that we would accept, after a ton of deliberation”, this could be instead described as, “optimize this company, in a way that we would accept, after a ton of deliberation, where these terms are described using our ontology”
The problem shows up when the system finds itself acting in a regime where the notion of us (humans) “accepting” its optimizations becomes purely counterfactual, because no actual human is available to oversee its actions in that regime. Then the question of “would a human accept this outcome?” must ground itself somewhere in the system’s internal model of what those terms refer to, which (by hypothesis) need not remotely match their meanings in our native ontology.
This isn’t (as much of) a problem in regimes where an actual human overseer is present (setting aside concerns about actual human judgement being hackable because we don’t implement our idealized values, i.e. outer alignment), because there the system’s notion of ground truth actually is grounded by the validation of that overseer.
You can have a system that models the world using quantum field theory, task it with predicting the energetic fluctuations produced by a particular set of amplitude spikes corresponding to a human in our ontology, and it can perfectly well predict whether those fluctuations encode sounds or motor actions we’d interpret as indications of approval of disapproval—and as long as there’s an actual human there to be predicted, the system will do so without issue (again modulo outer alignment concerns).
But remove the human, and suddenly the system is no longer operating based on its predictions of the behavior of a real physical system, and is instead operating from some learned counterfactual representation consisting of proxies in its native QFT-style ontology which happened to coincide with the actual human’s behavior while the human was present. And that learned representation, in an ontology as alien as QFT, is (assuming the falsehood of the natural abstraction hypothesis) not going to look very much like the human we want it to look like.
I’m confused about what it means to “remove the human”, and why it’s so important whether the human is ‘removed’. Maybe if I try to nail down more parameters of the hypothetical, that will help with my confusion. For the sake of argument, can I assume...
That the AI is running computations involving quantum fields because it found that was the most effective way to make e.g. next-token predictions on its training set?
That the AI is in principle capable of running computations involving quantum fields to represent a genius philosopher?
If I can assume that stuff, then it feels like a fairly core task, abundantly stress-tested during training, to read off the genius philosopher’s spoken opinions about e.g. moral philosophy from the quantum fields. How else could quantum fields be useful for next-token predictions?
Another probe: Is alignment supposed to be hard in this hypothetical because the AI can’t represent human values in principle? Or is it supposed to be hard because it also has a lot of unsatisfactory representations of human values, and there’s no good method for finding a satisfactory needle in the unsatisfactory haystack? Or some other reason?
But remove the human, and suddenly the system is no longer operating based on its predictions of the behavior of a real physical system, and is instead operating from some learned counterfactual representation consisting of proxies in its native QFT-style ontology which happened to coincide with the actual human’s behavior while the human was present.
This sounds a lot like saying “it might fail to generalize”. Supposing we make a lot of progress on out-of-distribution generalization, is alignment getting any easier according to you? Wouldn’t that imply our systems are getting better at choosing proxies which generalize even when the human isn’t ‘present’?
I’m confused about what it means to “remove the human”, and why it’s so important whether the human is ‘removed’.
Because the human isn’t going to constantly be present for everything the system does after it’s deployed (unless for some reason it’s not deployed).
If I can assume that stuff, then it feels like a fairly core task, abundantly stress-tested during training, to read off the genius philosopher’s spoken opinions about e.g. moral philosophy from the quantum fields. How else could quantum fields be useful for next-token predictions?
Quantum fields are useful for an endless variety of things, from modeling genius philosophers to predicting lottery numbers. If your next-token prediction task involves any physically instantiated system, a model that uses QFT will be able to predict that system’s time-evolution with alacrity.
(Yes, this is computationally intractable, but we’re already in full-on hypothetical land with the QFT-based model to begin with. Remember, this is an exercise in showing what happens in the worst-case scenario for alignment, where the model’s native ontology completely diverges from our own.)
So we need not assume that predicting “the genius philosopher” is a core task. It’s enough to assume that the model is capable of it, among other things—which a QFT-based model certainly would be. Which, not so coincidentally, brings us to your next question:
Is alignment supposed to be hard in this hypothetical because the AI can’t represent human values in principle? Or is it supposed to be hard because it also has a lot of unsatisfactory representations of human values, and there’s no good method for finding a satisfactory needle in the unsatisfactory haystack? Or some other reason?
Consider how, during training, the human overseer (or genius philosopher, if you prefer) would have been pointed out to the model. We don’t have reliable access to its internal world-model, and even if we did we’d see blobs of amplitude and not much else. There’s no means, in that setting, of picking out the human and telling the model to unambiguously defer to that human.
What must happen instead, then, is something like next-token prediction: we perform gradient descent (or some other optimization method; it doesn’t really matter for the purposes of our story) on the model’s outputs, rewarding it when its outputs happen to match those of the human. The hope is that this will lead, in the limit, to the matching no longer occurring by happenstance—that if we train for long enough and in a varied enough set of situations, the best way for the model to produce outputs that track those of the human is to model that human, even in its QFT ontology.
But do we know for a fact that this will be the case? Even if it is, what happens when the overseer isn’t present to provide their actual feedback, as was never the case during training? What becomes the model’s referent then? We’d like to deploy it without an overseer, or in situations too complex for an overseer to understand. And whether the model’s behavior in those situations conforms to what the overseer would want, ideally, depends on what kinds of behind-the-scenes extrapolation the model is doing—which, if the model’s native ontology is something in which “human philosophers” are not basic objects, is liable to look very weird indeed.
This sounds a lot like saying “it might fail to generalize”.
Sort of, yes—but I’d call it “malgeneralization” rather than “misgeneralization”. It’s not failing to generalize, it’s just not generalizing the way you’d want it to.
Supposing we make a lot of progress on out-of-distribution generalization, is alignment getting any easier according to you? Wouldn’t that imply our systems are getting better at choosing proxies which generalize even when the human isn’t ‘present’?
Depends on what you mean by “progress”, and “out-of-distribution”. A powerful QFT-based model can make perfectly accurate predictions in any scenario you care to put it in, so it’s not like you’ll observe it getting things wrong. What experiments, and experimental outcomes, are you imagining here, such that those outcomes would provide evidence of “progress on out-of-distribution generalization”, when fundamentally the issue is expected to arise in situations where the experimenters are themselves absent (and which—crucially—is not a condition you can replicate as part of an experimental setup)?
Because the human isn’t going to constantly be present for everything the system does after it’s deployed (unless for some reason it’s not deployed).
I think it ought to be possible for someone to always be present. [I’m also not sure it would be necessary.]
So we need not assume that predicting “the genius philosopher” is a core task.
It’s not the genius philosopher that’s the core task, it’s the reading of their opinions out of a QFT-based simulation of them. As I understand this thought experiment, we’re doing next-token prediction on e.g. a book written by a philosopher, and in order to predict the next token using QFT, the obvious method is to use QFT to simulate the philosopher. But that’s not quite enough—you also need to read the next token out of that QFT-based simulation if you actually want to predict it. This sort of ‘reading tokens out of a QFT simulation’ thing would be very common, thus something the system gets good at in order to succeed at next-token prediction.
I think perhaps there’s more to your thought experiment than just alien abstractions, and it’s worth disentangling these assumptions. For one thing, in a standard train/dev/test setup, the model is arguably not really doing prediction, it’s doing retrodiction. It’s making ‘predictions’ about things which already happened in the past. The final model is chosen based on what retrodicts the data the best. Also, usually the data is IID rather than sequential—there’s no time component to the data points (unless it’s a time-series problem, which it usually isn’t). The fact that we’re choosing a model which retrodicts well is why the presence/absence of a human is generally assumed to be irrelevant, and emphasizing this factor sounds wacky to my ML engineer ears.
So basically I suspect what you’re really trying to claim here, which incidentally I’ve also seen John allude to elsewhere, is that the standard assumptions of machine learning involving retrodiction and IID data points may break down once your system gets smart enough. This is a possibility worth exploring, I just want to clarify that it seems orthogonal to the issue of alien abstractions. In principle one can imagine a system that heavily features QFT in its internal ontology yet still can be characterized as retrodicting on IID data, or a system with vanilla abstractions that can’t be characterized as retrodicting on IID data. I think exploring this in a post could be valuable, because it seems like an under-discussed source of disagreement between certain doomer-type people and mainstream ML folks.
I think it ought to be possible for someone to always be present. [I’m also not sure it would be necessary.]
I think I don’t understand what you’re imagining here. Are you imagining a human manually overseeing all outputs of something like ChatGPT, or Microsoft Copilot, before those outputs are sent to the end user (or, worse yet, put directly into production)?
[I also think I don’t understand why you make the bracketed claim you do, but perhaps hashing that out isn’t a conversational priority.]
As I understand this thought experiment, we’re doing next-token prediction on e.g. a book written by a philosopher, and in order to predict the next token using QFT, the obvious method is to use QFT to simulate the philosopher. But that’s not quite enough—you also need to read the next token out of that QFT-based simulation if you actually want to predict it.
It sounds like your understanding of the thought experiment differs from mine. If I were to guess, I’d guess that by “you” you’re referring to someone or something outside of the model, who has access to the model’s internals, and who uses that access to, as you say, “read” the next token out of the model’s ontology. However, this is not the setup we’re in with respect to actual models (with the exception perhaps of some fairly limited experiments in mechanistic interpretability)—and it’s also not the setup of the thought experiment, which (after all) is about precisely what happens when you can’t read things out of the model’s internal ontology, because it’s too alien to be interpreted.
In other words: “you” don’t read the next token out of the QFT simulation. The model is responsible for doing that translation work. How do we get it to do that, even though we don’t know how to specify the nature of the translation work, much less do it ourselves? Well, simple: in cases where we have access to the ground truth of the next token, e.g. because we’re having it predict an existing book passage, we simply penalize it whenever its output fails to match the next token in the book. In this way, the model can be incentivized to correctly predict whatever we want it to predict, even if we wouldn’t know how to tell it explicitly to do whatever it’s doing.
(The nature of this relationship—whereby humans train opaque algorithms to do things they wouldn’t themselves be able to write out as pseudocode—is arguably the essence of modern deep learning in toto.)
For one thing, in a standard train/dev/test setup, the model is arguably not really doing prediction, it’s doing retrodiction. It’s making ‘predictions’ about things which already happened in the past. The final model is chosen based on what retrodicts the data the best.
Yes, this is a reasonable description to my eyes. Moreover, I actually think it maps fairly well to the above description of how a QFT-style model might be trained to predict the next token of some body of text; in your terms, this is possible specifically because the text already exists, and retrodictions of that text can be graded based on how well they compare against the ground truth.
Also, usually the data is IID rather than sequential—there’s no time component to the data points (unless it’s a time-series problem, which it usually isn’t).
This, on the other hand, doesn’t sound right to me. Yes, there are certainly applications where the training regime produces IID data, but next-token prediction is pretty clearly not one of those? Later tokens are highly conditionally dependent on previous tokens, in a way that’s much closer to a time series than to some kind of IID process. Possibly part of the disconnect is that we’re imagining different applications entirely—which might also explain our differing intuitions w.r.t. deployment?
The fact that we’re choosing a model which retrodicts well is why the presence/absence of a human is generally assumed to be irrelevant, and emphasizing this factor sounds wacky to my ML engineer ears.
Right, so just to check that we’re on the same page: do we agree that after a (retrodictively trained) model is deployed for some use case other than retrodicting existing data—for generative use, say, or for use in some kind of online RL setup—then it’ll doing something other than retrodicting? And that in that situation, the source of (retrodictable) ground truth that was present during training—whether that was a book, a philosopher, or something else—will be absent?
If we do actually agree about that, then that distinction is really all I’m referring to! You can think of it as training set versus test set, to use a more standard ML analogy, except in this case the “test set” isn’t labeled at all, because no one labeled it in advance, and also it’s coming in from an unpredictable outside world rather than from a folder on someone’s hard drive.
Why does that matter? Well, because then we’re essentially at the mercy of the model’s generalization properties, in a way we weren’t while it was retrodicting the training set (or even the validation set, if one of those existed). If it gets anything wrong, there’s no longer any training signal or gradient to penalize it for being “wrong”—so the only remaining question is, just how likely is it to be “wrong”, after being trained for however long it was trained?
And that’s where the QFT model comes in. It says, actually, even if you train me for a good long while on a good amount of data, there are lots of ways for me to generalize “wrongly” from your perspective, if I’m modeling the universe at the level of quantum fields. Sure, I got all the retrodictions right while there was something to be retrodicted, but what exactly makes you think I did that by modeling the philosopher whose remarks I was being trained on?
Maybe I was predicting the soundwaves passing through a particularly region of air in the room he was located—or perhaps I was predicting the pattern of physical transistors in the segment of memory of a particular computer containing his works. Those physical locations in spacetime still exist, and now that I’m deployed, I continue to make predictions using those as my referent—except, the encodings I’m predicting there no longer resemble anything like coherent moral philosophy, or coherent anything, really.
The philosopher has left the room, or the computer’s memory has been reconfigured—so what exactly are the criteria by which I’m supposed to act now? Well, they’re going to be something, presumably—but they’re not going to be something explicit. They’re going to be something implicit to my QFT ontology, something that—back when the philosopher was there, during training—worked in tandem with the specifics of his presence, and the setup involving him, to produce accurate retrodictions of his judgements on various matters.
Now that that’s no longer the case, those same criteria describe some mathematical function that bears no meaningful correspondence to anything a human would recognize, valuable or not—but the function exists, and it can be maximized. Not much can be said about what maximizing that function might result in, except that it’s unlikely to look anything like “doing right according to the philosopher”.
That’s why the QFT example is important. A more plausible model, one that doesn’t think natively in terms of quantum amplitudes, permits the possibility of correctly compressing what we want it to compress—of learning to retrodict, not some strange physical correlates of the philosopher’s various motor outputs, but the actual philosopher’s beliefs as we would understand them. Whether that happens, or whether a QFT-style outcome happens instead, depends in large part on the inductive biases of the model’s architecture and the training process—inductive biases on which the natural abstraction hypothesis asserts a possible constraint.
If I were to guess, I’d guess that by “you” you’re referring to someone or something outside of the model, who has access to the model’s internals, and who uses that access to, as you say, “read” the next token out of the model’s ontology.
Was using a metaphorical “you”. Probably should’ve said something like “gradient descent will find a way to read the next token out of the QFT-based simulation”.
Yes, there are certainly applications where the training regime produces IID data, but next-token prediction is pretty clearly not one of those?
I suppose I should’ve said various documents are IID to be more clear. I would certainly guess they are.
Right, so just to check that we’re on the same page: do we agree that after a (retrodictively trained) model is deployed for some use case other than retrodicting existing data—for generative use, say, or for use in some kind of online RL setup—then it’ll doing something other than retrodicting?
Generally speaking, yes.
And that’s where the QFT model comes in. It says, actually, even if you train me for a good long while on a good amount of data, there are lots of ways for me to generalize “wrongly” from your perspective, if I’m modeling the universe at the level of quantum fields. Sure, I got all the retrodictions right while there was something to be retrodicted, but what exactly makes you think I did that by modeling the philosopher whose remarks I was being trained on?
Well, if we’re following standard ML best practices, we have a train set, a dev set, and a test set. The purpose of the dev set is to check and ensure that things are generalizing properly. If they aren’t generalizing properly, we tweak various hyperparameters of the model and retrain until they do generalize properly on the dev set. Then we do a final check on the test set to ensure we didn’t overfit the dev set. If you forgot or never learned this stuff, I highly recommend brushing up on it.
In principle we could construct a test set or dev set either before or after the model has been trained. It shouldn’t make a difference under normal circumstances. It sounds like maybe you’re discussing a scenario where the model has achieved a level of omniscience, and it does fine on data that was available during its training, because it’s able to read off of an omniscient world-model. But then it fails on data generated in the future, because the translation method for its omniscient world-model only works on artifacts that were present during training. Basically, the time at which the data was generated could constitute a hidden and unexpected source of distribution shift. Does that summarize the core concern?
(To be clear, this sort of acquired-omniscience is liable to sound kooky to many ML researchers. I think it’s worth stress-testing alignment proposals under these sort of extreme scenarios, but I’m not sure we should weight them heavily in terms of estimating our probability of success. In this particular scenario, the model’s performance would drop on data generated after training, and that would hurt the company’s bottom line, and they would have a strong financial incentive to fix it. So I don’t know if thinking about this is a comparative advantage for alignment researchers.)
BTW, the point about documents being IID was meant to indicate that there’s little incentive for the model to e.g. retrodict the coordinates of the server storing a particular document—the sort of data that could aid and incentivize omniscience to a greater degree.
In any case, I would argue that “accidental omniscience” characterizes the problem better than “alien abstractions”. As before, you can imagine an accidentally-omniscient model that uses vanilla abstractions, or a non-omniscient model that uses alien ones.
Well, if we’re following standard ML best practices, we have a train set, a dev set, and a test set. The purpose of the dev set is to check and ensure that things are generalizing properly. If they aren’t generalizing properly, we tweak various hyperparameters of the model and retrain until they do generalize properly on the dev set. Then we do a final check on the test set to ensure we didn’t overfit the dev set. If you forgot or never learned this stuff, I highly recommend brushing up on it.
(Just to be clear: yes, I know what training and test sets are, as well as dev sets/validation sets. You might notice I actually used the phrase “validation set” in my earlier reply to you, so it’s not a matter of guessing someone’s password—I’m quite familiar with these concepts, as someone who’s implemented ML models myself.)
Generally speaking, training, validation, and test datasets are all sourced the same way—in fact, sometimes they’re literally sourced from the same dataset, and the delineation between train/dev/test is introduced during training itself, by arbitrarily carving up the original dataset into smaller sets of appropriate size. This may capture the idea of “IID” you seem to appeal to elsewhere in your comment—that it’s possible to test the model’s generalization performance on some held-out subset of data from the same source(s) it was trained on.
In ML terms, what the thought experiment points to is a form of underlying distributional shift, one that isn’t (and can’t be) captured by “IID” validation or test datasets. The QFT model in particular highlights the extent to which your training process, however broad or inclusive from a parochial human standpoint, contains many incidental distributional correlates to your training signal which (1) exist in all of your data, including any you might hope to rely on to validate your model’s generalization performance, and (2) cease to correlate off-distribution, during deployment.
This can be caused by what you call “omniscience”, but it need not; there are other, more plausible distributional differences that might be picked up on by other kinds of models. But QFT is (as far as our current understanding of physics goes) very close to the base ontology of our universe, and so what is inferrable using QFT is naturally going to be very different from what is inferrable using some other (less powerful) ontology. QFT is a very powerful ontology!
If you want to call that “omniscience”, you can, although note that strictly speaking the model is still just working from inferences from training data. It’s just that, if you feed enough data to a model that can hold entire swaths of the physical universe inside of its metaphorical “head”, pretty soon hypotheses that involve the actual state of that universe will begin to outperform hypotheses that don’t, and which instead use some kind of lossy approximation of that state involving intermediary concepts like “intent”, “belief”, “agent”, “subjective state”, etc.
In principle we could construct a test set or dev set either before or after the model has been trained. It shouldn’t make a difference under normal circumstances. It sounds like maybe you’re discussing a scenario where the model has achieved a level of omniscience, and it does fine on data that was available during its training, because it’s able to read off of an omniscient world-model. But then it fails on data generated in the future, because the translation method for its omniscient world-model only works on artifacts that were present during training. Basically, the time at which the data was generated could constitute a hidden and unexpected source of distribution shift. Does that summarize the core concern?
You’re close; I’d say the concern is slightly worse than that. It’s that the “future data” never actually comes into existence, at any point. So the source of distributional shift isn’t just “the data is generated at the wrong time”, it’s “the data never gets externally generated to begin with, and you (the model) have to work with predictions of what the data counterfactually would have been, had it been generated”.
(This would be the case e.g. with any concept of “human approval” that came from a literal physical human or group of humans during training, and not after the system was deployed “in the wild”.)
In any case, I would argue that “accidental omniscience” characterizes the problem better than “alien abstractions”. As before, you can imagine an accidentally-omniscient model that uses vanilla abstractions, or a non-omniscient model that uses alien ones.
The problem is that “vanilla” abstractions are not the most predictively useful possible abstractions, if you’ve got access to better ones. And models whose ambient hypothesis space is broad enough to include better abstractions (from the standpoint of predictive accuracy) will gravitate towards those, as is incentivized by the outer form of the training task. QFT is the extreme example of a “better abstraction”, but in principle (if the natural abstraction hypothesis fails) there will be all sorts and shapes of abstractions, and some of them will be available to us, and some of them will be available to the model, and these sets will not fully overlap—which is a concern in worlds where different abstractions lead to different generalization properties.
QFT is the extreme example of a “better abstraction”, but in principle (if the natural abstraction hypothesis fails) there will be all sorts and shapes of abstractions, and some of them will be available to us, and some of them will be available to the model, and these sets will not fully overlap—which is a concern in worlds where different abstractions lead to different generalization properties.
Indeed. I think the key thing for me is, I expect the model to be strongly incentivized to have a solid translation layer from its internal ontology to e.g. English language, due to being trained on lots of English language data. Due to Occam’s Razor, I expect the internal ontology to be biased towards that of an English-language speaker.
It’s just that, if you feed enough data to a model that can hold entire swaths of the physical universe inside of its metaphorical “head”, pretty soon hypotheses that involve the actual state of that universe will begin to outperform hypotheses that don’t, and which instead use some kind of lossy approximation of that state involving intermediary concepts like “intent”, “belief”, “agent”, “subjective state”, etc.
I’m imagining something like: early in training the model makes use of those lossy approximations because they are a cheap/accessible way to improve its predictive accuracy. Later in training, assuming it’s being trained on the sort of gigantic scale that would allow it to hold swaths of the physical universe in its head, it loses those desired lossy abstractions due to catastrophic forgetting. Is that an OK way to operationalize your concern?
I’m still not convinced that this problem is a priority. It seems like a problem which will be encountered very late if ever, and will lead to ‘random’ failures on predicting future/counterfactual data in a way that’s fairly obvious.
The problem shows up when the system finds itself acting in a regime where the notion of us (humans) “accepting” its optimizations becomes purely counterfactual, because no actual human is available to oversee its actions in that regime. Then the question of “would a human accept this outcome?” must ground itself somewhere in the system’s internal model of what those terms refer to, which (by hypothesis) need not remotely match their meanings in our native ontology.
This isn’t (as much of) a problem in regimes where an actual human overseer is present (setting aside concerns about actual human judgement being hackable because we don’t implement our idealized values, i.e. outer alignment), because there the system’s notion of ground truth actually is grounded by the validation of that overseer.
You can have a system that models the world using quantum field theory, task it with predicting the energetic fluctuations produced by a particular set of amplitude spikes corresponding to a human in our ontology, and it can perfectly well predict whether those fluctuations encode sounds or motor actions we’d interpret as indications of approval of disapproval—and as long as there’s an actual human there to be predicted, the system will do so without issue (again modulo outer alignment concerns).
But remove the human, and suddenly the system is no longer operating based on its predictions of the behavior of a real physical system, and is instead operating from some learned counterfactual representation consisting of proxies in its native QFT-style ontology which happened to coincide with the actual human’s behavior while the human was present. And that learned representation, in an ontology as alien as QFT, is (assuming the falsehood of the natural abstraction hypothesis) not going to look very much like the human we want it to look like.
I’m confused about what it means to “remove the human”, and why it’s so important whether the human is ‘removed’. Maybe if I try to nail down more parameters of the hypothetical, that will help with my confusion. For the sake of argument, can I assume...
That the AI is running computations involving quantum fields because it found that was the most effective way to make e.g. next-token predictions on its training set?
That the AI is in principle capable of running computations involving quantum fields to represent a genius philosopher?
If I can assume that stuff, then it feels like a fairly core task, abundantly stress-tested during training, to read off the genius philosopher’s spoken opinions about e.g. moral philosophy from the quantum fields. How else could quantum fields be useful for next-token predictions?
Another probe: Is alignment supposed to be hard in this hypothetical because the AI can’t represent human values in principle? Or is it supposed to be hard because it also has a lot of unsatisfactory representations of human values, and there’s no good method for finding a satisfactory needle in the unsatisfactory haystack? Or some other reason?
This sounds a lot like saying “it might fail to generalize”. Supposing we make a lot of progress on out-of-distribution generalization, is alignment getting any easier according to you? Wouldn’t that imply our systems are getting better at choosing proxies which generalize even when the human isn’t ‘present’?
Because the human isn’t going to constantly be present for everything the system does after it’s deployed (unless for some reason it’s not deployed).
Quantum fields are useful for an endless variety of things, from modeling genius philosophers to predicting lottery numbers. If your next-token prediction task involves any physically instantiated system, a model that uses QFT will be able to predict that system’s time-evolution with alacrity.
(Yes, this is computationally intractable, but we’re already in full-on hypothetical land with the QFT-based model to begin with. Remember, this is an exercise in showing what happens in the worst-case scenario for alignment, where the model’s native ontology completely diverges from our own.)
So we need not assume that predicting “the genius philosopher” is a core task. It’s enough to assume that the model is capable of it, among other things—which a QFT-based model certainly would be. Which, not so coincidentally, brings us to your next question:
Consider how, during training, the human overseer (or genius philosopher, if you prefer) would have been pointed out to the model. We don’t have reliable access to its internal world-model, and even if we did we’d see blobs of amplitude and not much else. There’s no means, in that setting, of picking out the human and telling the model to unambiguously defer to that human.
What must happen instead, then, is something like next-token prediction: we perform gradient descent (or some other optimization method; it doesn’t really matter for the purposes of our story) on the model’s outputs, rewarding it when its outputs happen to match those of the human. The hope is that this will lead, in the limit, to the matching no longer occurring by happenstance—that if we train for long enough and in a varied enough set of situations, the best way for the model to produce outputs that track those of the human is to model that human, even in its QFT ontology.
But do we know for a fact that this will be the case? Even if it is, what happens when the overseer isn’t present to provide their actual feedback, as was never the case during training? What becomes the model’s referent then? We’d like to deploy it without an overseer, or in situations too complex for an overseer to understand. And whether the model’s behavior in those situations conforms to what the overseer would want, ideally, depends on what kinds of behind-the-scenes extrapolation the model is doing—which, if the model’s native ontology is something in which “human philosophers” are not basic objects, is liable to look very weird indeed.
Sort of, yes—but I’d call it “malgeneralization” rather than “misgeneralization”. It’s not failing to generalize, it’s just not generalizing the way you’d want it to.
Depends on what you mean by “progress”, and “out-of-distribution”. A powerful QFT-based model can make perfectly accurate predictions in any scenario you care to put it in, so it’s not like you’ll observe it getting things wrong. What experiments, and experimental outcomes, are you imagining here, such that those outcomes would provide evidence of “progress on out-of-distribution generalization”, when fundamentally the issue is expected to arise in situations where the experimenters are themselves absent (and which—crucially—is not a condition you can replicate as part of an experimental setup)?
I think it ought to be possible for someone to always be present. [I’m also not sure it would be necessary.]
It’s not the genius philosopher that’s the core task, it’s the reading of their opinions out of a QFT-based simulation of them. As I understand this thought experiment, we’re doing next-token prediction on e.g. a book written by a philosopher, and in order to predict the next token using QFT, the obvious method is to use QFT to simulate the philosopher. But that’s not quite enough—you also need to read the next token out of that QFT-based simulation if you actually want to predict it. This sort of ‘reading tokens out of a QFT simulation’ thing would be very common, thus something the system gets good at in order to succeed at next-token prediction.
I think perhaps there’s more to your thought experiment than just alien abstractions, and it’s worth disentangling these assumptions. For one thing, in a standard train/dev/test setup, the model is arguably not really doing prediction, it’s doing retrodiction. It’s making ‘predictions’ about things which already happened in the past. The final model is chosen based on what retrodicts the data the best. Also, usually the data is IID rather than sequential—there’s no time component to the data points (unless it’s a time-series problem, which it usually isn’t). The fact that we’re choosing a model which retrodicts well is why the presence/absence of a human is generally assumed to be irrelevant, and emphasizing this factor sounds wacky to my ML engineer ears.
So basically I suspect what you’re really trying to claim here, which incidentally I’ve also seen John allude to elsewhere, is that the standard assumptions of machine learning involving retrodiction and IID data points may break down once your system gets smart enough. This is a possibility worth exploring, I just want to clarify that it seems orthogonal to the issue of alien abstractions. In principle one can imagine a system that heavily features QFT in its internal ontology yet still can be characterized as retrodicting on IID data, or a system with vanilla abstractions that can’t be characterized as retrodicting on IID data. I think exploring this in a post could be valuable, because it seems like an under-discussed source of disagreement between certain doomer-type people and mainstream ML folks.
I think I don’t understand what you’re imagining here. Are you imagining a human manually overseeing all outputs of something like ChatGPT, or Microsoft Copilot, before those outputs are sent to the end user (or, worse yet, put directly into production)?
[I also think I don’t understand why you make the bracketed claim you do, but perhaps hashing that out isn’t a conversational priority.]
It sounds like your understanding of the thought experiment differs from mine. If I were to guess, I’d guess that by “you” you’re referring to someone or something outside of the model, who has access to the model’s internals, and who uses that access to, as you say, “read” the next token out of the model’s ontology. However, this is not the setup we’re in with respect to actual models (with the exception perhaps of some fairly limited experiments in mechanistic interpretability)—and it’s also not the setup of the thought experiment, which (after all) is about precisely what happens when you can’t read things out of the model’s internal ontology, because it’s too alien to be interpreted.
In other words: “you” don’t read the next token out of the QFT simulation. The model is responsible for doing that translation work. How do we get it to do that, even though we don’t know how to specify the nature of the translation work, much less do it ourselves? Well, simple: in cases where we have access to the ground truth of the next token, e.g. because we’re having it predict an existing book passage, we simply penalize it whenever its output fails to match the next token in the book. In this way, the model can be incentivized to correctly predict whatever we want it to predict, even if we wouldn’t know how to tell it explicitly to do whatever it’s doing.
(The nature of this relationship—whereby humans train opaque algorithms to do things they wouldn’t themselves be able to write out as pseudocode—is arguably the essence of modern deep learning in toto.)
Yes, this is a reasonable description to my eyes. Moreover, I actually think it maps fairly well to the above description of how a QFT-style model might be trained to predict the next token of some body of text; in your terms, this is possible specifically because the text already exists, and retrodictions of that text can be graded based on how well they compare against the ground truth.
This, on the other hand, doesn’t sound right to me. Yes, there are certainly applications where the training regime produces IID data, but next-token prediction is pretty clearly not one of those? Later tokens are highly conditionally dependent on previous tokens, in a way that’s much closer to a time series than to some kind of IID process. Possibly part of the disconnect is that we’re imagining different applications entirely—which might also explain our differing intuitions w.r.t. deployment?
Right, so just to check that we’re on the same page: do we agree that after a (retrodictively trained) model is deployed for some use case other than retrodicting existing data—for generative use, say, or for use in some kind of online RL setup—then it’ll doing something other than retrodicting? And that in that situation, the source of (retrodictable) ground truth that was present during training—whether that was a book, a philosopher, or something else—will be absent?
If we do actually agree about that, then that distinction is really all I’m referring to! You can think of it as training set versus test set, to use a more standard ML analogy, except in this case the “test set” isn’t labeled at all, because no one labeled it in advance, and also it’s coming in from an unpredictable outside world rather than from a folder on someone’s hard drive.
Why does that matter? Well, because then we’re essentially at the mercy of the model’s generalization properties, in a way we weren’t while it was retrodicting the training set (or even the validation set, if one of those existed). If it gets anything wrong, there’s no longer any training signal or gradient to penalize it for being “wrong”—so the only remaining question is, just how likely is it to be “wrong”, after being trained for however long it was trained?
And that’s where the QFT model comes in. It says, actually, even if you train me for a good long while on a good amount of data, there are lots of ways for me to generalize “wrongly” from your perspective, if I’m modeling the universe at the level of quantum fields. Sure, I got all the retrodictions right while there was something to be retrodicted, but what exactly makes you think I did that by modeling the philosopher whose remarks I was being trained on?
Maybe I was predicting the soundwaves passing through a particularly region of air in the room he was located—or perhaps I was predicting the pattern of physical transistors in the segment of memory of a particular computer containing his works. Those physical locations in spacetime still exist, and now that I’m deployed, I continue to make predictions using those as my referent—except, the encodings I’m predicting there no longer resemble anything like coherent moral philosophy, or coherent anything, really.
The philosopher has left the room, or the computer’s memory has been reconfigured—so what exactly are the criteria by which I’m supposed to act now? Well, they’re going to be something, presumably—but they’re not going to be something explicit. They’re going to be something implicit to my QFT ontology, something that—back when the philosopher was there, during training—worked in tandem with the specifics of his presence, and the setup involving him, to produce accurate retrodictions of his judgements on various matters.
Now that that’s no longer the case, those same criteria describe some mathematical function that bears no meaningful correspondence to anything a human would recognize, valuable or not—but the function exists, and it can be maximized. Not much can be said about what maximizing that function might result in, except that it’s unlikely to look anything like “doing right according to the philosopher”.
That’s why the QFT example is important. A more plausible model, one that doesn’t think natively in terms of quantum amplitudes, permits the possibility of correctly compressing what we want it to compress—of learning to retrodict, not some strange physical correlates of the philosopher’s various motor outputs, but the actual philosopher’s beliefs as we would understand them. Whether that happens, or whether a QFT-style outcome happens instead, depends in large part on the inductive biases of the model’s architecture and the training process—inductive biases on which the natural abstraction hypothesis asserts a possible constraint.
Was using a metaphorical “you”. Probably should’ve said something like “gradient descent will find a way to read the next token out of the QFT-based simulation”.
I suppose I should’ve said various documents are IID to be more clear. I would certainly guess they are.
Generally speaking, yes.
Well, if we’re following standard ML best practices, we have a train set, a dev set, and a test set. The purpose of the dev set is to check and ensure that things are generalizing properly. If they aren’t generalizing properly, we tweak various hyperparameters of the model and retrain until they do generalize properly on the dev set. Then we do a final check on the test set to ensure we didn’t overfit the dev set. If you forgot or never learned this stuff, I highly recommend brushing up on it.
In principle we could construct a test set or dev set either before or after the model has been trained. It shouldn’t make a difference under normal circumstances. It sounds like maybe you’re discussing a scenario where the model has achieved a level of omniscience, and it does fine on data that was available during its training, because it’s able to read off of an omniscient world-model. But then it fails on data generated in the future, because the translation method for its omniscient world-model only works on artifacts that were present during training. Basically, the time at which the data was generated could constitute a hidden and unexpected source of distribution shift. Does that summarize the core concern?
(To be clear, this sort of acquired-omniscience is liable to sound kooky to many ML researchers. I think it’s worth stress-testing alignment proposals under these sort of extreme scenarios, but I’m not sure we should weight them heavily in terms of estimating our probability of success. In this particular scenario, the model’s performance would drop on data generated after training, and that would hurt the company’s bottom line, and they would have a strong financial incentive to fix it. So I don’t know if thinking about this is a comparative advantage for alignment researchers.)
BTW, the point about documents being IID was meant to indicate that there’s little incentive for the model to e.g. retrodict the coordinates of the server storing a particular document—the sort of data that could aid and incentivize omniscience to a greater degree.
In any case, I would argue that “accidental omniscience” characterizes the problem better than “alien abstractions”. As before, you can imagine an accidentally-omniscient model that uses vanilla abstractions, or a non-omniscient model that uses alien ones.
(Just to be clear: yes, I know what training and test sets are, as well as dev sets/validation sets. You might notice I actually used the phrase “validation set” in my earlier reply to you, so it’s not a matter of guessing someone’s password—I’m quite familiar with these concepts, as someone who’s implemented ML models myself.)
Generally speaking, training, validation, and test datasets are all sourced the same way—in fact, sometimes they’re literally sourced from the same dataset, and the delineation between train/dev/test is introduced during training itself, by arbitrarily carving up the original dataset into smaller sets of appropriate size. This may capture the idea of “IID” you seem to appeal to elsewhere in your comment—that it’s possible to test the model’s generalization performance on some held-out subset of data from the same source(s) it was trained on.
In ML terms, what the thought experiment points to is a form of underlying distributional shift, one that isn’t (and can’t be) captured by “IID” validation or test datasets. The QFT model in particular highlights the extent to which your training process, however broad or inclusive from a parochial human standpoint, contains many incidental distributional correlates to your training signal which (1) exist in all of your data, including any you might hope to rely on to validate your model’s generalization performance, and (2) cease to correlate off-distribution, during deployment.
This can be caused by what you call “omniscience”, but it need not; there are other, more plausible distributional differences that might be picked up on by other kinds of models. But QFT is (as far as our current understanding of physics goes) very close to the base ontology of our universe, and so what is inferrable using QFT is naturally going to be very different from what is inferrable using some other (less powerful) ontology. QFT is a very powerful ontology!
If you want to call that “omniscience”, you can, although note that strictly speaking the model is still just working from inferences from training data. It’s just that, if you feed enough data to a model that can hold entire swaths of the physical universe inside of its metaphorical “head”, pretty soon hypotheses that involve the actual state of that universe will begin to outperform hypotheses that don’t, and which instead use some kind of lossy approximation of that state involving intermediary concepts like “intent”, “belief”, “agent”, “subjective state”, etc.
You’re close; I’d say the concern is slightly worse than that. It’s that the “future data” never actually comes into existence, at any point. So the source of distributional shift isn’t just “the data is generated at the wrong time”, it’s “the data never gets externally generated to begin with, and you (the model) have to work with predictions of what the data counterfactually would have been, had it been generated”.
(This would be the case e.g. with any concept of “human approval” that came from a literal physical human or group of humans during training, and not after the system was deployed “in the wild”.)
The problem is that “vanilla” abstractions are not the most predictively useful possible abstractions, if you’ve got access to better ones. And models whose ambient hypothesis space is broad enough to include better abstractions (from the standpoint of predictive accuracy) will gravitate towards those, as is incentivized by the outer form of the training task. QFT is the extreme example of a “better abstraction”, but in principle (if the natural abstraction hypothesis fails) there will be all sorts and shapes of abstractions, and some of them will be available to us, and some of them will be available to the model, and these sets will not fully overlap—which is a concern in worlds where different abstractions lead to different generalization properties.
Indeed. I think the key thing for me is, I expect the model to be strongly incentivized to have a solid translation layer from its internal ontology to e.g. English language, due to being trained on lots of English language data. Due to Occam’s Razor, I expect the internal ontology to be biased towards that of an English-language speaker.
I’m imagining something like: early in training the model makes use of those lossy approximations because they are a cheap/accessible way to improve its predictive accuracy. Later in training, assuming it’s being trained on the sort of gigantic scale that would allow it to hold swaths of the physical universe in its head, it loses those desired lossy abstractions due to catastrophic forgetting. Is that an OK way to operationalize your concern?
I’m still not convinced that this problem is a priority. It seems like a problem which will be encountered very late if ever, and will lead to ‘random’ failures on predicting future/counterfactual data in a way that’s fairly obvious.