I really do appreciate this being written up, but to the extent that this is intended to be a rebuttal to the sorts of counting arguments that I like, I think you would have basically no chance of passing my ITT here. From my perspective reading this post, it read to me like “I didn’t understand the counting argument, therefore it doesn’t make sense” which is (obviously) not very compelling to me. That being said, to give credit where credit is due, I think some people would make a more simplistic counting argument like the one you’re rebutting. So I’m not saying that you’re not rebutting anyone here, but you’re definitely not rebutting my position.
Edit: If you’re struggling to grasp the distinction I’m pointing to here, it might be worth trying the exercise I give further down in this thread, which points out where the argument in the post goes wrong in a very simple case, and/or looking at Ryan’s restatement of my mathematical argument below.
Edit: Another point of clarification here—my objection is not that there is a “finite bitstring case” and an “infinite bitstring case” and you should be using the “infinite bitstring case”. My objection is that the sort of finite bitstring analysis in this post does not yield any well-defined mathematical object at all, and certainly not one that would predict generalization.
Let’s work through how to properly reason about counting arguments:
When doing reasoning about simplicity priors, a really important thing to keep in mind is the relationship between infinite bitstring simplicity and finite bitstring simplicity. When you just start counting the ways in which the model can behave on unseen inputs and then saying that the more ways there are the more likely it is, what you’re implicitly computing there is actually an inverse simplicity prior: Consider two programs, one that takes n bits and then stops, and one that takes 2n bits to specify the necessary logic and then uses m further bits to fill in how it behaves on unseen inputs. Obviously the n bit program is simpler, but by your logic the 2n bit program would seem to be favored, because there are more ways to fill in those remaining m bits. But if you understand that we can recast everything into infinite bitstring complexity, then it’s clear that actually the n bit program is the one leaving n+m bits unspecified—even though those bits don’t do anything in that case, they’re still unspecified parts of the overall infinite bitstring.
Once we understand that relationship, it should become pretty clear why the overfitting argument doesn’t work: the overfit model is essentially the 2n model, where it takes more bits to specify the core logic, and then tries to “win” on the simplicity by having m unspecified bits of extra information. But that doesn’t really matter: what matters is the size of the core logic, and if there are simple patterns that can fit the data in n bits rather than 2n bits, you’ll learn those.
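To make the arithmetic concrete, here is a minimal numerical sketch of that comparison, assuming a prefix-free encoding so that fixing k bits of the infinite bitstring carries measure 2^-k (the particular values of n and m below are arbitrary):

```python
# Toy comparison of measure over infinite bitstrings, assuming a prefix-free
# encoding so that a fixed prefix of length k has measure 2**-k. The sizes
# n and m are arbitrary illustrative choices.

n, m = 10, 20

# Simple model: the core logic fits in n bits and everything after that is
# left unspecified, so the measure of its prefix is just 2**-n.
simple_measure = 2.0 ** -n

# "Overfit" family: 2n bits of core logic plus m bits that each get filled
# in some particular way. There are 2**m such programs, each of length
# 2n + m, so the family's total measure is 2**m * 2**-(2n + m) = 2**-(2n).
overfit_measure = (2 ** m) * 2.0 ** -(2 * n + m)

print(f"simple model measure : {simple_measure:.3e}")   # 2^-10
print(f"overfit family total : {overfit_measure:.3e}")  # 2^-20

# The naive "count the ways" intuition favors the overfit family because it
# has 2**m members, but the m free bits cancel out of the measure, and only
# the length of the core logic matters.
assert simple_measure > overfit_measure
```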
However, this doesn’t apply at all to the counting argument for deception. In fact, understanding this distinction properly is critically important to make the argument work. Let’s break it down:
Suppose my model has the following structure: an n bit world model, an m bit search procedure, and an x bit objective function. This isn’t a very realistic assumption, but it’ll be enough to make it clear why the counting argument here doesn’t make use of the same fallacious reasoning that would lead you to think that the 2n bit model was simpler.
We’ll assume that the deceptive and non-deceptive models require the same n+m bits of world modeling and search, so the only question is the objective function.
I would usually then make an argument here for why in most cases the simplest objective that leads to deception is simpler than the simplest objective that leads to alignment, but that’s just a simplicity argument, not a counting argument. Since we want to do the counting argument here, let’s assume that the simplest objective that leads to alignment is simpler than the simplest objective that leads to deception.
Okay, but now if the simplest objective that leads to alignment is simpler than the simplest objective that leads to deception, how could deception win? Well, the key is that the core logic necessary for deception is simpler: the only thing required for deception is a long-term objective, everything else is unspecified. So, mathematically, we have:
Complexity of simplest aligned objective: a
Complexity of simplest deceptive objective: l+b where l is the minimum necessary for any long-term objective and b is everything else necessary to implement some particular long-term objective.
We’re assuming that a<l+b, but that l<a.
Casting into infinite bitstring land, we see that the set of aligned objectives includes those with anything after the first a bits, whereas the set of deceptive objectives includes anything after the first l bits. Even though you don’t get a full program until you’re l+b bits deep, the complexity here is just l, because all the bits after the first l bits aren’t pinned down. So if we’re assuming that l<a, then deception wins.
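Writing the measures out explicitly, under the same assumption that pinning down k bits of the infinite bitstring carries measure 2^-k, and with the n+m bits of world model and search held fixed for both classes:

```latex
\begin{align*}
  \mu(\text{aligned})   &\propto 2^{-a}
    && \text{(the first $a$ bits are pinned down)} \\
  \mu(\text{deceptive}) &\propto 2^{-l}
    && \text{(only the first $l$ bits are pinned down)} \\
  \frac{\mu(\text{deceptive})}{\mu(\text{aligned})} &= 2^{\,a-l} > 1
    && \text{whenever $l < a$, even though $a < l+b$.}
\end{align*}
```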
Certainly, you could contest the assumption that l<a—and conversely I would even go further and say probably a>l+b—but either way the point is that this argument is totally sound given its assumptions.
At a high level, what I’m saying here is that counting arguments are totally valid and in fact strongly predict that you won’t learn to memorize, but only when you do them over infinite bitstrings, not when done over finite bitstrings. If we think about the simplicity of learning a line to fit a set of linear datapoints vs. the simplicity of memorizing everything, there are more ways to implement a line than there are to memorize, but only over infinite bitstrings. In the line case, the extra bits don’t do anything, whereas in the memorization case, they do, but that’s not a relevant distinction: they’re still unspecified bits, and what we’re doing is counting up the measure of the infinite bitstrings which implement that algorithm.
I think this analysis should also make clear what’s going on with the indifference principle here. The “indifference principle” in this context is about being indifferent across all infinite bitstrings—it’s not some arbitrary thing where you can carve up the space however you like and then say you’re indifferent across the different pieces—it’s a very precise notion that comes from theoretical computer science (though there is a question about what UTM to use; there you’re trying to get as close as possible to a program prior that would generalize well in practice given that we know ML generalizes well). The idea is that indifference across infinite bitstrings gives you a universal semi-measure, from which you can derive a universal prior (which you’re trying to select out of the space of all universal priors to match ML well). Of course, it’s certainly not the case that actual machine learning inductive biases are purely simplicity, or that they’re even purely indifferent across all concrete parameterizations, but I think it’s a reasonable first-pass assumption given facts like ML not generally overfitting as you note.
Looking at this more broadly, from my perspective, the fact that we don’t see overfitting is the entire reason why deceptive alignment is likely. The fact that models tend to learn simple patterns that fit the data rather than memorize a bunch of stuff is exactly why deception, a simple strategy that compresses a lot of data, might be a very likely thing for them to learn. If models were more likely to learn overfitting-style solutions, I would be much, much less concerned about deception—but of course, that would also mean they were less capable, so it’s not much solace.
Thanks for the reply. A couple remarks:
“indifference over infinite bitstrings” is a misnomer in an important sense, because it’s literally impossible to construct a normalized probability measure over infinite bitstrings that assigns equal probability to each one. What you’re talking about is the length weighted measure that assigns exponentially more probability mass to shorter programs. That’s definitely not an indifference principle, it’s baking in substantive assumptions about what’s more likely.
I don’t see why we should expect any of this reasoning about Turing machines to transfer over to neural networks at all, which is why I didn’t cast the counting argument in terms of Turing machines in the post. In the past I’ve seen you try to run counting or simplicity arguments in terms of parameters. I don’t think any of that works, but I at least take it more seriously than the Turing machine stuff.
If we’re really going to assume the Solomonoff prior here, then I may just agree with you that it’s malign in Christiano’s sense and could lead to scheming, but I take this to be a reductio of the idea that we can use Solomonoff as any kind of model for real world machine learning. Deep learning does not approximate Solomonoff in any meaningful sense.
Terminological point: it seems like you are using the term “simple” as if it has a unique and objective referent, namely Kolmogorov-simplicity. That’s definitely not how I use the term; for me it’s always relative to some subjective prior. Just wanted to make sure this doesn’t cause confusion.
“indifference over infinite bitstrings” is a misnomer in an important sense, because it’s literally impossible to construct a normalized probability measure over infinite bitstrings that assigns equal probability to each one. What you’re talking about is the length weighted measure that assigns exponentially more probability mass to shorter programs. That’s definitely not an indifference principle, it’s baking in substantive assumptions about what’s more likely.
No; this reflects a misunderstanding of how the universal prior is traditionally derived in information theory. We start by assuming that we are running our UTM over code such that every time the UTM looks at a new bit in the tape, it has equal probability of being a 1 or a 0 (that’s the indifference condition). That induces what’s called the universal semi-measure, from which we can derive the universal prior by enforcing a halting condition. The exponential nature of the prior simply falls out of that derivation.
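For reference, the shape of that derivation in symbols (this is textbook material, e.g. Li and Vitányi, not anything specific to my argument):

```latex
% Feed the (prefix-free) UTM U an infinite stream of fair coin flips and ask
% which program it ends up executing. Each bit is 0 or 1 with probability
% 1/2, so Pr[U reads exactly the program p] = 2^{-|p|}. Summing over the
% programs that halt with output x gives the universal semi-measure:
\[
  m(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|}.
\]
% This is only a semi-measure because non-halting programs leak probability
% mass; enforcing the halting condition and normalizing yields the universal
% prior, with the exponential 2^{-|p|} weighting already coming out of the
% coin-flip (indifference) step.
```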
I don’t see why we should expect any of this reasoning about Turing machines to transfer over to neural networks at all, which is why I didn’t cast the counting argument in terms of Turing machines in the post. In the past I’ve seen you try to run counting or simplicity arguments in terms of parameters. I don’t think any of that works, but I at least take it more seriously than the Turing machine stuff.
Some notes:
I am very skeptical of hand-wavy arguments about simplicity that don’t have formal mathematical backing. This is a very difficult area to reason about correctly and it’s easy to go off the rails if you’re trying to do so without relying on any formalism.
There are many, many ways to adjust the formalism to take into account various ways in which realistic neural network inductive biases are different than basic simplicity biases. My sense is that most of these changes generally don’t change the bottom-line conclusion, but if you have a concrete mathematical model that you’d like to present here that you think gives a different result, I’m all ears.
All of that being said, I’m absolutely with you that this whole space of trying to apply theoretical reasoning about inductive biases to concrete ML systems is quite fraught. But it’s even more fraught if you drop the math!
So I’m happy with turning to empirics instead, which is what I have actually done! I think our Sleeper Agents results, for example, empirically disprove the hypothesis that deceptive reasoning will be naturally regularized away (interestingly, we find that it does get regularized away for small models—but not for large models!).
I’m well aware of how it’s derived. I still don’t think it makes sense to call that an indifference prior, precisely because enforcing an uncomputable halting requirement induces an exponentially strong bias toward short programs. But this could become a terminological point.
I think relying on an obviously incorrect formalism is much worse than relying on no formalism at all. I also don’t think I’m relying on zero formalism. The literature on the frequency/spectral bias is quite rigorous, and is grounded in actual facts about how neural network architectures work.
I am very skeptical of hand-wavy arguments about simplicity that don’t have formal mathematical backing. This is a very difficult area to reason about correctly and it’s easy to go off the rails if you’re trying to do so without relying on any formalism.
I’m surprised by this. It seems to me like most of your reasoning about simplicity is either hand-wavy or only nominally formally backed by symbols which don’t (AFAICT) have much to do with the reality of neural networks. EG, your comments above:
I would usually then make an argument here for why in most cases the simplest objective that leads to deception is simpler than the simplest objective that leads to alignment, but that’s just a simplicity argument, not a counting argument. Since we want to do the counting argument here, let’s assume that the simplest objective that leads to alignment is simpler than the simplest objective that leads to deception.
Or the times you’ve talked about how there are “more” sycophants but only “one” saint.
There are many, many ways to adjust the formalism to take into account various ways in which realistic neural network inductive biases are different than basic simplicity biases. My sense is that most of these changes generally don’t change the bottom-line conclusion, but if you have a concrete mathematical model that you’d like to present here that you think gives a different result, I’m all ears.
This is a very strange burden of proof. It seems to me that you presented a specific model of how NNs work which is clearly incorrect, and instead of engaging with counterarguments showing that it doesn’t make sense, you want someone else to propose to you a similarly detailed model which you think is better. Presenting an alternative is a logically separate task from pointing out the problems in the model you gave.
I’m surprised by this. It seems to me like most of your reasoning about simplicity is either hand-wavy or only nominally formally backed by symbols which don’t (AFAICT) have much to do with the reality of neural networks.
The examples that you cite are from a LessWrong comment and a transcript of a talk that I gave. Of course when I’m presenting something in a context like that I’m not going to give the most formal version of it; that doesn’t mean that the informal hand-wavy arguments are the reasons why I believe what I believe.
Maybe a better objection there would be: then why haven’t you written up anything more careful and more formal? Which is a pretty fair objection, as I note here. But alas I only have so much time and it’s not my current focus.
Yes, but your original comment was presented as explaining “how to properly reason about counting arguments.” Do you no longer claim that to be the case? If you do still claim that, then I maintain my objection that you yourself used hand-wavy reasoning in that comment, and it seems incorrect to present that reasoning as unusually formally supported.
Another concern I have is that I don’t think you’re gaining anything by formality in this thread. As I understand your argument, I think your symbols are formalizations of hand-wavy intuitions (like the ability to “decompose” a network into the given pieces; the assumption that description length is meaningfully relevant to the NN prior; assumptions about informal notions of “simplicity” being realized in a given UTM prior). If anything, I think that the formality makes things worse because it makes it harder to evaluate or critique your claims.
I also don’t think I’ve seen an example of reasoning about deceptive alignment where I concluded that formality had helped the case, as opposed to obfuscated the case or lent the concern unearned credibility.
The main thing I was trying to show there is just that having the formalism prevents you from making logical mistakes in how to apply counting arguments in general, as I think was done in this post. So my comment is explaining how to use the formalism to avoid mistakes like that, not trying to work through the full argument for deceptive alignment.
It’s not that the formalism provides really strong evidence for deceptive alignment, it’s that it prevents you from making mistakes in your reasoning. It’s like plugging your argument into a proof-checker: it doesn’t check that your conclusion is correct, since the assumptions could be wrong, but it does check that your argument is valid.
Do you believe that the cited hand-wavy arguments are, at a high informal level, sound reason for belief in deceptive alignment? (It sounds like you don’t, going off of your original comment which seems to distance yourself from the counting arguments critiqued by the post.)
EDITed to remove last bit after reading elsewhere in thread.
I think they are valid if interpreted properly, but easy to misinterpret.
I think you should allocate time to devising clearer arguments, then. I am worried that lots of people are misinterpreting your arguments and then making significant life choices on the basis of their new beliefs about deceptive alignment, and I think we’d both prefer for that to not happen.
Were I not busy with all sorts of empirical stuff right now, I would consider prioritizing a project like that, but alas I expect to be too busy. I think it would be great if somebody else wanted to devote more time to working through the arguments in detail publicly, and I might encourage some of my mentees to do so.
empirically disprove the hypothesis that deceptive reasoning will be naturally regularized away (interestingly, we find that it does get regularized away for small models—but not for large models!).
You did not “empirically disprove” that hypothesis. You showed that if you explicitly train a backdoor for a certain behavior under certain regimes, then training on other behaviors will not cause catastrophic forgetting. You did not address the regime where the deceptive reasoning arises as instrumental to some other goal embedded in the network, or in a natural context (as you’re aware). I think that you did find a tiny degree of evidence about the question (it really is tiny IMO), but you did not find “disproof.”
Indeed, I predicted that people would incorrectly represent these results; so little time has passed!
I have a bunch of dread about the million conversations I will have to have with people explaining these results. I think that predictably, people will update as if they saw actual deceptive alignment, as opposed to something more akin to a “hard-coded” demo which was specifically designed to elicit the behavior and instrumental reasoning the community has been scared of. I think that people will predictably
...
[claim] that we’ve observed it’s hard to uproot deceptive alignment (even though “uprooting a backdoored behavior” and “pushing back against misgeneralization” are different things),
I’m quite aware that we did not see natural deceptive alignment, so I don’t think I’m misinterpreting my own results in the way you were predicting. Perhaps “empirically disprove” is too strong; I agree that our results are evidence but not definitive evidence. But I think they’re quite strong evidence and by far the strongest evidence available currently on the question of whether deception will be regularized away.
You didn’t claim it for deceptive alignment, but you claimed disproof of the idea that deceptive reasoning would be trained away, which is an important subcomponent of deceptive alignment. But your work provides no strong conclusions on that matter as it pertains to deceptive reasoning in general.
I think the presentation of your work (which, again, I like in many respects) would be strengthened if you clarified the comment which I responded to.
But I think they’re quite strong evidence and by far the strongest evidence available currently on the question of whether deception will be regularized away.
Because the current results only deal with backdoor removal, I personally think it’s outweighed by e.g. results on how well instruction-tuning generalizes.
But your work provides no strong conclusions on that matter as it pertains to deceptive reasoning in general.
I just disagree with this. Our chain of thought models do tons of very deceptive reasoning during safety training and the deceptiveness of that reasoning is totally unaffected by safety training, and in fact the deceptiveness increases in the case of adversarial training.
I said “Deceptive reasoning in general”, not the trainability of the backdoor behavior in your experimental setup. The issue isn’t just “what was the trainability of the surface behavior”, but “what is the trainability of the cognition implementing this behavior in-the-wild.” That is, the local inductive biases are probably far different for “parameterization implementing directly-trained deceptive reasoning” vs “parameterization outputting deceptive reasoning as an instrumental backchain from consequentialist reasoning.”
Imagine if I were arguing for some hypothetical results of mine, saying “The aligned models kept using aligned reasoning in the backdoor context, even as we trained them to be mean in other situations. That means we disproved the idea that aligned reasoning can be trained away with existing techniques, especially for larger models.” Would that be a valid argument given the supposed experimental result?
I’m referring to the deceptiveness of the reasoning displayed in the chain of thought during training time. So it’s not a generalization question, it’s about whether, if the model is using deceptive reasoning to compute its answer (as we think it is, since we think our models really are using their chain of thought), does that deceptive reasoning go away when the model has to use it to produce aligned answers during training? And we find that not only does it not go away, it actually gets more deceptive when you train it to produce aligned answers.
Here’s another fun way to think about this—you can basically cast what’s wrong here as an information theory exercise.
Problem:
Spot the step where the following argument goes wrong:
1. Suppose I have a dataset of finitely many points arranged in a line. Now, suppose I fit a (reasonable) universal prior to that dataset, and compare two cases: learning a line and learning to memorize each individual datapoint.
2. In the linear case, there is only one way to implement a line.
3. In the memorization case, I can implement whatever I want on the other datapoints in an arbitrary way.
4. Thus, since there are more ways to memorize than to learn a line, there should be greater total measure on memorization than on learning the line.
5. Therefore, you’ll learn to memorize each individual datapoint rather than learning to implement a line.
Solution:
By the logic of the post, step 4 is the problem, but I think step 4 is actually valid. The problem is step 2: there are actually a huge number of different ways to implement a line! Not only are there many different programs that implement the line in different ways, I can also just take the simplest program that does so and keep on adding comments or other extraneous bits. It’s totally valid to say that the algorithm with the most measure across all ways of implementing it is more likely, but you have to actually include all ways of implementing it, including all the cases where many of those bits are garbage and aren’t actually doing anything.
By the logic of the post, step 4 is the problem, but I think step 4 is actually valid. The problem is step 2: there are actually a huge number of different ways to implement a line! Not only are there many different programs that implement the line in different ways, I can also just take the simplest program that does so and keep on adding comments or other extraneous bits.
Evan, I wonder how much your disagreement is engaging with the OPs’ reasons. A draft of this post diagnosed the misprediction of both counting arguments as coming from trying to count functions instead of parameterizations of functions; one has to consider the compressivity of the parameter-function map (many different internal parameterizations map to the same external behavior). Given that the authors actually agree that step 2 is incorrect, does this change your views?
I would be much happier with that; I think that’s much more correct. Then, my objection would just be that at least the sort of counting arguments for deceptive alignment that I like are and always have been about parameterizations rather than functions. I agree that if you try to run a counting argument directly in function space it won’t work.
deceptive alignment that I like are and always have been about parameterizations rather than functions.
How can this be true, when you e.g. say there’s “only one saint”? That doesn’t make any sense with parameterizations due to internal invariances; there are uncountably many “saints” in parameter-space (insofar as I accept that frame, which I don’t really but that’s not the point here). I’d expect you to raise that as an obvious point in worlds where this really was about parameterizations.
And, as you’ve elsewhere noted, we don’t know enough about parameterizations to make counting arguments over them. So how are you doing that?
How can this be true, when you e.g. say there’s “only one saint”? That doesn’t make any sense with parameterizations due to internal invariances; there are uncountably many saints.
Because it was the transcript of a talk? I was trying to explain an argument at a very high level. And there’s certainly not uncountably many; in the infinite bitstring case there would be countably many, though usually I prefer priors that put caps on total computation such that there are only finitely many.
I’d expect you to raise that as an obvious point in worlds where this really was about parameterizations.
I don’t really appreciate the psychoanalysis here. I told you what I thought and think, and I have far more evidence about that than you do.
And, as you’ve elsewhere noted, we don’t know enough about parameterizations to make counting arguments over them. So how are you doing that?
As I’ve said, I usually try to take whatever the most realistic prior is that we can reason about at a high-level, e.g. a circuit prior or a speed prior.
From my perspective reading this post, it read to me like “I didn’t understand the counting argument, therefore it doesn’t make sense” which is (obviously) not very compelling to me.
I definitely appreciate how it can feel frustrating or bad when you feel that someone isn’t properly engaging with your ideas. However, I also feel frustrated by this statement. Your comment seems to have a tone of indignation that Quintin and Nora weren’t paying attention to what you wrote.
I myself expected you to respond to this post with some ML-specific reasoning about simplicity and measure of parameterizations, instead of your speculation about a relationship between the universal measure and inductive biases. I spoke with dozens of people about the ideas in OP’s post, and none of them mentioned arguments like the one you gave. I myself have spent years in the space and am also not familiar with this particular argument about bitstrings.
(EDIT: Having read Ryan’s comment, it now seems to me that you have exclusively made a simplicity argument without any counting involved, and an empirical claim about the relationship between description length of a mesa objective and the probability of SGD sampling a function which implements such an objective. Is this correct?)
If these are your real reasons for expecting deceptive alignment, that’s fine, but I think you’ve mentioned this rather infrequently. Your profile links to How likely is deceptive alignment?, which is an (introductory) presentation you gave. In that presentation, you make no mention of Turing machines, universal semimeasures, bitstrings, and so on. On a quick search, the closest you seem to come is the following:
We’re going to start with simplicity. Simplicity is about specifying the thing that you want in the space of all possible things. You can think about simplicity as “How much do you have to aim to hit the exact thing in the space of all possible models?” How many bits does it take to find the thing that you want in the model space? And so, as a first pass, we can understand simplicity by doing a counting argument, which is just asking, how many models are in each model class?[1]
But this is ambiguous (as can be expected for a presentation at this level). We could view this as “bitlength under a given decoding scheme, viewing an equivalence class over parameterizations as a set of possible messages” or “Shannon information (in bits) of a function induced by a given probability distribution over parameterizations” or something else entirely (perhaps having to do with infinite bitstrings).
My critique is not “this was ambiguous.” My critique is “how was anyone supposed to be aware of the ‘real’ argument which I (and many others) seem to now be encountering for the first time?”.
My objection is that the sort of finite bitstring analysis in this post does not yield any well-defined mathematical object at all, and certainly not one that would predict generalization.
This seems false? All that need be done is to formally define
$F := \{\, f : \mathbb{R}^n \to \mathbb{R}^m \mid f(x) = \mathrm{label}(x) \ \forall x \in X_{\mathrm{train}} \,\},$
which is the set of functions which (when e.g. greedily sampled) perfectly label the (categorical) training data $X_{\mathrm{train}}$, and we can parameterize such functions using the neural network parameter space. This yields a perfectly well-defined counting argument over F.
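As a minimal finite instance (the toy domain, target rule, and train/test split below are my own illustrative choices), F can even be enumerated directly:

```python
# Toy enumeration of F: all labelings of a small finite input space that
# agree with the training labels exactly. The domain, target rule, and
# train/test split are arbitrary illustrative choices.
from itertools import product

inputs = list(product([0, 1], repeat=4))     # 16 possible 4-bit inputs

def target(x):
    return x[0]                              # "true" rule: copy the first bit

train, test = inputs[:4], inputs[4:]         # 4 training points, 12 held out

# A function on this finite domain is just a labeling of all 16 inputs.
# Members of F must agree with the target on the training points; the
# held-out labels are unconstrained.
F = [{**{x: target(x) for x in train}, **dict(zip(test, labels))}
     for labels in product([0, 1], repeat=len(test))]

generalizing = [f for f in F if all(f[x] == target(x) for x in test)]
print(f"|F| = {len(F)}; members matching the target off-distribution: {len(generalizing)}")
# -> |F| = 4096; members matching the target off-distribution: 1
```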
I myself expected you to respond to this post with some ML-specific reasoning about simplicity and measure of parameterizations, instead of your speculation about a relationship between the universal measure and inductive biases. I spoke with dozens of people about the ideas in OP’s post, and none of them mentioned arguments like the one you gave. I myself have spent years in the space and am also not familiar with this particular argument about bitstrings.
That probably would have been my objection had the reasoning about priors in this post been sound, but since the reasoning was unsound, I turned to the formalism to try to show why it’s unsound.
If these are your real reasons for expecting deceptive alignment, that’s fine, but I think you’ve mentioned this rather infrequently.
I think you’re misunderstanding the nature of my objection. It’s not that Solomonoff induction is my real reason for believing in deceptive alignment or something, it’s that the reasoning in this post is mathematically unsound, and I’m using the formalism to show why. If I weren’t responding to this post specifically, I probably wouldn’t have brought up Solomonoff induction at all.
This yields a perfectly well-defined counting argument over F.
we can parameterize such functions using the neural network parameter space
I’m very happy with running counting arguments over the actual neural network parameter space; the problem there is just that I don’t think we understand it well enough to do so effectively.
You could instead try to put a measure directly over the functions in your setup, but the problem there is that function space really isn’t the right space to run a counting argument like this; you need to be in algorithm space, otherwise you’ll do things like what happens in this post where you end up predicting overfitting rather than generalization (which implies that you’re using a prior that’s not suitable for running counting arguments on).
I’m very happy with running counting arguments over the actual neural network parameter space; the problem there is just that I don’t think we understand it well enough to do so effectively.
This is basically my position as well.
The cited argument is a counting argument over the space of functions which achieve zero/low training loss.
You could instead try to put a measure directly over the functions in your setup, but the problem there is that function space really isn’t the right space to run a counting argument like this; you need to be in algorithm space, otherwise you’ll do things like what happens in this post where you end up predicting overfitting rather than generalization (which implies that you’re using a prior that’s not suitable for running counting arguments on).
Indeed, this is a crucial point that I think the post is trying to make. The cited counting arguments are counting functions instead of parameterizations. That’s the mistake (or, at least “a” mistake). I’m glad we agree it’s a mistake, but then I’m confused why you think that part of the post is unsound.
(Rereads)
Rereading the portion in question now, it seems that they changed it a lot since the draft. Personally, I think their argumentation is now weaker than it was before. The original argumentation clearly explained the mistake of counting functions instead of parameterizations, while the present post does not. It instead abstracts it as “an indifference principle”, where the reader has to do the work to realize that indifference over functions is inappropriate.
I’m sorry to hear that you think the argumentation is weaker now.
the reader has to do the work to realize that indifference over functions is inappropriate
I don’t think that indifference over functions in particular is inappropriate. I think indifference reasoning in general is inappropriate.
I’m very happy with running counting arguments over the actual neural network parameter space
I wouldn’t call the correct version of this a counting argument. The correct version uses the actual distribution used to initialize the parameters as a measure, and not e.g. the Lebesgue measure. This isn’t appealing to the indifference principle at all, and so in my book it’s not a counting argument. But this could be terminological.
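Here is a minimal sketch of the kind of measure I have in mind, using a toy ReLU network and Gaussian initialization purely as stand-ins (the architecture, sizes, and distribution are illustrative assumptions, and training is omitted entirely):

```python
# Monte Carlo sketch of "use the initialization distribution as the measure":
# sample random parameter vectors for a tiny network, record which boolean
# function each one computes on a small input grid, and tally how much
# initialization mass lands on each function. Training is omitted; this only
# illustrates measuring parameterizations rather than counting functions.
import numpy as np
from collections import Counter
from itertools import product

rng = np.random.default_rng(0)
inputs = np.array(list(product([0.0, 1.0], repeat=2)))   # 4 binary inputs

def sample_function(hidden=8):
    """Draw one random init and return the induced boolean function."""
    W1 = rng.normal(0.0, 1.0, size=(2, hidden))
    b1 = rng.normal(0.0, 1.0, size=hidden)
    w2 = rng.normal(0.0, 1.0, size=hidden)
    logits = np.maximum(inputs @ W1 + b1, 0.0) @ w2       # one ReLU layer
    return tuple(int(v > 0) for v in logits)              # behavior on all 4 inputs

counts = Counter(sample_function() for _ in range(20_000))
for fn, c in counts.most_common(5):
    print(fn, c / 20_000)

# Each behavior is weighted by the probability of sampling a parameterization
# that implements it, not by treating every distinct function (or every
# parameter region under Lebesgue measure) as equally likely.
```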
I found the explanation at the point where you introduce b confusing.
Here’s a revised version of the text there that would have been less confusing to me (assuming I haven’t made any errors):
Complexity of simplest deceptive objective: l+b, where l is the number of bits needed to select the part of the objective space which is just long-term objectives, and b is the additional number of bits required to select the simplest long-run objective. In other words, b is the minimum number of bits required to pick out a particular objective among all of the deceptive objectives (i.e. the simplest one).
We’re assuming that a<l+b, but that l<a. That is, the measure on the class of long-run objectives as a whole is higher than the measure on the (simplest) aligned objective.
Casting into infinite bitstring land, we see that the set of aligned objectives includes those with anything after the first a bits, whereas the set of deceptive objectives includes anything after the first l bits (as all of these are long-run objectives, though they differ). Even though you don’t get a full program until you’re l+b bits deep, the complexity here is just l, because all the bits after the first l bits aren’t pinned down. So if we’re assuming that l<a, then deception wins.
In this argument, you’ve implicitly assumed that there is only one function/structure which suffices for getting high enough training performance to be selected while also not being a long-term objective (i.e. not being a deceptive objective).
I could imagine this being basically right, but it certainly seems non-obvious to me.
E.g., there might be many things which are extremely highly correlated with reward that are represented in the world model. Or more generally, there are in principle many objective computations that result in trying as hard to get reward as the deceptive model would try.
(The potential for “multiple” objectives only makes a constant factor difference, but this is exactly the same as the case for deceptive objectives.)
The fact that these objectives generalize differently maybe implies they aren’t “aligned”, but in that case there is another key category of objectives: non-exactly-aligned and non-deceptive objectives. And obviously our AI isn’t going to be literally exactly aligned.
Note that non-exactly-aligned and non-deceptive objectives could suffice for safety in practice even if not perfectly aligned (e.g. due to myopia).
Yep, that’s exactly right. As always, once you start making more complex assumptions, things get more and more complicated, and it starts to get harder to model things in nice concrete mathematical terms. I would defend the value of having actual concrete mathematical models here—I think it’s super easy to confuse yourself in this domain if you aren’t doing that (e.g. as I think the confused reasoning about counting arguments in this post demonstrates). So I like having really concrete models, but only in the “all models are wrong, but some are useful” sense, as I talk about in “In defense of probably wrong mechanistic models.”
Also, the main point I was trying to make is that the counting argument is both sound and consistent with known generalization properties of machine learning (and in fact predicts them), and for that purpose I went with the simplest possible formalization of the counting argument.
Once we understand that relationship, it should become pretty clear why the overfitting argument doesn’t work: the overfit model is essentially the 2n model, where it takes more bits to specify the core logic, and then tries to “win” on the simplicity by having m unspecified bits of extra information. But that doesn’t really matter: what matters is the size of the core logic, and if there are simple patterns that can fit the data in n bits rather than 2n bits, you’ll learn those.
Under this picture, or any other simplicity bias, why do NNs with more parameters generalize better?
The idea is that when you make your network larger, you increase the size of the search space, so that the set of algorithms you’re considering grows to include algorithms which take more computation. That reduces the relative importance of the speed prior, but increases the relative importance of the simplicity prior, because your inductive biases are still selecting from among those algorithms according to the simplest pattern that fits the data, such that you get good generalization—and in fact even better generalization, because now the space of algorithms in which you’re searching for the simplest one is even larger.
Another way to think about this: if you really believe Occam’s razor, then any learning algorithm generalizes exactly to the extent that it approximates a simplicity prior—thus, since we know neural networks generalize better as they get larger, they must be approximating a simplicity prior better as they do so.
What in your view is the fundamental difference between world models and goals such that the former generalise well and the latter generalise poorly?
One can easily construct a model with a free parameter X and training data such that many choices of X will match the training data but results will diverge in situations not represented in the training data (for example, the model is a physical simulation and X tracks the state of some region in the simulation that will affect the learner’s environment later, but hasn’t done so during training). The simplest choice x_s could easily be wrong. We can even moralise the story: the model regards its job as predicting the output under x_s, and if the world happens to operate according to some other x' then the model doesn’t care. However, it’s still going to be ineffective in the future where the value of X matters.
I really do appreciate this being written up, but to the extent that this is intended to be a rebuttal to the sorts of counting arguments that I like, I think you would have basically no chance of passing my ITT here. From my perspective reading this post, it read to me like “I didn’t understand the counting argument, therefore it doesn’t make sense” which is (obviously) not very compelling to me. That being said, to give credit where credit is due, I think some people would make a more simplistic counting argument like the one you’re rebutting. So I’m not saying that you’re not rebutting anyone here, but you’re definitely not rebutting my position.
Edit: If you’re struggling to grasp the distinction I’m pointing to here, it might be worth trying this exercise pointing out where the argument in the post goes wrong in a very simple case and/or looking at Ryan’s restatement of my mathematical argument.
Edit: Another point of clarification here—my objection is not that there is a “finite bitstring case” and an “infinite bitstring case” and you should be using the “infinite bitstring case”. My objection is that the sort of finite bitstring analysis in this post does not yield any well-defined mathematical object at all, and certainly not one that would predict generalization.
Let’s work through how to properly reason about counting arguments:
When doing reasoning about simplicity priors, a really important thing to keep in mind is the relationship between infinite bitstring simplicity and finite bitstring simplicity. When you just start counting the ways in which the model can behave on unseen inputs and then saying that the more ways there are the more likely it is, what you’re implicitly computing there is actually an inverse simplicity prior: Consider two programs, one that takes n bits and then stops, and one that takes 2n bits to specify the necessary logic but then uses m remaining bits to fill in additional pieces for how it might behave on unseen inputs. Obviously the n bit program is simpler, but by your logic the 2n bit program would seem to be simpler because it leaves more things unspecified in terms of all the ways to fill in the remaining m bits. But if you understand that we can recast everything into infinite bitstring complexity, then it’s clear that actually the n bit program is leaving n+m bits unspecified—even though those bits don’t do anything in that case, they’re still unspecified parts of the overall infinite bitstring.
Once we understand that relationship, it should become pretty clear why the overfitting argument doesn’t work: the overfit model is essentially the 2n model, where it takes more bits to specify the core logic, and then tries to “win” on the simplicity by having m unspecified bits of extra information. But that doesn’t really matter: what matters is the size of the core logic, and if there are simple patterns that can fit the data in n bits rather than 2n bits, you’ll learn those.
However, this doesn’t apply at all to the counting argument for deception. In fact, understanding this distinction properly is critically important to make the argument work. Let’s break it down:
Suppose my model has the following structure: an n bit world model, an m bit search procedure, and an x bit objective function. This isn’t a very realistic assumption, but it’ll be enough to make it clear why the counting argument here doesn’t make use of the same fallacious reasoning that would lead you to think that the 2n bit model was simpler.
We’ll assume that the deceptive and non-deceptive models require the same n+m bits of world modeling and search, so the only question is the objective function.
I would usually then make an argument here for why in most cases the simplest objective that leads to deception is simpler than the simplest objective that leads to alignment, but that’s just a simplicity argument, not a counting argument. Since we want to do the counting argument here, let’s assume that the simplest objective that leads to alignment is simpler than the simplest objective that leads to deception.
Okay, but now if the simplest objective that leads to alignment is simpler than the simplest objective that leads to deception, how could deception win? Well, the key is that the core logic necessary for deception is simpler: the only thing required for deception is a long-term objective, everything else is unspecified. So, mathematically, we have:
Complexity of simplest aligned objective: a
Complexity of simplest deceptive objective: l+b where l is the minimum necessary for any long-term objective and b is everything else necessary to implement some particular long-term objective.
We’re assuming that a<l+b, but that l<a.
Casting into infinite bitstring land, we see that the set of aligned objectives includes those with anything after the first a bits, whereas the set of deceptive objectives includes anything after the first l bits. Even though you don’t get a full program until you’re l+b bits deep, the complexity here is just l, because all the bits after the first l bits aren’t pinned down. So if we’re assuming that l<a, then deception wins.
Certainly, you could contest the assumption that l<a—and conversely I would even go further and say probably a>l+b—but either way the point is that this argument is totally sound given its assumptions.
At a high level, what I’m saying here is that counting arguments are totally valid and in fact strongly predict that you won’t learn to memorize, but only when you do them over infinite bitstrings, not when done over finite bitstrings. If we think about the simplicity of learning a line to fit a set of linear datapoints vs. the simplicity of memorizing everything, there are more ways to implement a line than there are to memorize, but only over infinite bitstrings. In the line case, the extra bits don’t do anything, whereas in the memorization case, they do, but that’s not a relevant distinction: they’re still unspecified bits, and what we’re doing is counting up the measure of the infinite bitstrings which implement that algorithm.
I think this analysis should also make clear what’s going on with the indifference principle here. The “indifference principle” in this context is about being indifferent across all infinite bitstrings—it’s not some arbitrary thing where you can carve up the space however you like and then say you’re indifferent across the different pieces—it’s a very precise notion that comes from theoretical computer science (though there is a question about what UTM to use; there you’re trying to get as close as possible to a program prior that would generalize well in practice given that we know ML generalizes well). The idea is that indifference across infinite bitstrings gives you a universal semi-measure, from which you can derive a universal prior (which you’re trying to select out of the space of all universal priors to match ML well). Of course, it’s certainly not the case that actual machine learning inductive biases are purely simplicity, or that they’re even purely indifferent across all concrete parameterizations, but I think it’s a reasonable first-pass assumption given facts like ML not generally overfitting as you note.
Looking at this more broadly, from my perspective, the fact that we don’t see overfitting is the entire reason why deceptive alignment is likely. The fact that models tend to learn simple patterns that fit the data rather than memorize a bunch of stuff is exactly why deception, a simple strategy that compresses a lot of data, might be a very likely thing for them to learn. If models were more likely to learn overfitting-style solutions, I would be much, much less concerned about deception—but of course, that would also mean they were less capable, so it’s not much solace.
Thanks for the reply. A couple remarks:
“indifference over infinite bitstrings” is a misnomer in an important sense, because it’s literally impossible to construct a normalized probability measure over infinite bitstrings that assigns equal probability to each one. What you’re talking about is the length weighted measure that assigns exponentially more probability mass to shorter programs. That’s definitely not an indifference principle, it’s baking in substantive assumptions about what’s more likely.
I don’t see why we should expect any of this reasoning about Turing machines to transfer over to neural networks at all, which is why I didn’t cast the counting argument in terms of Turing machines in the post. In the past I’ve seen you try to run counting or simplicity arguments in terms of parameters. I don’t think any of that works, but I at least take it more seriously than the Turing machine stuff.
If we’re really going to assume the Solomonoff prior here, then I may just agree with you that it’s malign in Christiano’s sense and could lead to scheming, but I take this to be a reductio of the idea that we can use Solomonoff as any kind of model for real world machine learning. Deep learning does not approximate Solomonoff in any meaningful sense.
Terminological point: it seems like you are using the term “simple” as if it has a unique and objective referent, namely Kolmogorov-simplicity. That’s definitely not how I use the term; for me it’s always relative to some subjective prior. Just wanted to make sure this doesn’t cause confusion.
No; this reflects a misunderstanding of how the universal prior is traditionally derived in information theory. We start by assuming that we are running our UTM over code such that every time the UTM looks at a new bit in the tape, it has equal probability of being a 1 or a 0 (that’s the indifference condition). That induces what’s called the universal semi-measure, from which we can derive the universal prior by enforcing a halting condition. The exponential nature of the prior simply falls out of that derivation.
Some notes:
I am very skeptical of hand-wavy arguments about simplicity that don’t have formal mathematical backing. This is a very difficult area to reason about correctly and it’s easy to go off the rails if you’re trying to do so without relying on any formalism.
There are many, many ways to adjust the formalism to take into account various ways in which realistic neural network inductive biases are different than basic simplicity biases. My sense is that most of these changes generally don’t change the bottom-line conclusion, but if you have a concrete mathematical model that you’d like to present here that you think gives a different result, I’m all ears.
All of that being said, I’m absolutely with you that this whole space of trying to apply theoretical reasoning about inductive biases to concrete ML systems is quite fraught. But it’s even more fraught if you drop the math!
So I’m happy with turning to empirics instead, which is what I have actually done! I think our Sleeper Agents results, for example, empirically disprove the hypothesis that deceptive reasoning will be naturally regularized away (interestingly, we find that it does get regularized away for small models—but not for large models!).
I’m well aware of how it’s derived. I still don’t think it makes sense to call that an indifference prior, precisely because enforcing an uncomputable halting requirement induces an exponentially strong bias toward short programs. But this could become a terminological point.
I think relying on an obviously incorrect formalism is much worse than relying on no formalism at all. I also don’t think I’m relying on zero formalism. The literature on the frequency/spectral bias is quite rigorous, and is grounded in actual facts about how neural network architectures work.
I’m surprised by this. It seems to me like most of your reasoning about simplicity is either hand-wavy or only nominally formally backed by symbols which don’t (AFAICT) have much to do with the reality of neural networks. EG, your comments above:
Or the times you’ve talked about how there are “more” sycophants but only “one” saint.
This is a very strange burden of proof. It seems to me that you presented a specific model of how NNs work which is clearly incorrect, and instead of processing counterarguments that it doesn’t make sense, you want someone else to propose to you a similarly detailed model which you think is better. Presenting an alternative is a logically separate task from pointing out the problems in the model you gave.
The examples that you cite are from a LessWrong comment and a transcript of a talk that I gave. Of course when I’m presenting something in a context like that I’m not going to give the most formal version of it; that doesn’t mean that the informal hand-wavy arguments are the reasons why I believe what I believe.
Maybe a better objection there would be: then why haven’t you written up anything more careful and more formal? Which is a pretty fair objection, as I note here. But alas I only have so much time and it’s not my current focus.
Yes, but your original comment was presented as explaining “how to properly reason about counting arguments.” Do you no longer claim that to be the case? If you do still claim that, then I maintain my objection that you yourself used hand-wavy reasoning in that comment, and it seems incorrect to present that reasoning as unusually formally supported.
Another concern I have is, I don’t think you’re gaining anything by formality in this thread. As I understand your argument, I think your symbols are formalizations of hand-wavy intuitions (like the ability to “decompose” a network into the given pieces; the assumption that description length is meaningfully relevant to the NN prior; assumptions about informal notions of “simplicity” being realized in a given UTM prior). If anything, I think that the formality makes things worse because it makes it harder to evaluate or critique your claims.
I also don’t think I’ve seen an example of reasoning about deceptive alignment where I concluded that formality had helped the case, as opposed to obfuscated the case or lent the concern unearned credibility.
The main thing I was trying to show there is just that having the formalism prevents you from making logical mistakes in how to apply counting arguments in general, as I think was done in this post. So my comment is explaining how to use the formalism to avoid mistakes like that, not trying to work through the full argument for deceptive alignment.
It’s not that the formalism provides really strong evidence for deceptive alignment, it’s that it prevents you from making mistakes in your reasoning. It’s like plugging your argument into a proof-checker: it doesn’t check that your argument is correct, since the assumptions could be wrong, but it does check that your argument is sound.
Do you believe that the cited hand-wavy arguments are, at a high informal level, sound reason for belief in deceptive alignment? (It sounds like you don’t, going off of your original comment which seems to distance yourself from the counting arguments critiqued by the post.)
EDITed to remove last bit after reading elsewhere in thread.
I think they are valid if interpreted properly, but easy to misinterpret.
I think you should allocate time to devising clearer arguments, then. I am worried that lots of people are misinterpreting your arguments and then making significant life choices on the basis of their new beliefs about deceptive alignment, and I think we’d both prefer for that to not happen.
Were I not busy with all sorts of empirical stuff right now, I would consider prioritizing a project like that, but alas I expect to be too busy. I think it would be great if somebody else wanted devote more time to working through the arguments in detail publicly, and I might encourage some of my mentees to do so.
You did not “empirically disprove” that hypothesis. You showed that if you explicitly train a backdoor for a certain behavior under certain regimes, then training on other behaviors will not cause catastrophic forgetting. You did not address the regime where the deceptive reasoning arises as instrumental to some other goal embedded in the network, or in a natural context (as you’re aware). I think that you did find a tiny degree of evidence about the question (it really is tiny IMO), but you did not find “disproof.”
Indeed, I predicted that people would incorrectly represent these results; so little time has passed!
I’m quite aware that we did not see natural deceptive alignment, so I don’t think I’m misinterpreting my own results in the way you were predicting. Perhaps “empirically disprove” is too strong; I agree that our results are evidence but not definitive evidence. But I think they’re quite strong evidence and by far the strongest evidence available currently on the question of whether deception will be regularized away.
You didn’t claim it for deceptive alignment, but you claimed disproof of the idea that deceptive reasoning would be trained away, which is an important subcomponent of deceptive alignment. But your work provides no strong conclusions on that matter as it pertains to deceptive reasoning in general.
I think the presentation of your work (which, again, I like in many respects) would be strengthened if you clarified the comment which I responded to.
Because the current results only deal with backdoor removal, I personally think they're outweighed by, e.g., results on how well instruction-tuning generalizes.
I just disagree with this. Our chain-of-thought models do tons of very deceptive reasoning during safety training, the deceptiveness of that reasoning is totally unaffected by that training, and in fact the deceptiveness increases in the case of adversarial training.
I said “Deceptive reasoning in general”, not the trainability of the backdoor behavior in your experimental setup. The issue isn’t just “what was the trainability of the surface behavior”, but “what is the trainability of the cognition implementing this behavior in-the-wild.” That is, the local inductive biases are probably far different for “parameterization implementing directly-trained deceptive reasoning” vs “parameterization outputting deceptive reasoning as an instrumental backchain from consequentialist reasoning.”
Imagine if I were arguing for some hypothetical results of mine, saying “The aligned models kept using aligned reasoning in the backdoor context, even as we trained them to be mean in other situations. That means we disproved the idea that aligned reasoning can be trained away with existing techniques, especially for larger models.” Would that be a valid argument given the supposed experimental result?
I'm referring to the deceptiveness of the reasoning displayed in the chain of thought during training time. So it's not a generalization question; it's about whether, if the model is using deceptive reasoning to compute its answer (as we think it is, since we think our models really are using their chain of thought), that deceptive reasoning goes away when the model has to use it to produce aligned answers during training. And we find that not only does it not go away, it actually gets more deceptive when you train the model to produce aligned answers.
Here’s another fun way to think about this—you can basically cast what’s wrong here as an information theory exercise.
Problem:
Solution:
By the logic of the post, step 4 is the problem, but I think step 4 is actually valid. The problem is step 2: there are actually a huge number of different ways to implement a line! Not only are there many different programs that implement the line in different ways, I can also just take the simplest program that does so and keep on adding comments or other extraneous bits. It’s totally valid to say that the algorithm with the most measure across all ways of implementing it is more likely, but you have to actually include all ways of implementing it, including all the cases where many of those bits are garbage and aren’t actually doing anything.
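As a rough sketch of that bookkeeping (a toy model assuming a uniform prior over length-$L$ bitstring programs, not the actual neural network prior): if the shortest program implementing a function $f$ has length $n$ and the remaining $L - n$ bits can be filled with arbitrary garbage without changing behavior, then those implementations alone contribute

$$P(f) \;\ge\; 2^{L-n} \cdot 2^{-L} \;=\; 2^{-n},$$

so functions with shorter core implementations pick up exponentially more implementations, and hence more measure; counting only one implementation per function throws that factor away.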
Evan, I wonder how much your disagreement is engaging with the OPs' reasons. A draft of this post attributed the misprediction of both counting arguments to their counting functions instead of parameterizations of functions; one has to consider the compressivity of the parameter-function map (many different internal parameterizations map to the same external behavior). Given that the authors actually agree that 2 is incorrect, does this change your views?
I would be much happier with that; I think that’s much more correct. Then, my objection would just be that at least the sort of counting arguments for deceptive alignment that I like are and always have been about parameterizations rather than functions. I agree that if you try to run a counting argument directly in function space it won’t work.
See also discussion here.
How can this be true, when you e.g. say there’s “only one saint”? That doesn’t make any sense with parameterizations due to internal invariances; there are uncountably many “saints” in parameter-space (insofar as I accept that frame, which I don’t really but that’s not the point here). I’d expect you to raise that as an obvious point in worlds where this really was about parameterizations.
And, as you’ve elsewhere noted, we don’t know enough about parameterizations to make counting arguments over them. So how are you doing that?
Because it was the transcript of a talk? I was trying to explain an argument at a very high level. And there are certainly not uncountably many; in the infinite bitstring case there would be countably many, though usually I prefer priors that put caps on total computation such that there are only finitely many.
I don’t really appreciate the psychoanalysis here. I told you what I thought and think, and I have far more evidence about that than you do.
As I’ve said, I usually try to take whatever the most realistic prior is that we can reason about at a high-level, e.g. a circuit prior or a speed prior.
FWIW I object to 2, 3, and 4, and maybe also 1.
Another frame that might be useful:
There's a difference between the number of mathematical functions that implement a set of requirements and the number of programs that implement the set of requirements.
Simplicity is about the latter, not the former.
The existence of a large number of programs that produce the exact same mathematical function contributes towards simplicity.
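To put the last point in symbols (a prefix-free universal machine $U$ is assumed here purely for illustration): the prior mass of a function $f$ is

$$m(f) \;=\; \sum_{p \,:\, U(p) = f} 2^{-|p|},$$

so the quantity that matters for a set of requirements is the total mass of the functions satisfying them, not how many such functions there are; a function realized by many programs gets more mass, which is how program multiplicity feeds into simplicity.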
I definitely appreciate how it can feel frustrating or bad when you feel that someone isn’t properly engaging with your ideas. However, I also feel frustrated by this statement. Your comment seems to have a tone of indignation that Quintin and Nora weren’t paying attention to what you wrote.
I myself expected you to respond to this post with some ML-specific reasoning about simplicity and measure of parameterizations, instead of your speculation about a relationship between the universal measure and inductive biases. I spoke with dozens of people about the ideas in OP’s post, and none of them mentioned arguments like the one you gave. I myself have spent years in the space and am also not familiar with this particular argument about bitstrings.
(EDIT: Having read Ryan’s comment, it now seems to me that you have exclusively made a simplicity argument without any counting involved, and an empirical claim about the relationship between description length of a mesa objective and the probability of SGD sampling a function which implements such an objective. Is this correct?)
If these are your real reasons for expecting deceptive alignment, that’s fine, but I think you’ve mentioned this rather infrequently. Your profile links to How likely is deceptive alignment?, which is an (introductory) presentation you gave. In that presentation, you make no mention of Turing machines, universal semimeasures, bitstrings, and so on. On a quick search, the closest you seem to come is the following:
But this is ambiguous (as can be expected for a presentation at this level). We could view this as “bitlength under a given decoding scheme, viewing an equivalence class over parameterizations as a set of possible messages” or “Shannon information (in bits) of a function induced by a given probability distribution over parameterizations” or something else entirely (perhaps having to do with infinite bitstrings).
My critique is not “this was ambiguous.” My critique is “how was anyone supposed to be aware of the ‘real’ argument which I (and many others) seem to now be encountering for the first time?”.
This seems false? All that needs to be done is to formally define
$$F := \{\, f : \mathbb{R}^n \to \mathbb{R}^m \mid f(x) = \mathrm{label}(x) \;\; \forall x \in X_{\mathrm{train}} \,\},$$
which is the set of functions which (when e.g. greedily sampled) perfectly label the (categorical) training data $X_{\mathrm{train}}$, and we can parameterize such functions using the neural network parameter space. This yields a perfectly well-defined counting argument over $F$.
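As a toy, discrete rendering of that definition (boolean functions on three-bit inputs and made-up training labels stand in for the $\mathbb{R}^n \to \mathbb{R}^m$ setting), both the set and the count are perfectly concrete:

```python
from itertools import product

# Toy stand-in for F: all boolean functions on 3-bit inputs that perfectly
# fit a small training set. (Illustrative only -- the definition above is
# over R^n -> R^m, not boolean functions.)
inputs = list(product([0, 1], repeat=3))             # the 8 possible inputs
train = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1}   # hypothetical training labels

# Represent a "function" as a tuple of outputs, one per input.
all_functions = list(product([0, 1], repeat=len(inputs)))
F = [f for f in all_functions
     if all(f[inputs.index(x)] == y for x, y in train.items())]

print(len(all_functions), len(F))  # 256 functions total, 32 of them fit the training data
# Counting members of F weights every labeling of the 5 unseen inputs
# equally -- that is the indifference-over-functions step in question.
```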
This seems to be exactly the counting argument the post is critiquing, by the way.
That probably would have been my objection had the reasoning about priors in this post been sound, but since the reasoning was unsound, I turned to the formalism to try to show why it’s unsound.
I think you’re misunderstanding the nature of my objection. It’s not that Solomonoff induction is my real reason for believing in deceptive alignment or something, it’s that the reasoning in this post is mathematically unsound, and I’m using the formalism to show why. If I weren’t responding to this post specifically, I probably wouldn’t have brought up Solomonoff induction at all.
I’m very happy with running counting arguments over the actual neural network parameter space; the problem there is just that I don’t think we understand it well enough to do so effectively.
You could instead try to put a measure directly over the functions in your setup, but the problem there is that function space really isn't the right space to run a counting argument like this. You need to be in algorithm space; otherwise you'll end up doing what happens in this post, where you predict overfitting rather than generalization (which implies that you're using a prior that isn't suitable for running counting arguments over).
This is basically my position as well.
The cited argument is a counting argument over the space of functions which achieve zero/low training loss.
Indeed, this is a crucial point that I think the post is trying to make. The cited counting arguments are counting functions instead of parameterizations. That’s the mistake (or, at least “a” mistake). I’m glad we agree it’s a mistake, but then I’m confused why you think that part of the post is unsound.
(Rereads)
Rereading the portion in question now, it seems that they changed it a lot since the draft. Personally, I think their argumentation is now weaker than it was before. The original argumentation clearly explained the mistake of counting functions instead of parameterizations, while the present post does not. It instead abstracts it as “an indifference principle”, where the reader has to do the work to realize that indifference over functions is inappropriate.
I’m sorry to hear that you think the argumentation is weaker now.
I don’t think that indifference over functions in particular is inappropriate. I think indifference reasoning in general is inappropriate.
I wouldn’t call the correct version of this a counting argument. The correct version uses the actual distribution used to initialize the parameters as a measure, and not e.g. the Lebesgue measure. This isn’t appealing to the indifference principle at all, and so in my book it’s not a counting argument. But this could be terminological.
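For concreteness, here is a sketch of what using the initialization distribution as the measure could look like in a toy setting (a tiny made-up dataset, a Gaussian stand-in for the real initialization distribution, and rejection sampling in place of SGD; this illustrates the idea rather than anyone's actual procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny made-up dataset: three labeled training points and one held-out point.
X_train = np.array([[0., 0.], [1., 0.], [0., 1.]])
y_train = np.array([-1., 1., 1.])
x_test = np.array([1., 1.])

def mlp(params, x):
    """One-hidden-layer tanh network; the sign of the output is the label."""
    W1, b1, W2, b2 = params
    return np.tanh(x @ W1 + b1) @ W2 + b2

def sample_params(width=16):
    """Draw parameters from a stand-in initialization distribution."""
    return (rng.normal(0, 1, (2, width)), rng.normal(0, 1, width),
            rng.normal(0, 1, width), rng.normal(0, 1))

# Restrict the initialization measure to parameterizations that already fit
# the training labels, then see how that restricted measure splits the
# held-out point. The weighting comes from the distribution over
# parameterizations, not from counting functions with equal weight.
consistent, test_positive = 0, 0
for _ in range(50_000):
    params = sample_params()
    if np.all(np.sign([mlp(params, x) for x in X_train]) == y_train):
        consistent += 1
        test_positive += mlp(params, x_test) > 0

print(consistent, test_positive / max(consistent, 1))
```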
I found the explanation at the point where you introduce b confusing.
Here’s a revised version of the text there that would have been less confusing to me (assuming I haven’t made any errors):
Yep, I endorse that text as being equivalent to what I wrote; sorry if my language was a bit confusing.
In this argument, you've implicitly assumed that there is only one function/structure which suffices for getting high enough training performance to be selected while also not being a long-term objective (aka a deceptive objective).
I could imagine this being basically right, but it certainly seems non-obvious to me.
E.g., there might be many things which are extremely highly correlated with reward that are represented in the world model. Or more generally, there are in principle many objective computations that result in trying as hard to get reward as the deceptive model would try.
(The potential for “multiple” objectives only makes a constant factor difference, but this is exactly the same as the case for deceptive objectives.)
The fact that these objectives generalize differently maybe implies they aren’t “aligned”, but in that case there is another key category of objectives: non-exactly-aligned and non-deceptive objectives. And obviously our AI isn’t going to be literally exactly aligned.
Note that non-exactly-aligned and non-deceptive objectives could suffice for safety in practice even if not perfectly aligned (e.g. due to myopia).
Yep, that’s exactly right. As always, once you start making more complex assumptions, things get more and more complicated, and it starts to get harder to model things in nice concrete mathematical terms. I would defend the value of having actual concrete mathematical models here—I think it’s super easy to confuse yourself in this domain if you aren’t doing that (e.g. as I think the confused reasoning about counting arguments in this post demonstrates). So I like having really concrete models, but only in the “all models are wrong, but some are useful” sense, as I talk about in “In defense of probably wrong mechanistic models.”
Also, the main point I was trying to make is that the counting argument is both sound and consistent with known generalization properties of machine learning (and in fact predicts them), and for that purpose I went with the simplest possible formalization of the counting argument.
Under this picture, or any other simplicity bias, why do NNs with more parameters generalize better?
Paradoxically, I think larger neural networks are more simplicity-biased.
The idea is that when you make your network larger, you increase the size of the search space, so the set of algorithms you're considering grows to include algorithms that take more computation. That reduces the relative importance of the speed prior but increases the relative importance of the simplicity prior, because your inductive biases are still selecting among those algorithms according to the simplest pattern that fits the data, such that you get good generalization. In fact you get even better generalization, because the space of algorithms in which you're searching for the simplest one is now even larger.
Another way to think about this: if you really believe Occam’s razor, then any learning algorithm generalizes exactly to the extent that it approximates a simplicity prior—thus, since we know neural networks generalize better as they get larger, they must be approximating a simplicity prior better as they do so.
What in your view is the fundamental difference between world models and goals such that the former generalise well and the latter generalise poorly?
One can easily construct a model with a free parameter X and training data such that many choices of X match the training data but diverge in situations not represented in it (for example, the model is a physical simulation and X tracks the state of some region in the simulation that will affect the learner's environment later, but hasn't done so during training). The simplest choice, call it x_s, could easily be wrong. We can even moralise the story: the model regards its job as predicting the output under x_s, and if the world happens to operate according to some other x' then the model doesn't care. However, it's still going to be ineffective in the future where the value of X matters.
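A minimal sketch of that construction (hypothetical numbers; the threshold stands in for a region of the simulation that never mattered during training):

```python
import numpy as np

def model(x, slope, X):
    """Toy 'simulation': X is a latent region state that only kicks in
    once the input crosses a threshold never reached during training."""
    return slope * x + (X if x > 1.0 else 0.0)

# All training inputs sit below the threshold, so they pin down `slope`
# but say nothing at all about X.
x_train = np.array([0.0, 0.25, 0.5, 0.75])
y_train = 2.0 * x_train

for X in [0.0, 3.0, -5.0]:  # the "simplest" choice x_s = 0 and two rivals
    train_err = max(abs(model(x, 2.0, X) - y) for x, y in zip(x_train, y_train))
    print(X, train_err, model(2.0, 2.0, X))  # identical training error, divergent predictions at x = 2

# If the world actually runs on X = 3.0, a learner that settled on x_s = 0.0
# fits the training data perfectly and is still wrong once X starts to matter.
```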