I think the infinite bitstring case has zero relevance to deep learning.
There does exist a concept you might call “simplicity” which is relevant to deep learning. The neural network Gaussian process describes the prior distribution over functions which is induced by the initialization distribution over neural net parameters. Under weak assumptions about the activation function and initialization variance, the NNGP is biased toward lower frequency functions. I think this cuts against scheming, and we plan to write up a post on this in the next month or two.
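As a toy illustration of that bias (a sketch of my own; the 1-D setup, architecture, and variances are arbitrary choices, not the forthcoming post’s analysis), you can sample functions from randomly initialized networks and look at where their spectral power sits:

```python
# Hedged toy sketch: draw 1-D functions from the init distribution of a
# one-hidden-layer tanh MLP and average their power spectra. All of the
# hyperparameters below are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 256)

def sample_from_init(width=512, sigma_w=1.5, sigma_b=0.1):
    """One function drawn from the initialization distribution."""
    w1 = rng.normal(0.0, sigma_w, size=(1, width))
    b1 = rng.normal(0.0, sigma_b, size=width)
    w2 = rng.normal(0.0, sigma_w / np.sqrt(width), size=(width, 1))
    return (np.tanh(x[:, None] @ w1 + b1) @ w2).ravel()

# Average power spectrum over many independent draws from the prior.
power = np.mean([np.abs(np.fft.rfft(sample_from_init())) ** 2
                 for _ in range(200)], axis=0)
power /= power.sum()
print("fraction of power in the 4 lowest frequencies: ", power[:4].sum())
print("fraction of power in the 4 highest frequencies:", power[-4:].sum())
```

Running this, almost all of the spectral power sits in the lowest few frequencies, which is the kind of low-frequency bias I mean.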
I think the infinite bitstring case has zero relevance to deep learning.
I think you are still not really understanding my objection. It’s not that there is a “finite bitstring case” and an “infinite bitstring case”. My objection is that the sort of finite bitstring analysis that you use does not yield any well-defined mathematical object that you could call a prior, and certainly not one that would predict generalization.
I never used any kind of bitstring analysis.
Yes, that’s exactly the problem: you tried to make a counting argument, but because you didn’t engage with the proper formalism, you ended up using reasoning that doesn’t actually correspond to any well-defined mathematical object.
Analogously, it’s like you wrote an essay about why 0.999… != 1 and your response to “under the formalism of real numbers as Dedekind cuts, those are identical” was “where did I say I was referring to Dedekind cuts?” It’s fine if you don’t want to use the standard formalism, but you need some formalism to anchor your words to; otherwise you’re just pushing words around with no real way to ensure that they actually correspond to something. I think the 0.999… != 1 analogy is quite apt here, because the problem really is that there is no formalism resembling the real numbers you know under which 0.999… != 1, in the same way that there is no formalism under which the sort of reasoning you’re using is meaningful.
Yes, that’s exactly the problem: you tried to make a counting argument, but because you didn’t engage with the proper formalism, you ended up using reasoning that doesn’t actually correspond to any well-defined mathematical object.
Analogously, it’s like you wrote an essay about why 0.999… != 1 and your response to “under the formalism of real numbers as Dedekind cuts, those are identical” was “where did I say I was referring to Dedekind cuts?”
No. I think you are wrong. This passage makes me suspect that you didn’t understand the arguments Nora was trying to make. Her arguments are easily formalizable as critiquing an indifference principle over functions in function-space, as opposed to over parameterizations in parameter-space. I’ll write this out for you if you really want me to.
I think you should be more cautious about unilaterally diagnosing Nora’s “errors”, as opposed to asking for clarification, because I think you two agree a lot more than you realize.
I agree that there is a valid argument that critiques counting arguments over function space that sort of has the same shape as the one presented in this post. If that was what the authors had in mind, it was not what I got from reading the post, and I haven’t seen anyone making that clarification other than yourself.
Regardless, though, I think that’s still not a great objection to counting arguments for deceptive alignment in general, because it’s explicitly responding only to a very weak and obviously wrong form of a counting argument. My response there is just that of course you shouldn’t run a counting argument over function space—I would never suggest that.
I think you should have asked for clarification before making blistering critiques about how Nora “ended up using reasoning that doesn’t actually correspond to any well-defined mathematical object.” I think your comments paint a highly uncharitable and (more importantly) incorrect view of N/Q’s claims.
My response there is just that of course you shouldn’t run a counting argument over function space—I would never suggest that.
Your presentations often include a counting argument over a function space, in the form of “saints” versus “schemers” and “sycophants.” So it seems to me that you do suggest that. What am I missing?
I also welcome links to counting arguments which you consider stronger. I know you said you haven’t written one up yet to your satisfaction, but surely there have to be some non-obviously wrong and weak arguments written up, right?
I think you should have asked for clarification before making blistering critiques about how Nora “ended up using reasoning that doesn’t actually correspond to any well-defined mathematical object.” I think your comments paint a highly uncharitable and (more importantly) incorrect view of N/Q’s claims.
I’m happy to apologize if I misinterpreted anyone, but afaict my critique remains valid. My criticism is precisely that counting arguments over function space aren’t generally well-defined, and even if they were they wouldn’t be the right way to run a counting argument. So my criticism that the original post misunderstands how to properly run a counting argument still seems correct to me. Perhaps you could say that it’s not the authors’ fault, that they were responding to weak arguments that other people were actually making, but regardless the point remains that the authors haven’t engaged with the sort of counting arguments that I actually think are valid.
Your presentations often include a counting argument over a function space, in the form of “saints” versus “schemers” and “sycophants.” So it seems to me that you do suggest that. What am I missing?
What makes you think that’s intended to be a counting argument over function space? I usually think of this as a counting argument over infinite bitstrings, as I noted in my comment (though there are many other valid presentations). It’s possible I said something in that talk that gave a misleading impression there, but I certainly don’t believe and have never believed in any counting arguments over function space.
afaict my critique remains valid. My criticism is precisely that counting arguments over function space aren’t generally well-defined, and even if they were they wouldn’t be the right way to run a counting argument.
Going back through the post, Nora+Quintin indeed made a specific and perfectly formalizable claim here:
These results strongly suggest that SGD is not doing anything like sampling uniformly at random from the set of representable functions that do well on the training set.
They’re making a perfectly valid point. The point was in the original post AFAICT—it wasn’t only just now explained by me. I agree that they could have presented it more clearly, but that’s a way different critique than “you’re using reasoning that doesn’t actually correspond to any well-defined mathematical object.”
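(To make the formalizable version concrete, here’s a toy construction of my own, not Nora+Quintin’s code: put a uniform “indifference” prior over all boolean functions on a small domain, condition on fitting the training set, and see what it predicts off the training set. This is a perfectly well-defined object, and it predicts chance-level generalization, which is exactly the prediction the quoted passage says SGD empirically falsifies.)

```python
# Toy formalization (my construction): uniform prior over ALL boolean
# functions on a 6-element domain, conditioned on fitting a 3-point
# training set. The made-up labels below are purely illustrative.
from itertools import product

domain_size = 6
train = {0: 0, 1: 1, 2: 0}   # three labeled training inputs
test_point = 5

consistent = [f for f in product([0, 1], repeat=domain_size)
              if all(f[x] == y for x, y in train.items())]

# Under the uniform-over-functions prior, the test point is a coin flip:
p_one = sum(f[test_point] for f in consistent) / len(consistent)
print(len(consistent), "functions fit the data; P(f(5)=1) =", p_one)  # 8 and 0.5
```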
regardless the point remains that the authors haven’t engaged with the sort of counting arguments that I actually think are valid.
If that’s truly your remaining objection, then I think that you should retract the unmerited criticisms about how they’re trying to prove 0.9999… != 1 or whatever. In my opinion, you have confidently misrepresented their arguments, and the discussion would benefit from your revisions.
And then it’d be nice if someone would provide links to the supposed valid counting arguments! From my perspective, it’s very frustrating to hear that there (apparently) are valid counting arguments but also they aren’t the obvious well-known ones that everyone seems to talk about. (But also the real arguments aren’t linkable.)
If that’s truly the state of the evidence, then I’m happy to just conclude that Nora+Quintin are right, and update if/when actually valid arguments come along.
If that’s truly your remaining objection, then I think that you should retract the unmerited criticisms about how they’re trying to prove 0.9999… != 1 or whatever. In my opinion, you have confidently misrepresented their arguments, and the discussion would benefit from your revisions.
This point seems right to me: if the post is specifically about representable functions, then that is a valid formalization AFAICT. (Though an extremely cursed formalization, for reasons mentioned in a variety of places. And if you dropped “representable”, then it’s extremely, extremely cursed for various analysis-related reasons, though I think there is still a theoretically sound uniform measure maybe???)
It would also be nice if the original post:
Clarified that the rebuttal is specifically about a version of the counting-argument which counts functions.
Noted that people making counting arguments weren’t intending to count functions, though this might be a common misconception about counting arguments. (It also seems fine to clarify that existing counting arguments are too hand-wavy to really engage with, if that’s the view.) (See also here.)
And then it’d be nice if someone would provide links to the supposed valid counting arguments! From my perspective, it’s very frustrating to hear that there (apparently) are valid counting arguments but also they aren’t the obvious well-known ones that everyone seems to talk about. (But also the real arguments aren’t linkable.)
Isn’t Evan giving you what he thinks is a valid counting argument i.e. a counting argument over parameterizations?
But looking at a bunch of other LW posts (Carlsmith’s report, a dialogue between Ronny Fernandez and Nate[1], Mark Xu talking about the malignity of Solomonoff induction, Paul Christiano talking about NN priors, Evhub’s post on how likely deceptive alignment is, etc.[2]), I have concluded that:
A bunch of LW talk about NN scheming relies on inductive biases of neural nets, or of other learning algorithms.
The arguments individual people make for scheming, including those that may fit the name “counting arguments”, seem to differ greatly. Which is basically the norm in alignment.
Like, Joe Carlsmith lists out a bunch of arguments for scheming regarding simplicity biases, including parameter counts, thinks they’re weak in various ways, and thinks his “intuitive” counting argument is stronger. Ronny and Nate discuss parameter-count mappings and seem to have pretty different views on how much scheming relies on that. Mark Xu claimed, AFAICT around 3 years ago, that Paul Christiano’s arguments about NN biases rely on the Solomonoff prior being malign, which may support Nora’s claim. I am unsure whether Paul Christiano’s arguments for scheming routed through parameter-function mappings. I also have vague memories of johnswentworth talking about the parameter-counting argument in a YouTube video years ago in a way that suggested he supported it, but I can’t find the video.
I think alignment has historically had poor feedback loops (though IMO they’ve improved somewhat in the last few years), and this conceals people’s wildly different models and ontologies, which makes it very hard to notice when people are completely misinterpreting one another. You can have people like Yudkowsky and Hanson who have engaged for hundreds of hours, or maybe more, and still don’t seem to grok each other’s models. I’d bet that this is much more common than people think.
In fact, I think this whole discussion is an example of this.
This was quite recent, so Ronny talking about the shift in the counting argument he was using may well be due to discussions with Quintin, who he was engaging with sometime before the dialogue.
I think this Q/A pair at the bottom provides evidence that Evan has been using the parameter-function map framing for quite a while:
Question: When you say model space, you mean the functional behavior as opposed to the literal parameter space?
Answer: So there’s not quite a one-to-one mapping because there are multiple implementations of the exact same function in a network. But it’s pretty close. I mean, most of the time when I’m saying model space, I’m talking either about the weight space or about the function space where I’m interpreting the function over all inputs, not just the training data.
Though it is also possible that he’s been implicitly lumping the parameter-function map stuff together with the function-space stuff that Nora and Quintin were critiquing.
Isn’t Evan giving you what he thinks is a valid counting argument i.e. a counting argument over parameterizations?
Where is the argument? If you run the counting argument in function space, it’s at least clear why you might think there are “more” schemers than saints. But if you’re going to say there are “more” params that correspond to scheming than there are saint-params, that looks like a substantive empirical claim that could easily turn out to be false.
From my perspective, it’s very frustrating to hear that there (apparently) are valid counting arguments but also they aren’t the obvious well-known ones that everyone seems to talk about. (But also the real arguments aren’t linkable.)
Personally, I don’t think there are “solid” counting arguments, but I think you can think through a bunch more cases and feel like the underlying intuition is at least somewhat reasonable.
Overall, I’m a simple man, I still like Joe’s report : ). Fair enough if you don’t find the arguments in there convincing. I think Joe’s report is pretty close to the SOTA, given some open-mindedness and a bit of reinvention work to fill in various gaps.
What makes you think that’s intended to be a counting argument over function space? I usually think of this as a counting argument over infinite bitstrings
I definitely thought you were making a counting argument over function space, and AFAICT Joe also thought this in his report.
The bitstring version of the argument, to the extent I can understand it, just seems even worse to me. You’re making an argument about one type of learning procedure, Solomonoff induction, which is physically unrealizable and AFAICT has not even inspired any serious real-world approximations, and then assuming that somehow the conclusions will transfer over to a mechanistically very different learning procedure, gradient descent. The same goes for the circuit prior thing (although FWIW I think you’re very likely wrong that minimal circuits can be deceptive).
I’ve argued multiple times that Evan was not intending to make a counting argument in function space:
In discussion with Alex Turner (TurnTrout) when commenting on an earlier draft of this post.
In discussion with Quintin after sharing some comments on the draft. (Also shared with you TBC.)
In this earlier comment.
(Fair enough if you never read any of these comments.)
As I’ve noted in all of these comments, people consistently use terminology when making counting-style arguments (except perhaps in Joe’s report) which rules out the person intending the argument to be about function space. (E.g., people say things like “bits” and “complexity in terms of the world model”.)
(I also think these written-up arguments (Evan’s talk in particular) are very hand-wavy and just provide a vague intuition. So regardless of what he was intending, the actual words of the argument aren’t very solid IMO. Further, using words that rule out the function-space intention doesn’t necessarily imply there is an actually good model behind those words. To actually get anywhere with this reasoning, I think you’d have to reinvent the full argument and think through it in more detail yourself. I also think Evan is substantially wrong in practice, though my current guess is that he isn’t too far off about the bottom line (maybe a factor of 3 off). I think Joe’s report is much better in that it’s very clear what level of abstraction and rigor it’s talking about. From reading this post, it doesn’t seem like you came into this project from the perspective of “is there an interesting recoverable intuition here, can we recover or generate a good argument?”, which would have been considerably better IMO.)
AFAICT Joe also thought this in his report
Based on my conversations with Joe about the report and his comments here, I think he was just operating from a much vaguer counting-argument perspective. As in, he was just talking about the broadly construed counting argument, which can be applied to a wide range of possible inductive biases: for any specific formal model of the situation, a counting-style argument will be somewhat applicable. (Though in practice, we might be able to have much more specific intuitions.)
Note that Joe and Evan have a very different perspective on the case for scheming.
(From my perspective, the correct intuition underlying the counting argument is something like: “you only need to compute something which nearly exactly correlates with predicted reward once, while you’ll need to compute many long-range predictions to perform well in training”. See this comment for a more detailed discussion.)
As I’ve noted in all of these comments, people consistently use terminology when making counting-style arguments (except perhaps in Joe’s report) which rules out the person intending the argument to be about function space. (E.g., people say things like “bits” and “complexity in terms of the world model”.)
Aren’t these arguments about simplicity, not counting?
Fair enough if you never read any of these comments.
Yeah, I never saw any of those comments. I think it’s obvious that the most natural reading of the counting argument is that it’s an argument over function space (specifically, over equivalence classes of functions which correspond to “goals.”) And I also think counting arguments for scheming over parameter space, or over Turing machines, or circuits, or whatever, are all much weaker. So from my perspective I’m attacking a steelman rather than a strawman.
I definitely thought you were making a counting argument over function space, and AFAICT Joe also thought this in his report.
Sorry about that—I wish you had been at the talk and could have asked a question about this.
You’re making an argument about one type of learning procedure, Solomonoff induction, which is physically unrealizable and AFAICT has not even inspired any serious real-world approximations, and then assuming that somehow the conclusions will transfer over to a mechanistically very different learning procedure, gradient descent.
I agree that Solomonoff induction is obviously wrong in many ways, which is why you want to substitute it out for whatever the prior is that you think is closest to deep learning that you can still reason about theoretically. But that should never lead you to do a counting argument over function space, since that is never a sound thing to do.
But that should never lead you to do a counting argument over function space, since that is never a sound thing to do.
Do you agree that “instrumental convergence → meaningful evidence for doom” is also unsound, because it’s a counting argument that most functions of shape Y have undesirable property X?
I think instrumental convergence does provide meaningful evidence of doom, and you can make a valid counting argument for it, but as with deceptive alignment you have to run the counting argument over algorithms not over functions.
It’s not clear to me what an “algorithm” is supposed to be here, and I suspect that this might be cruxy. In particular I suspect (40-50% confidence) that:
You think there are objective and determinate facts about what “algorithm” a neural net is implementing, where
Algorithms are supposed to be something like a Boolean circuit or a Turing machine rather than a neural network, and
We can run counting arguments over these objective algorithms, which are distinct both from the neural net itself and the function it expresses.
I reject all three of these premises, but I would consider it progress if I got confirmation that you in fact believe in them.
So today we’ve learned that:
The real counting argument that Evan believes in is just a repackaging of Paul’s argument for the malignity of the Solomonoff prior, and not anything novel.
Evan admits that Solomonoff is a very poor guide to neural network inductive biases.
At this point, I’m not sure why you’re privileging the hypothesis of scheming at all.
you want to substitute it out for whatever the prior is that you think is closest to deep learning that you can still reason about theoretically.
I mean, the neural network Gaussian process is literally this, and you can make it more realistic by using the neural tangent kernel to simulate training dynamics, perhaps with some finite width corrections. There is real literature on this.
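For instance, here’s a minimal numpy sketch (my own toy data and hyperparameters; the recursion is the standard Cho–Saul arc-cosine kernel for an infinite-width one-hidden-layer ReLU network) of what exact inference with the NNGP looks like:

```python
# Hedged sketch: GP posterior mean under the NNGP kernel of an
# infinite-width one-hidden-layer ReLU network. Data, variances, and the
# noise scale below are arbitrary toy choices.
import numpy as np

def nngp_relu(X1, X2, sw2=2.0, sb2=0.1):
    """One layer of the ReLU NNGP kernel recursion, from the input kernel."""
    d = X1.shape[1]
    k12 = sw2 * (X1 @ X2.T) / d + sb2
    k11 = sw2 * np.sum(X1 ** 2, axis=1) / d + sb2
    k22 = sw2 * np.sum(X2 ** 2, axis=1) / d + sb2
    norm = np.sqrt(np.outer(k11, k22))
    theta = np.arccos(np.clip(k12 / norm, -1.0, 1.0))
    j1 = np.sin(theta) + (np.pi - theta) * np.cos(theta)
    return sw2 / (2 * np.pi) * norm * j1 + sb2

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 3))
y_train = np.sin(X_train.sum(axis=1))            # toy regression target
X_test = rng.normal(size=(5, 3))

K = nngp_relu(X_train, X_train) + 1e-3 * np.eye(len(X_train))  # + obs. noise
mean = nngp_relu(X_test, X_train) @ np.linalg.solve(K, y_train)
print(np.c_[mean, np.sin(X_test.sum(axis=1))])   # posterior mean vs. target
```

The NTK version swaps in (or augments with) a different kernel to model gradient-descent training dynamics rather than exact Bayesian inference.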
The real counting argument that Evan believes in is just a repackaging of Paul’s argument for the malignity of the Solomonoff prior, and not anything novel.
I’m going to stop responding to you now, because it seems that you are just not reading anything that I am saying. For the last time, my criticism has absolutely nothing to do with Solomonoff induction in particular, as I have now tried to explain to you here and here and here etc.
I mean, the neural network Gaussian process is literally this, and you can make it more realistic by using the neural tangent kernel to simulate training dynamics, perhaps with some finite width corrections. There is real literature on this.
Yes—that’s exactly the sort of counting argument that I like! Though note that it can be very hard to reason properly about counting arguments once you’re using a prior like that; it gets quite tricky to connect those sorts of low-level properties to high-level properties about stuff like deception.
I’ve read every word of all of your comments.
I know that you think your criticism isn’t dependent on Solomonoff induction in particular, because you also claim that a counting argument goes through under a circuit prior. It still seems like you view the Solomonoff case as the central one, because you keep talking about “bitstrings.” And I’ve repeatedly said that I don’t think the circuit prior works either, and why I think that.
At no point in this discussion have you provided any reason for thinking that in fact, the Solomonoff prior and/or circuit prior do provide non-negligible evidence about neural network inductive biases, despite the very obvious mechanistic disanalogies.
Yes—that’s exactly the sort of counting argument that I like!
Then make an NNGP counting argument! I have not seen such an argument anywhere. You seem to be alluding to unpublished, or at least little-known, arguments that did not make their way into Joe’s scheming report.
I obviously don’t think the counting argument for overfitting is actually sound, that’s the whole point. But I think the counting argument for scheming is just as obviously invalid, and misuses formalisms just as egregiously, if not moreso.
I deny that your Kolmogorov framework is anything like “the proper formalism” for neural networks. I also deny that the counting argument for overfitting is appropriately characterized as a “finite bitstring” argument, because that suggests I’m talking about Turing machine programs of finite length, which I’m not: I’m directly enumerating functions over a subset of the natural numbers. Are you saying the set of functions over 1...10,000 is not a well-defined mathematical object?
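It is, of course, a perfectly well-defined (if astronomically large) finite set:

```python
# Trivial check (toy numbers of my own): the set of boolean-valued
# functions on {1, ..., 10_000} is finite and easy to count exactly.
n_functions = 2 ** 10_000
print(len(str(n_functions)), "decimal digits")  # 3011 digits: huge but finite
```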
I obviously don’t think the counting argument for overfitting is actually sound, that’s the whole point.
Yes, I’m well aware. The problem is that when you make the counting argument for overfitting, you do so in a way that seriously misuses the formalism, which is why the argument fails. So you can’t draw any lessons about counting arguments for deception from the failure of your counting argument for overfitting.
But I think the counting argument for scheming is just as obviously invalid, and misuses formalisms just as egregiously, if not moreso.
Then show me how! If you think there are errors in the math, please point them out.
Of course, it’s worth stating that I certainly don’t have some sort of airtight mathematical argument proving that deception is likely in neural networks—there are lots of assumptions there that could very well be wrong. But I do think that the basic style of reasoning employed by such arguments is sound.
I deny that your Kolmogorov framework is anything like “the proper formalism” for neural networks.
Err… I’m using K-complexity here because it’s a simple framework to reason about, but my criticism isn’t “you should use K-complexity to reason about neural networks.” I think K-complexity captures some important facts about neural network generalization, but is clearly egregiously wrong in other areas. But there are lots of other formalisms! My criticism isn’t that you should use K-complexity, it’s that you should use any formalism at all.
The basic criticism is that the reasoning you use in the post doesn’t correspond to any formalism at all; it’s self-contradictory and inconsistent. So by all means you should replace K-complexity with something better (that’s what I usually try to do as well) but you still need to be reasoning in a way that’s mathematically consistent.
I also deny that the counting argument for overfitting is appropriately characterized as a “finite bitstring” argument, because that suggests I’m talking about Turing machine programs of finite length, which I’m not: I’m directly enumerating functions over a subset of the natural numbers.
One person’s modus ponens is another’s modus tollens. If you say you have a formalism, and that formalism predicts overfitting rather than generalization, then my first objection to your formalism is that it’s clearly a bad formalism for understanding neural networks in practice. Maybe the most basic thing that any good formalism here should get right is that it should predict generalization; if your formalism doesn’t, then it’s clearly not a good formalism.
Then show me how! If you think there are errors in the math, please point them out.
I’m not aware of any actual math behind the counting argument for scheming. I’ve only ever seen handwavy informal arguments about the number of Christs vs Martin Luthers vs Blaise Pascals. There certainly was no formal argument presented in Joe’s extensive scheming report, which I assumed would be sufficient context for writing this essay.
Well, I presented a very simple formulation in my comment, so that could be a reasonable starting point.
But I agree that unfortunately there hasn’t been that much good formal analysis here that’s been written up. At least on my end, that’s for two reasons:
Most of the formal analysis of this form that I’ve published (e.g. this and this) has been focused on sycophancy (human imitator vs. direct translator) rather than deceptive alignment, as sycophancy is a substantially more tractable problem. Finding a prior that reasonably rules out deceptive alignment seems quite out of reach to me currently; at one point I thought a circuit prior might do it, but I now think that circuit priors don’t get rid of deceptive alignment.
I’m currently more optimistic about empirical evidence than theoretical evidence for resolving this question, which is why I’ve been focusing on projects such as Sleeper Agents.
Right, and I’ve explained why I don’t think any of those analyses are relevant to neural networks. Deep learning simply does not search over Turing machines or circuits of varying lengths. It searches over parameters of an arithmetic circuit of fixed structure, size, and runtime. So Solomonoff induction, speed priors, and circuit priors are all inapplicable. There has been a lot of work in the mainstream science of deep learning literature on the generalization behavior of actual neural nets, and I’m pretty baffled at why you don’t pay more attention to that stuff.
Right, and I’ve explained why I don’t think any of those analyses are relevant to neural networks. Deep learning simply does not search over Turing machines or circuits of varying lengths. It searches over parameters of an arithmetic circuit of fixed structure, size, and runtime. So Solomonoff induction, speed priors, and circuit priors are all inapplicable.
It is trivially easy to modify the formalism to search only over fixed-size algorithms, and in fact that’s usually what I do when I run this sort of analysis. I feel like you still aren’t understanding the key criticism here—it’s really not about Solomonoff induction—and I’m not sure how to explain that in any way other than how I’ve already done so.
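As a toy illustration of what I mean (my own construction here, not a formalism anyone has published): enumerate every program of an exactly fixed size in some tiny language, and look at the induced distribution over the functions those programs compute. The prior over programs is uniform, but the induced prior over functions is far from uniform, and that induced measure is what the counting argument actually runs over.

```python
# Hedged toy construction: all circuits with exactly k gates over two
# boolean inputs. Uniform over circuits is NOT uniform over the functions
# they compute. The gate set and k=3 are arbitrary choices of mine.
from collections import Counter
from itertools import product

OPS = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "XOR":  lambda a, b: a ^ b,
    "NAND": lambda a, b: 1 - (a & b),
}
k = 3  # number of gates; wires 0 and 1 are the inputs x and y

def truth_table(circuit):
    """Evaluate on all 4 input pairs; the 4-bit tuple names the function."""
    table = []
    for x, y in product([0, 1], repeat=2):
        wires = [x, y]
        for op, i, j in circuit:
            wires.append(OPS[op](wires[i], wires[j]))
        table.append(wires[-1])
    return tuple(table)

# Gate g = (op, left wire, right wire) and may read any earlier wire.
gate_choices = [[(op, i, j) for op in OPS
                 for i in range(2 + g) for j in range(2 + g)]
                for g in range(k)]

counts = Counter(truth_table(c) for c in product(*gate_choices))
total = sum(counts.values())
for table, n in counts.most_common(5):
    print(table, f"{n / total:.3f}")  # the induced measure is heavily skewed
```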
There has been a lot of work in the mainstream science of deep learning literature on the generalization behavior of actual neural nets, and I’m pretty baffled at why you don’t pay more attention to that stuff.
I’m going to assume you just aren’t very familiar with my writing, because working through empirical evidence about neural network inductive biases is something I love to do all the time.
It is trivially easy to modify the formalism to search only over fixed-size algorithms, and in fact that’s usually what I do when I run this sort of analysis.
What? Which formalism? I don’t see how this is true at all. Please elaborate or send an example of “modifying” Solomonoff so that all the programs have fixed length, or “modifying” the circuit prior so all circuits are the same size.
No, I’m pretty familiar with your writing. I still don’t think you’re focusing on mainstream ML literature enough because you’re still putting nonzero weight on these other irrelevant formalisms. Taking that literature seriously would mean ceasing to take the Solomonoff or circuit prior literature seriously.
Zero relevance? I’m not saying any infinite bitstrings actually exist in deep learning. I’m saying that my intuitions about how the deep learning measure works DON’T say that there are many more ways to overfit than generalize, and people whose intuitions say otherwise are probably confused, and they’d be less confused if they understood the example/analogy given by the infinite bitstring case.