Since there are “more” possible schemers than non-schemers, the argument goes, we should expect training to produce schemers most of the time. In Carlsmith’s words: [...]
It’s important to note that the exact counting argument you quote isn’t one that Carlsmith endorses, just one that he is explaining. And in fact Carlsmith specifically notes that you can’t just apply something like the principle of indifference without more reasoning about the actual neural network prior.
(You mention this later in the “simplicity arguments” section, but I think this objection is sufficiently important, and sufficiently absent early in the post, that it is worth emphasizing.)
Quoting somewhat more context:
I start, in section 4.2, with what I call the “counting argument.” It runs as follows:
1. The non-schemer model classes, here, require fairly specific goals in order to get high reward.
2. By contrast, the schemer model class is compatible with a very wide range of (beyond-episode) goals, while still getting high reward (at least if we assume that the other requirements for scheming to make sense as an instrumental strategy are in place—e.g., that the classic goal-guarding story, or some alternative, works).
3. In this sense, there are “more” schemers that get high reward than there are non-schemers that do so.
4. So, other things equal, we should expect SGD to select a schemer.
Something in the vicinity accounts for a substantial portion of my credence on schemers (and I think it often undergirds other, more specific arguments for expecting schemers as well). However, the argument I give most weight to doesn’t move immediately from “there are more possible schemers that get high reward than non-schemers that do so” to “absent further argument, SGD probably selects a schemer” (call this the “strict counting argument”), because it seems possible that SGD actively privileges one of these model classes over the others. Rather, the argument I give most weight to is something like:
1. It seems like there are “lots of ways” that a model could end up a schemer and still get high reward, at least assuming that scheming is in fact a good instrumental strategy for pursuing long-term goals.
2. So absent some additional story about why training won’t select a schemer, it feels, to me, like the possibility should be getting substantive weight.

I call this the “hazy counting argument.” It’s not especially principled, but I find that it moves me.
We argue against the counting argument in general (more specifically, against the presumption of a uniform prior as a “safe default” to adopt in the absence of better information). This applies to the hazy counting argument as well.
We also don’t really think there’s that much difference between the structure of the hazy argument and the strict one. Both are trying to introduce some form of ~uniformish prior over the outputs of a stochastic AI generating process. The strict counting argument at least has the virtue of being precise about which stochastic processes it’s talking about.
If anything, having more moving parts in the causal graph responsible for producing the distribution over AI goals should make you more skeptical of assigning a uniform prior to that distribution.
I agree that you can’t adopt a uniform prior. (By uniform prior, I assume you mean something like: we represent goals as functions from world states to (real) numbers, where the number says how good the world state is, and then we take a uniform distribution over this function space. Uniform sampling from function space is extremely, extremely cursed for analysis-related reasons without imposing some additional constraints, so it’s not clear uniform sampling even makes sense!)
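To make the “cursed” point a bit more concrete, here is a minimal sketch (entirely my own toy construction, with an arbitrary choice of states and utility levels): even saying “uniform over goal functions” forces you to pick a finite representation first, and everything downstream depends on that choice.

```python
import random

# Toy sketch (not from the discussion above): a "goal" as a function from world
# states to utility levels. A uniform prior over this function space only makes
# sense after discretizing both sides; over real-valued utilities there is no
# uniform distribution to take at all.
states = ["s0", "s1", "s2"]        # arbitrary tiny world: 3 states
utility_levels = list(range(5))    # arbitrary discretization: 5 utility levels

# Number of goal functions under this discretization: |levels| ** |states| = 125.
num_goal_functions = len(utility_levels) ** len(states)
print(num_goal_functions, "goal functions under this discretization")

# "Uniform prior" here means: pick each state's utility independently and uniformly.
def sample_goal():
    return {s: random.choice(utility_levels) for s in states}

print(sample_goal())
# Any counting or measure conclusion drawn from this prior is a fact about the
# chosen discretization as much as about the goals themselves.
```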
Separately, I’m also skeptical that any serious historical arguments were actually assuming a uniform prior, as opposed to trying to actually reason about the complexity/measure of various goals in terms of some fixed world model, given some vague guess about the representation of this world model. That approach is also somewhat dubious, since it assumes a goal slot, assumes a world model, and requires guessing at the representation of the world model.
(You’ll note that ~all of the earlier arguments mention terms like “complexity” and “bits”.)
Of course, the “Against goal realism” and “Simplicity arguments” sections can apply here, and indeed I’m much more sympathetic to these sections than to the counting argument section, which seems like a strawman as far as I can tell. (I tried to get to ground on this by communicating back and forth some with you and some with Alex Turner, but I failed, so now I’m just voicing my issues for third parties.)
I don’t think this is a strawman. E.g., in How likely is deceptive alignment?, Evan Hubinger says:

We’re going to start with simplicity. Simplicity is about specifying the thing that you want in the space of all possible things. You can think about simplicity as “How much do you have to aim to hit the exact thing in the space of all possible models?” How many bits does it take to find the thing that you want in the model space? And so, as a first pass, we can understand simplicity by doing a counting argument, which is just asking, how many models are in each model class?
First, how many Christs are there? Well, I think there’s essentially only one, since there’s only one way for humans to be structured in exactly the same way as God. God has a particular internal structure that determines exactly the things that God wants and the way that God works, and there’s really only one way to port that structure over and make the unique human that wants exactly the same stuff.
Okay, how many Martin Luthers are there? Well, there’s actually more than one Martin Luther (contrary to actual history) because the Martin Luthers can point to the Bible in different ways. There’s a lot of different equivalent Bibles and a lot of different equivalent ways of understanding the Bible. You might have two copies of the Bible that say exactly the same thing such that it doesn’t matter which one you point to, for example. And so there’s more Luthers than there are Christs.
But there’s even more Pascals. You can be a Pascal and it doesn’t matter what you care about. You can care about anything in the world, all of the various different possible things that might exist for you to care about, because all that Pascal needs to do is care about something over the long term, and then have some reason to believe they’re going to be punished if they don’t do the right thing. And so there’s just a huge number of Pascals because they can care about anything in the world at all.
So the point is that there’s more Pascals than there are the others, and so probably you’ll have to fix fewer bits to specify them in the space.
Evan then goes on to try to use the complexity of the simplest member of each model class as an estimate for the size of the classes (which is probably wrong, IMO, but I’m also not entirely sure how he’s defining the “complexity” of a given member in this context), but this section seems more like an elaboration on the above counting argument. Evan calls it “a slightly more concrete version of essentially the same counting argument”.
And IMO, it’s pretty clear that the above quoted argument is implicitly appealing to some sort of uniformish prior assumption over ways to specify different types of goal classes. Otherwise, why would it matter that there are “more Pascals”, unless Evan thought the priors over the different members of each category were sufficiently similar that he could assess their relative likelihoods by enumerating the number of “ways” he thought each type of goal specification could be structured?
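To spell out why that assumption is doing the work, here is a toy calculation (all numbers made up purely for illustration): a class with far more members can still have far less total prior measure if its members are individually much harder to hit.

```python
# Made-up numbers, purely illustrative: member counts and per-member prior
# measure for two hypothetical model classes.
n_pascal_like, measure_per_pascal = 10**9, 1e-13   # "more Pascals", tiny measure each
n_other, measure_per_other = 10**2, 1e-4           # far fewer members, more measure each

print("Pascal-like class, total measure:", n_pascal_like * measure_per_pascal)  # ~1e-04
print("other class, total measure:      ", n_other * measure_per_other)         # ~1e-02

# Counting members only settles the comparison if the per-member measures are
# roughly comparable, which is exactly the uniformish-prior assumption at issue.
```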
Look, Evan literally called his thing a “counting argument”, Joe said “Something in this vicinity [of the hazy counting argument] accounts for a substantial portion of [his] credence on schemers [...] and often undergirds other, more specific arguments”, and EY often expounds on the “width” of mind design space. I think counting arguments represent substantial intuition pumps for a lot of people (though often implicitly so), so I think a post pushing back on them in general is good.
I’m sympathetic to pushing back on counting arguments on the grounds that ‘it’s hard to know what the exact measure should be, so maybe the measure on the goal of “directly pursue high performance/anything nearly perfectly correlated with the outcome that was reinforced (aka reward)” is comparable to or bigger than the measure on “literally any long-run outcome”’.
So I appreciate the pushback here. I just think the exact argument, and the comparison to overfitting, amount to a strawman.
(Note that above I’m assuming a specific goal slot, that the AI’s predictive machinery is aware of what its goal slot contains, and that in order for the AI to perform well enough to be a plausible result of training it has to explicitly “play the training game” (e.g. explicitly reason about and try to get high performance). It also seems reasonable to contest these assumptions, but that is a different issue than the counting argument.)
(Also, if we imagine an RL’d neural network computing a bunch of predictions, then it does seem plausible that it will have a bunch of long-horizon predictions with higher aggregate measure than predictions of things that perfectly correlate with the outcome that was reinforced (aka reward)! As in, if we imagine randomly sampling a linear probe, it will be far more likely that we sample a probe where most of the variance is driven by long-run outcomes than that we sample a linear probe which is almost perfectly correlated with reward (e.g. a near-perfect predictor of reward up to monotone regression). Neural networks are likely to compute a bunch of long-range predictions at least as intermediates, but they only need to compute things that nearly perfectly correlate with reward once! (With some important caveats about transfer from other distributions.))
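Here is a quick simulation of that random-probe intuition, under a toy model I made up (50 independent unit-variance “long-run prediction” features plus one feature that is a perfect reward correlate): randomly sampled probes are essentially never near-perfectly correlated with the reward correlate, and their variance is almost always dominated by the long-run features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature model (mine, not from the comment): many independent long-run-outcome
# predictions plus one perfect reward correlate, all unit variance.
n_samples, n_longrun = 2_000, 50
longrun_feats = rng.normal(size=(n_samples, n_longrun))
reward_correlate = rng.normal(size=n_samples)
features = np.hstack([longrun_feats, reward_correlate[:, None]])

n_probes = 2_000
near_perfect = 0
longrun_dominated = 0
for _ in range(n_probes):
    w = rng.normal(size=features.shape[1])   # a randomly sampled linear probe
    out = features @ w
    r = np.corrcoef(out, reward_correlate)[0, 1]
    if abs(r) > 0.95:
        near_perfect += 1
    # With independent unit-variance features, squared weights give each feature's
    # share of the probe's output variance.
    if (w[:n_longrun] ** 2).sum() > w[-1] ** 2:
        longrun_dominated += 1

print(f"probes nearly perfectly correlated with reward: {near_perfect}/{n_probes}")
print(f"probes whose variance is mostly long-run driven: {longrun_dominated}/{n_probes}")
```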
I also think Evan’s arguments are pretty sloppy in this presentation and he makes a bunch of object-level errors/egregious simplifications FWIW, but he is actually trying to talk about models represented in weight space and how many bits are required to specify this. (Not how many bits are required in function space, which would be crazy!)
A more charitable interpretation of “bits in model space” is something like “among the initialization space of the neural network, how many bits are required to point at this subset relative to other subsets”. I think this corresponds to a view like “neural network inductive biases are well approximated by doing conditional sampling from the initialization space” (à la Mingard et al.). I think Evan makes errors in reasoning about this space, and that his problematic simplifications (at least for the Christ argument) are similar to some sort of “principle of indifference” (it makes similar errors), but I also think that his errors aren’t quite this and that there is a recoverable argument here. (See my parentheticals above.)
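As a minimal sketch of what that charitable reading could look like operationally (my own toy setup, loosely in the spirit of Mingard et al., not anything Evan actually computes): treat “bits to point at a model class” as the negative log-probability that a randomly initialized network already lands in that class.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Toy operationalization (mine): "bits to specify" a behavioral class under the
# initialization measure = -log2 of the probability that a randomly initialized
# network already implements a behavior in that class.
inputs = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], dtype=float)

def random_net_behavior(hidden=8):
    """Sample a tiny random tanh MLP and return its binary behavior on all 8 inputs."""
    w1 = rng.normal(size=(3, hidden))
    b1 = rng.normal(size=hidden)
    w2 = rng.normal(size=hidden)
    h = np.tanh(inputs @ w1 + b1)
    return tuple(int(v) for v in (h @ w2 > 0))

counts = Counter(random_net_behavior() for _ in range(20_000))
total = sum(counts.values())

for behavior, c in counts.most_common(3):
    print(behavior, f"~{-np.log2(c / total):.1f} bits")
# Simple behaviors (e.g. the constant functions) cost few bits under this measure,
# while most of the 2**8 possible behaviors are sampled rarely or never, i.e. they
# cost many bits. Whether counting "ways" to specify goals tracks this kind of
# measure is the crux.
```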
“There is only 1 Christ” is straightforwardly wrong in practice due to gauge invariances and other equivalences in weight space. (But it might be spiritually right? I’m skeptical it is, honestly.)
The rest of the argument is too vague to know if it’s really wrong or right.
Evan then goes on to try to use the complexity of the simplest member of each model class as an estimate for the size of the classes (which is probably wrong, IMO, but I’m also not entirely sure how he’s defining the “complexity” of a given member in this context)
[Low importance aside]
I think this is equivalent to a well-known approximation from algorithmic information theory. I think this approximation might be too lossy in practice in the case of actual neural nets, though.
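For what it’s worth, the approximation I take this to be gesturing at, stated loosely for a finite model class $\mathcal{C}$ under a complexity-weighted prior $P(m) \propto 2^{-K(m)}$ (this formalization is my gloss, not something spelled out above):

$$
\min_{m \in \mathcal{C}} K(m) \;-\; \log_2 |\mathcal{C}| \;\le\; -\log_2 \sum_{m \in \mathcal{C}} 2^{-K(m)} \;\le\; \min_{m \in \mathcal{C}} K(m).
$$

So, up to an additive slack of at most $\log_2 |\mathcal{C}|$ bits (and usually much less, since the sum is dominated by its simplest members), the log prior mass of the whole class matches the complexity of its simplest member. Whether the actual neural network prior is close enough to complexity-weighted for this to carry over is exactly the “too lossy in practice” worry.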