ryan_greenblatt comments on Counting arguments provide no evidence for AI doom

ryan_greenblatt 28 Feb 2024 0:45 UTC
LW: 6 AF: 5
I’m sympathetic to pushing back on counting arguments on the grounds that ‘it’s hard to know what the exact measure should be, so maybe the measure on the goal of “directly pursue high performance/anything nearly perfectly correlated with the outcome that is reinforced (aka reward)” is comparable to or bigger than the measure on “literally any long run outcome”’.

So I appreciate the pushback here. I just think the exact argument and the comparison to overfitting are a strawman.

(Note that above I’m assuming a specific goal slot, that the AI’s predictions are aware of what its goal slot contains, and that in order for the AI to perform well enough to be a plausible result of training it has to explicitly “play the training game” (e.g. explicitly reason about and try to get high performance). It also seems reasonable to contest these assumptions, but this is a different thing than the counting argument.)

(Also, if we imagine an RL’d neural network computing a bunch of predictions, then it does seem plausible that it will have a bunch of long-horizon predictions with higher aggregate measure than predictions of things that perfectly correlate with the outcome that was reinforced (aka reward)! As in, if we imagine randomly sampling a linear probe, we will be far more likely to sample a probe where most of the variance is driven by long run outcomes than to sample a linear probe which is almost perfectly correlated with reward (e.g. a near-perfect predictor of reward up to monotone regression). Neural networks are likely to compute a bunch of long range predictions at least as intermediates, but they only need to compute things that nearly perfectly correlate with reward once! (With some important caveats about transfer from other distributions.))
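
To make the probe-sampling intuition concrete, here is a toy sketch. Everything in it is an assumption added for illustration and not part of the original comment: the network's features are modeled as one "reward" feature plus `num_long_run` "long-run outcome" features, all independent with unit variance, and a random linear probe is a vector of i.i.d. Gaussian weights over those features. The function name `probe_stats` and the thresholds (0.5 for "most of the variance", 0.99 for "almost perfectly correlated") are hypothetical choices. It ignores the caveats about transfer from other distributions.

```python
import math
import random


def probe_stats(num_long_run=20, num_probes=10_000, seed=0):
    """Sample random linear probes over a toy feature space.

    Toy assumption: 1 reward feature + num_long_run long-run features,
    all independent with unit variance. Under that assumption, the share
    of a probe's output variance due to the reward feature is
    w_reward**2 / sum(w_i**2), which is also the squared correlation
    between the probe output and the reward feature.
    """
    rng = random.Random(seed)
    mostly_long_run = 0       # probes with > 50% of variance from long-run features
    near_perfect_reward = 0   # probes with |correlation with reward| > 0.99

    for _ in range(num_probes):
        w_reward = rng.gauss(0.0, 1.0)
        w_long = [rng.gauss(0.0, 1.0) for _ in range(num_long_run)]

        total_var = w_reward ** 2 + sum(w * w for w in w_long)
        reward_share = w_reward ** 2 / total_var

        if reward_share < 0.5:
            mostly_long_run += 1
        if math.sqrt(reward_share) > 0.99:
            near_perfect_reward += 1

    return mostly_long_run / num_probes, near_perfect_reward / num_probes


if __name__ == "__main__":
    frac_long, frac_reward = probe_stats()
    print(f"fraction of probes dominated by long-run features: {frac_long:.4f}")
    print(f"fraction of probes nearly perfectly tracking reward: {frac_reward:.4f}")
```

In this toy setup nearly every sampled probe is dominated by the long-run features and almost none is a near-perfect reward predictor, simply because there are many long-run directions and only one reward direction. How well this transfers to a real RL'd network depends on the assumptions above, in particular on how many long-range predictions the network actually computes.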

I also think Evan’s arguments are pretty sloppy in this presentation and he makes a bunch of object-level errors/egregious simplifications FWIW, but he is actually trying to talk about models represented in weight space and how many bits are required to specify them. (Not how many bits are required in function space, which is crazy!)

By “bits in model space” a more charitable interpretation is something like “among the initialization space of the neural network, how many bits are required to point at this subset relative to other subsets”. I think this corresponds to a view like “neural network inductive biases are well approximated by doing conditional sampling from the initialization space” (à la Mingard et al.). I think Evan makes errors in reasoning about this space and that his problematic simplifications (at least for the Christ argument) are similar to some sort of “principle of indifference” (it makes similar errors), but I also think that his errors aren’t quite this and that there is a recoverable argument here. (See my parentheticals above.)
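
One way to write down this charitable reading, as a sketch under assumed notation rather than anything stated in Evan’s presentation: let $p_{\text{init}}$ be the initialization distribution over weights $\theta$, let $S$ be the set of weights implementing some class of models, and let $D$ be the set of weights that fit the training data well. All of these symbols are introduced here for illustration.

```latex
% Bits needed to point at the subset S under the initialization measure
\operatorname{bits}(S) = -\log_2 \Pr_{\theta \sim p_{\text{init}}}\left[\theta \in S\right]

% Bits for S_1 relative to S_2 (positive when S_1 has smaller measure)
\operatorname{bits}(S_1) - \operatorname{bits}(S_2)
  = \log_2 \frac{\Pr_{\theta \sim p_{\text{init}}}[\theta \in S_2]}
                {\Pr_{\theta \sim p_{\text{init}}}[\theta \in S_1]}

% "Conditional sampling from the initialization space": condition on fitting the data
\Pr\left[\theta \in S \mid \theta \in D\right]
  = \frac{\Pr_{\theta \sim p_{\text{init}}}[\theta \in S \cap D]}
         {\Pr_{\theta \sim p_{\text{init}}}[\theta \in D]}
```

On this reading the relevant quantity is a measure over weight space, so many distinct weight settings can contribute to the same subset $S$.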

“There is only 1 Christ” is straightforwardly wrong in practice due to gauge invariances and other equivalences in weight space. (But might be spiritually right? Honestly, I’m skeptical that it is.)

The rest of the argument is too vague to know if it’s really wrong or right.