Idea for using current AI to accelerate medical research: suppose you were to take a VLM and train it to verbally explain the differences between two image data distributions. E.g., you could take 100 dog images, split them into two classes, insert tiny rectangles into the class 1 images, feed all 100 images into the VLM, and then train it to generate the text “class 1 has tiny rectangles in the images”. Repeat this for a bunch of different augmented datasets where we know exactly how they differ, aiming for a VLM with a general ability to in-context learn and verbally describe the differences between two sets of images. As training progresses, keep increasing the number and subtlety of the differences, while training the VLM to describe all of them.
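To make the synthetic-data step concrete, here is a minimal sketch of generating one such training example. The rectangle augmentation, split sizes, file layout, and target text are illustrative placeholders, not a fixed recipe:

```python
# Sketch: build one synthetic "describe the difference" training example.
# Assumes a directory of dog images; augmentation and label text are placeholders.
import random
from pathlib import Path
from PIL import Image, ImageDraw

def add_tiny_rectangle(img: Image.Image) -> Image.Image:
    """Insert a small rectangle at a random location (the class-1 augmentation)."""
    img = img.copy()
    draw = ImageDraw.Draw(img)
    w, h = img.size
    x, y = random.randint(0, w - 10), random.randint(0, h - 10)
    draw.rectangle([x, y, x + 8, y + 8], fill="black")
    return img

def make_example(image_dir: str, n_per_class: int = 50):
    paths = random.sample(list(Path(image_dir).glob("*.jpg")), 2 * n_per_class)
    class_0 = [Image.open(p).convert("RGB") for p in paths[:n_per_class]]
    class_1 = [add_tiny_rectangle(Image.open(p).convert("RGB")) for p in paths[n_per_class:]]
    target_text = "class 1 has tiny rectangles in the images"
    # The VLM is trained to map (class_0 images, class_1 images) -> target_text.
    return class_0, class_1, target_text
```

Each augmented dataset like this gives one (image sets, description) pair; the curriculum just swaps in different augmentations and longer, subtler descriptions.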
Then, apply the model to various medical images. E.g., brain scans of people who are about to develop dementia versus those who aren’t, skin photos of malignant and non-malignant blemishes, electron microscope images of cancer cells that can / can’t survive some drug regimen, etc. See if the VLM can describe any new, human-interpretable features.
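Applying the trained model would just mean feeding it two real image sets instead of an augmented pair. A sketch of what that query might look like; `vlm`, its `generate` interface, and the prompt are hypothetical placeholders rather than an existing API:

```python
# Sketch: query the trained difference-describing VLM on two real image sets.
# The vlm object and its generate() signature are hypothetical placeholders.
from pathlib import Path
from PIL import Image

def describe_difference(vlm, dir_a: str, dir_b: str, max_images: int = 50) -> str:
    set_a = [Image.open(p).convert("RGB") for p in sorted(Path(dir_a).glob("*.png"))[:max_images]]
    set_b = [Image.open(p).convert("RGB") for p in sorted(Path(dir_b).glob("*.png"))[:max_images]]
    prompt = "Describe every visual feature that distinguishes set B from set A."
    return vlm.generate(images_a=set_a, images_b=set_b, prompt=prompt)

# e.g., brain scans of people who later do / don't develop dementia:
# candidate_features = describe_difference(vlm, "scans/no_dementia", "scans/pre_dementia")
```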
The VLM would generate a lot of false positives, obviously. But once you know about a possible feature, you can manually check whether it actually distinguishes other examples of the classes you’re interested in. Once you find valid features, you can add them to the VLM’s training data, so it’s no longer trained only on synthetic augmentations.
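That manual check could be partly automated: score each candidate feature on held-out images and see whether it separates the classes better than chance. In this sketch, `feature_is_present` is a placeholder judge (a human rater, a separate classifier, or another VLM query):

```python
# Sketch: check whether a candidate feature actually separates held-out examples.
# feature_is_present is a placeholder judge, not an existing function.
from typing import Callable, Sequence

def validate_feature(
    feature_description: str,
    positives: Sequence,           # held-out images from the class the feature should mark
    negatives: Sequence,           # held-out images from the other class
    feature_is_present: Callable,  # (image, feature_description) -> bool
) -> float:
    """Return balanced accuracy of the feature used as a one-rule classifier."""
    tpr = sum(feature_is_present(img, feature_description) for img in positives) / len(positives)
    tnr = sum(not feature_is_present(img, feature_description) for img in negatives) / len(negatives)
    return (tpr + tnr) / 2

# Features scoring well above 0.5 on fresh data are candidates to add to the training set.
```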
You might have to start with real datasets that are particularly easy to tell apart, in order to jumpstart your VLM’s ability to accurately describe the differences in real data.
The other issue with this proposal is that the difference detection currently happens entirely via in-context learning. This is inefficient and expensive (100 images is a lot to fit in a single context!). Ideally, the VLM would learn the difference between the classes by actually being trained on images from those classes, and would learn to connect the resulting knowledge to language descriptions of the associated differences through some sort of meta-learning setup. Not sure how best to do that, though.
I think the fact that image models struggle with hands and text actually points to convergence between human and NN learning dynamics. Human visual cortices are also bad at hands and text, to the point that lucid dreamers often look for issues with their hands or with nearby text to check whether they’re dreaming.
One thing that I think causes people to underestimate the degree of convergence between brain and NN learning is comparing the behaviors of entire brains to the behaviors of individual NNs. Brains consist of many different regions which are “trained” on different internal objectives, then interact with each other to collectively produce human outputs. In contrast, most current NNs contain only one “region”, which is trained end to end on the single objective of imitating certain subsets of human behavior.
We should thus expect NN learning dynamics to most closely resemble those of single brain regions, and expect that the best match for humanlike generalization patterns will come from putting together multiple NNs that interact with each other in roughly the way human brain regions do.