You should expand this into a top-level post (both because it’s great, and to keep up the dream of still having a website for rationality and not just AI futurism).
Zack_M_Davis
This obviously does not preclude writing for and talking with the ingroup, nor continuing to refine and polish my own world-model. But...well, I feel like I’ve mostly hit diminishing returns on that
I mean, before concluding that you’ve hit diminishing returns, have you looked at one of the standard textbooks on deep learning, like Prince 2023 or Bishop and Bishop 2024? I don’t think I’m suggesting this out of pointless gatekeeping. I actually unironically think if you’re devoting your life to a desperate campaign to get world powers to ban a technology, it’s helpful to have read a standard undergraduate textbook about the thing you’re trying to ban.
We don’t know what would be the simplest functions that approximate current or future training data. Why believe they would converge on something conveniently safe for us?
I mean, you can get a pretty good idea what the simplest function that approximates the data is like by, you know, looking at the data. (In slogan form, the model is the dataset.) Thus, language models—not hypothetical future superintelligences which don’t exist yet, but the actual technology that people are working on today—seem pretty safe for basically the same reason that text from the internet is safe: you’re sampling from the webtext distribution in a customized way.
(In more detail: you use gradient descent to approximate a “next token prediction” function of internet text. To make the result more useful, you customize it away from the plain webtext distribution. To help automate that work, you train a “reward model”: you start with a language model, but instead of the unembedding matrix that translates the residual stream into token probabilities, you tack on a head that you train to predict human thumbs-up/thumbs-down ratings. Then you generate more samples from your base model and use the output of your reward model to decide what gradient updates to apply to them, with a Kullback–Leibler constraint to make sure you don’t update so far that you do something it would be wildly unlikely for the original base model to do. These are the same gradients you would get from adding more data to the pretraining set, except that the data comes from the model itself rather than webtext, and the reward model puts a “multiplier” on the gradient: high reward is like training on that completion a bunch of times, and negative reward issues gradient updates in the opposite direction, to do less of that.)
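(For concreteness, here’s a minimal toy sketch of that “multiplier on the gradient” point: a KL-penalized policy-gradient update on a categorical “policy” over eight possible completions. The setup, variable names, and constants are my own illustrative choices, not any lab’s actual training code.)

```python
import numpy as np

# Toy illustration only: real RLHF operates on token sequences with a learned
# reward model; here the "policy", "base model", and "rewards" are just vectors.
rng = np.random.default_rng(0)

VOCAB = 8                             # eight possible "completions"
base_logits = rng.normal(size=VOCAB)  # frozen base ("reference") model
policy_logits = base_logits.copy()    # policy starts as a copy of the base model
reward = rng.normal(size=VOCAB)       # stand-in for a learned reward model
BETA = 0.1                            # strength of the KL penalty
LR = 0.05

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for step in range(5000):
    pi = softmax(policy_logits)
    ref = softmax(base_logits)
    y = rng.choice(VOCAB, p=pi)       # sample a "completion" from the current policy

    # Reward for the sample, minus a penalty for drifting from the base model.
    adjusted = reward[y] - BETA * (np.log(pi[y]) - np.log(ref[y]))

    # REINFORCE-style update: the adjusted reward multiplies the very same
    # log-likelihood gradient that ordinary training on completion y would give.
    # High reward ~ training on y extra times; negative reward ~ the reverse.
    grad_log_p = -pi
    grad_log_p[y] += 1.0
    policy_logits += LR * adjusted * grad_log_p

print("base model probabilities  :", np.round(softmax(base_logits), 3))
print("tuned policy probabilities:", np.round(softmax(policy_logits), 3))
print("rewards                   :", np.round(reward, 2))
```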
That doesn’t mean future systems will be safe. At some point in the future, when you have AIs training other AIs on AI-generated data too fast for humans to monitor, you can’t just eyeball the data and feel confident that it’s not doing something you don’t want to happen. If your reward model accidentally reinforces the wrong things, then you get more of the wrong things. Importantly, this is a different threat model than “you don’t get what you train for”. In order to react to that threat in a dignified way, I want people to have read the standard undergraduate textbooks and be thinking about how to do better safety engineering in a way that’s oriented around the empirical details. Maybe we die either way, but I intend to die as a computer scientist.
“bias toward” feels insufficiently strong for me to be like “ah, okay, then the problem outlined above isn’t actually a problem.”
You’re right; Steven Byrnes wrote me a really educational comment today about what the correct goal-counting argument looks like, which I need to think more about. But I think it’s crucial that this is fundamentally an argument about generalization and inductive biases, which the black-shape metaphor obscures when you write that “each of these black shapes is basically just as good at passing that particular test” as if it didn’t matter how complex the shape is.
(I don’t think talking to middle schoolers about inductive biases is necessarily hopeless; consider a box behind a tree.)
cause for marginal hope and optimism
I think the temptation to frame technical discussions in terms of pessimism vs. optimism is itself a political distortion that I’m trying to avoid. (Apparently not successfully, if I’m coming off as a voice of marginal hope and optimism.)
You wrote an analogy that attempts to explain a reason why it’s hard to make neural networks do what we want; I’m arguing that the analogy is misleading. That disagreement isn’t about whether the humans survive. It’s about what’s going on with neural networks, and the pedagogy of how to explain it. Even if I’m right, that doesn’t mean the humans survive: we could just be dead for other reasons. But as you know, what matters in rationality is the arguments, not the conclusions; not only are bad arguments for a true conclusion still bad, even suboptimal pedagogy for a true lesson is still suboptimal.
I do not, to be clear, believe that my essay contains falsehoods that become permissible because they help idiots or children make inferential leaps [...] You will never ever ever ever ever see me telling someone a thing I know to be false because I believe that it will result in them outputting a correct belief or a correct behavior
This is good, but I think not saying false things turns out to be a surprisingly low bar, because the selection of which true things you communicate (and which true things you even notice) can have a large distortionary effect if the audience isn’t correcting for it.
Right, but I think a big part of how safety team earns its dignity points is by being as specific as possible about exactly how capabilities team is being suicidal, not just with metaphors and intuition pumps, but with state-of-the-art knowledge: you want to be winning arguments with people who know the topic, not just policymakers and the public. My post on adversarial examples (currently up for 2024 Review voting) is an example of what I think this should look like. I’m not just saying “AI did something weird, therefore AI bad”; I’m reviewing the literature and trying to explain why the weird thing would go wrong.
The question is why that argument doesn’t rule out all the things we do successfully use deep learning for. Do image classification, or speech synthesis, or helpful assistants that speak natural language and know everything on the internet “fall nicely out of any analysis of the neural network prior and associated training dynamics”? These applications are only possible because generalization often works out in our favor. (For example, LLM assistants follow instructions that they haven’t seen before, and can even follow instructions in other languages despite the instruction-tuning data being in English.)
Again, obviously that doesn’t mean superintelligence won’t kill the humans for any number of other reasons that we’ve both read many hundreds of thousands of words about. But in order to convince people not to build it, we want to use the best, most convincing arguments, and “you don’t get what you want out of training” as a generic objection to deep learning isn’t very convincing if it proves too much.
To be clear, I agree that the situation is objectively terrifying and it’s quite probable that everyone dies. I gave a copy of If Anyone Builds It to two math professors of my acquaintance at San Francisco State University (and gave $1K to MIRI) because, in that context, conveying the fact that we’re in danger was all I had bandwidth for (and I didn’t have a better book on hand for that).
But in the context of my own writing, everyone who’s paying attention to me already knows about existential risk; I want my words to be focused on being rigorous and correct, not scaring policymakers and the public (notwithstanding that policymakers and the public should in fact be scared).
To the end of being rigorous and correct, I’m claiming that the “each of these black shapes is basically just as good at passing that particular test” story isn’t a good explanation of why alignment is hard (notwithstanding that alignment is in fact hard), because of the story about deep net architectures being biased towards simple functions.
I don’t think “well, I’m pitching to middle schoolers” saves it. If the actual problem is that we don’t know what training data would imply the behavior we want, rather than the outcomes of deep learning being intrinsically super-chaotic—which would be an entirely reasonable thing to suspect if it’s 2005 and you’re reasoning abstractly about optimization without having any empirical results to learn from—then you should be talking about how we don’t know what teal shape to draw, not that we might get a really complicated black shape for all we know.
I am of course aware that in the political arena, the thing I’m doing here would mark me as “not a team player”. If I agree with the conclusion that superintelligence is terrifying, why would I critique an argument with that conclusion? That’s shooting my own side’s soldiers! I think it would be patronizing for me to explain what the problem with that is; you already know.
All dimensions that turn out to matter for what? Current AI is already implicitly optimizing people to use the word “delve” more often than they otherwise would, which is weird and unexpected, but not that bad in the grand scheme of things. Further arguments are needed to distinguish whether this ends in “humans dead, all value lost” or “transhuman utopia, but with some weird and unexpected features, which would also be true of the human-intelligence-augmentation trajectory.” (I’m not saying I believe in the utopia, but if we want that Pause treaty, we need to find the ironclad arguments that convince skeptical experts, not just appeal to intuition.)
I mean, yes. What else could I possibly say? Of course, yes.
In the spirit of not trying to solve the entire alignment problem at once, I find it hard to say how much my odds would shift without a more specific question. (I think LLMs are doing a pretty good job of knowing and doing what I mean, which implies some form of knowledge of “human values”, but an LLM is only a natural-language instruction-follower; it’s not supposed to be a sovereign superintelligence, which looks vastly harder, and I would rather people not attempt that for a long time.) Show me the arXiv paper about inductive biases that I’m supposed to be updating on, and I’ll tell you how much more terrified I am (above my baseline of “already pretty terrified, actually”).
Thanks: my comments about “simplest goals” were implicitly assuming deep nets are more speed-prior-like than Solomonoff-like, and I should have been explicit about that. I need to think more about the deceptive policies already present before we start talking.
The argument is that for all measures that seem reasonable or that we can reasonably well-define, it seems like AI cares about things that are different than human values. Or in other words “most goals are misaligned”.
I think it would be clarifying to try to talk about something other than “human values.” On the If Anyone Builds It resource page for the frequently asked question “Why would an AI steer toward anything other than what it was trained to steer toward?”, the subheader answers, “Because there are many ways to perform well in training.” That’s the class of argument that I understand Pope and Belrose to be critiquing: that deep learning systems can’t be given goals, because all sorts of goals could behave as desired on the training distribution, and almost all of them would do something different outside the training distribution. (Note that “human values” does not appear in the question.)
But the empirical situation doesn’t seem as dire as that kind of “counting argument” suggests. In Langosco et al. 2023’s “Goal Misgeneralization in Deep Reinforcement Learning”, an RL policy trained to collect a coin that was always at the right edge of a video game level learned to go right rather than collect the coin—but randomizing the position of the coin in just a couple percent of training episodes fixed the problem.
It’s not a priori obvious that would work! You could imagine that the policy would learn some crazy thing that happened to match the randomized examples, but didn’t generalize to coin-seeking. Indeed, there are astronomically many such crazy things—but they would seem to be more complicated than the intended generalization of coin-seeking. The counting argument of the form, “You don’t get what you train for, because there are many ways to perform well in training” doesn’t seem like the correct way to reason about neural network behavior (which does not mean that alignment is easy or that the humans will survive).
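(As a toy illustration of why a couple percent of disambiguating data can matter so much, here’s a sketch of my own construction, not Langosco et al.’s setup: a linear classifier trained on data where a “spurious” feature agrees with the “intended” feature in all but a small fraction of examples. With no decorrelated examples, training can’t tell the two rules apart; with 2 percent decorrelated, it settles on the intended rule and generalizes fine when the correlation is broken.)

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5_000

def make_data(frac_decorrelated):
    """Labels follow the 'intended' feature; the 'spurious' feature copies it,
    except in a given fraction of examples where it is re-randomized."""
    intended = rng.choice([-1.0, 1.0], size=N)
    spurious = intended.copy()
    flip = rng.random(N) < frac_decorrelated
    spurious[flip] = rng.choice([-1.0, 1.0], size=flip.sum())
    X = np.column_stack([intended, spurious])
    y = (intended > 0).astype(float)
    return X, y

def train_logistic_regression(X, y, lr=0.5, steps=10_000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Out-of-distribution test set: the spurious feature carries no information.
X_test, y_test = make_data(frac_decorrelated=1.0)

for frac in (0.0, 0.02):
    w = train_logistic_regression(*make_data(frac))
    p_test = 1.0 / (1.0 + np.exp(-X_test @ w))
    accuracy = ((p_test > 0.5) == (y_test > 0.5)).mean()
    print(f"decorrelated fraction {frac:.2f}: weights {np.round(w, 2)}, "
          f"out-of-distribution accuracy {accuracy:.3f}")
```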
Now you have a counting argument. The counting argument predicts all the things usual counting arguments predict.
I guess I’m not sure what you mean by “counting argument.” I understand the phrase to refer to inferences of the form, “Almost all things are wrong, so if you pick a thing, it’s almost certainly going to be wrong.” For example, most lottery tickets aren’t winners, therefore your ticket won’t win the lottery.
But the counting works in the lottery example because there’s an implied uniform prior: every ticket has the same probability as any other. If we’re weighing things by simplicity, what work is being done by counting, as such?
Suppose a biased coin is flipped 1000 times, and it comes up 600 Heads and 400 Tails. If someone made a counting argument of the form, “Almost all guesses at what the coin’s bias could be are wrong, so if you guess, you’re almost certainly going to be wrong”, that would be wrong: by the asymptotic equipartition property, I can actually be very confident that the bias is close to 0.6 in favor of Heads. You can’t just count the number of things when some of them are much more probable than others. (And if any correct probabilistic argument is a “counting argument”, that’s a really confusing choice of terminology.)
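(To put numbers on that, assuming a uniform prior over the coin’s bias, the posterior after 600 Heads and 400 Tails is Beta(601, 401), and nearly all of its mass lies within ±0.05 of 0.6, even though “almost all” of the possible bias values in [0, 1] lie outside that interval:)

```python
from scipy.stats import beta

heads, tails = 600, 400
# With a uniform prior over the coin's bias, the posterior after observing
# 600 Heads and 400 Tails is Beta(heads + 1, tails + 1).
posterior = beta(heads + 1, tails + 1)

print("posterior mean:", round(posterior.mean(), 3))            # 0.6
print("P(0.55 < bias < 0.65):",
      round(posterior.cdf(0.65) - posterior.cdf(0.55), 4))      # ~0.999
```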
An objection I didn’t have time for in the above piece is something like “but what about Occam, though, and k-complexity? Won’t you most likely get the simple, boring, black shape, if you constrain it as in the above?”
This is why I’m concerned about deleterious effects of writing for the outgroup: I’m worried you end up optimizing your thinking for coming up with eloquent allegories to convey your intuitions to a mass audience, and end up not having time for the actual, non-allegorical explanation that would convince subject-matter experts (whose support would be awfully helpful in the desperate push for a Pause treaty).
I think we have a lot of intriguing theory and evidence pointing to a story where the reason neural networks generalize is that the parameter-to-function mapping is not a one-to-one correspondence and is biased towards simple functions (as Occam and Solomonoff demand): to a first approximation, SGD finds the simplest function that fits the training data, because simple functions correspond to large “basins” of approximately equal loss that are easy for SGD to find (they use fewer parameters, or are more robust to some parameters being wrong), even though the network architecture is capable of representing astronomically many other functions that also fit the training data but have more complicated behavior elsewhere.
But if that story is correct, then “But what about Occam” isn’t something you can offhandedly address as an afterthought to an allegory about how misalignment is the default because there are astronomically many functions that fit the training data. Whether the simplest function is misaligned (as posited by List of Lethalities #20) is the thing you have to explain!
We do not have the benefit that a breeding program done on dogs or humans has, of having already “pinned down” a core creature with known core traits and variation being laid down in a fairly predictable manner. There’s only so far you can “stretch,” if you’re taking single steps at a time from the starting point of “dog” or “human.”
But you must realize that this sounds remarkably like the safety case for the current AI paradigm of LLMs + RLHF/RLAIF/RLVR! That is, the reason some people think that current-paradigm AI looks relatively safe is that they think the capabilities of LLMs come from approximating the pretraining distribution, and that RLHF/RLAIF/RLVR merely elicits those capabilities better by upweighting the rewarded trajectories (as evidenced by base models outperforming RL-trained models in pass@k evaluations for k in the hundreds or thousands) rather than discovering new “alien” capabilities from scratch.
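(For readers unfamiliar with the metric: pass@k is the probability that at least one of k independent samples solves a given problem. The standard unbiased estimator from n samples of which c passed, popularized by the Codex paper, is 1 - C(n - c, k)/C(n, k); the numbers below are made up for illustration.)

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimator of pass@k for one problem, given n samples of which
    c were successful: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up numbers: a model that solves a problem on only ~2% of individual
# samples still very probably solves it at least once given hundreds of tries.
print(pass_at_k(n=1000, c=20, k=1))    # ~0.02
print(pass_at_k(n=1000, c=20, k=500))  # ~1.0
```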
If anything, the alignment case for SGD looks a lot better than that for selective breeding, because we get to specify as many billions and billions of input–output pairs for our network to approximate as we want (with the misalignment risk being that, as you say, if we don’t know how to choose the right data, the network might not generalize the way we want). Imagine trying to breed a dog to speak perfect English the way LLMs do!
There’s a lot I don’t like about this post (trying to do away with the principle of indifference, or with talk of goals, is terrible greedy reductionism), but the core point, that goal counting arguments of the form “many goals could perform well on the training set, so you’ll probably get the wrong one” seem to falsely imply that neural networks shouldn’t generalize (because many functions could perform well on the training set), seems so important and underappreciated that I might have to give it a very grudging +9. (Update: downgraded to +1 in light of Steven Byrnes’s comment.)
In the comments, Evan Hubinger and John Wentworth argue for a corrected goal-counting argument, but in both cases, I don’t get it: it seems to me that simpler goals should be favored, rather than choosing randomly from a large space of equally probable goals. (This doesn’t solve the alignment problem, because the simplest generalization need not be the one we want, per “List of Lethalities” #20.)
I am painfully aware that the problem might be on my end. (I’m saying “I don’t get it”, not “This is wrong.”) Could someone help me out here? What does the correct goal-counting argument look like?
both counting arguments involve an inference from “there are ‘more’ networks with property X” to “networks are likely to have property X.”
This is still correct, even though the “models should overfit” conclusion is false, because simpler functions have more networks (parameterizations) corresponding to them.
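(You can see this directly with a brute-force toy count; the tiny architecture and three-valued parameter grid below are arbitrary choices of mine. Enumerating every parameter setting of a small ReLU network and grouping the settings by the input–output behavior they implement, the behaviors realized by the most parameter settings are the simple, boring ones, and the wiggly ones are rare.)

```python
import itertools
from collections import Counter

INPUTS = [-2.0, -1.0, 0.0, 1.0, 2.0]
GRID = [-1.0, 0.0, 1.0]   # each of the 7 parameters restricted to three values

def network(params, x):
    """Tiny 1-input, 2-hidden-ReLU-unit, 1-output network, binarized at zero."""
    w1, b1, w2, b2, v1, v2, c = params
    h1 = max(0.0, w1 * x + b1)
    h2 = max(0.0, w2 * x + b2)
    return int(v1 * h1 + v2 * h2 + c >= 0.0)

behavior_counts = Counter()
for params in itertools.product(GRID, repeat=7):
    behavior = tuple(network(params, x) for x in INPUTS)
    behavior_counts[behavior] += 1

print("most common behaviors (outputs on the 5 inputs, # of parameter settings):")
for behavior, count in behavior_counts.most_common(3):
    print(" ", behavior, count)
print("least common behaviors:")
for behavior, count in behavior_counts.most_common()[-3:]:
    print(" ", behavior, count)
```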
OK, but is there a version of the MIRI position, more recent than 2022, that’s not written for the outgroup?
I’m guessing MIRI’s answer is probably something like, “No, and that’s fine, because there hasn’t been any relevant new evidence since 2022”?
But if you’re trying to make the strongest case, I don’t think the state of debate in 2022 ever got its four layers.
Take, say, Paul Christiano’s 2022 “Where I Agree and Disagree With Eliezer”, disagreement #18:
I think that natural selection is a relatively weak analogy for ML training. The most important disanalogy is that we can deliberately shape ML training. Animal breeding would be a better analogy, and seems to suggest a different and much more tentative conclusion. For example, if humans were being actively bred for corrigibility and friendliness, it looks to me like they would quite likely be corrigible and friendly up through the current distribution of human behavior. If that breeding process was continuously being run carefully by the smartest of the currently-friendly humans, it seems like it would plausibly break down at a level very far beyond current human abilities.
If Christiano is right, that seems like a huge blow to the argumentative structure of If Anyone Builds It. You have a whole chapter in your book denying this.
What is MIRI’s response to the “but what about selective breeding” objection? I still don’t know! (Yudkowsky affirmed in the comment section that Christiano’s post as a whole was a solid contribution.) Is there just no response? I’m not seeing anything in the chapter 4 resources.
If there’s no response, then why not? Did you just not get around to it, and this will be addressed now that I’ve left this comment bringing it to your attention?
examples with −25 karma but +25 agree/disagree points?
At press time, Warty’s comment on “The Tale of the Top-Tier Intellect” is at −24/+24 (in 28 and 21 votes, respectively).
I don’t necessarily intend to hypothesize deep nonconsent as a terminal preference [...] deep-in-the-sense-of-appearing-in-lots-of-places preference
I think you should have chosen a different word than deep (“Inner, underlying, true; relating to one’s inner or private being rather than what is visible on the surface.”).
“Pervasive”, “recurrent”, “systematic” …?
As alluded to by the name of the website, part of Solomonoff/MDL is that there doesn’t necessarily have to be a unique “correct” explanation: theories are better to the extent that their predictions pay for their complexity. It’s not that compact generators are necessarily “true”; it’s that if a compact generator is yielding OK predictions, then more complex theories need to be making better predictions to single themselves out. You shouldn’t say that looking for compact generators of a complex phenomenon is asking to be wrong unless you have a way to be less wrong.
Thanks, it looks like I accidentally typed “connected” instead of “closed”; fixed.
This is well-executed satire; the author should be proud.
That said, it doesn’t belong among the top 50 posts of 2024, because this is not a satire website. Compare to Garfinkel et al.’s “On the Impossibility of Supersized Machines”. It’s a cute joke paper. It’s fine for it to be on arXiv. But someone who voted for it in a review of the best papers of 2017 in arXiv’s cs.CY “Computers and Society” category would be confessing (or perhaps bragging) that the “Computers and Society” category is fake. Same thing with this website.
Thanks, I hate it