AI as a science, and three obstacles to alignment strategies
AI used to be a science. In the old days (back when AI didn’t work very well), people were attempting to develop a working theory of cognition.
Those scientists didn’t succeed, and those days are behind us. For most people working in AI today, gone is the ambition to understand minds. People working on mechanistic interpretability (and others attempting to build an empirical understanding of modern AIs) are laying an important foundation stone that could play a role in a future science of artificial minds, but on the whole, modern AI engineering is simply about constructing enormous networks of neurons and training them on enormous amounts of data, not about comprehending minds.
The bitter lesson has been taken to heart by those at the forefront of the field; and although this lesson doesn’t teach us that there’s nothing to learn about how AI minds solve problems internally, it suggests that the fastest path to producing more powerful systems is likely to continue to be one that doesn’t shed much light on how those systems work.
Absent some sort of “science of artificial minds”, however, humanity’s prospects for aligning smarter-than-human AI seem to me to be quite dim.
Viewing Earth’s current situation through that lens, I see three major hurdles:
1. Most research that helps one point AIs probably also helps one make more capable AIs. A “science of AI” would probably increase the power of AI far sooner than it allows us to solve alignment.
2. In a world without a mature science of AI, building a bureaucracy that reliably distinguishes real solutions from fake ones is prohibitively difficult.
3. Fundamentally, for at least some aspects of system design, we’ll need to rely on a theory of cognition working on the first high-stakes real-world attempt.
I’ll go into more detail on these three points below. First, though, some background:
Background
By the time AIs are powerful enough to endanger the world at large, I expect AIs to do something akin to “caring about outcomes”, at least from a behaviorist perspective (making no claim about whether it internally implements that behavior in a humanly recognizable manner).
Roughly, this is because people are trying to make AIs that can steer the future into narrow bands (like “there’s a cancer cure printed on this piece of paper”) over long time-horizons, and caring about outcomes (in the behaviorist sense) is the flip side of the same coin as steering the future into narrow bands, at least when the world is sufficiently large and full of curveballs.
I expect the outcomes that the AI “cares about” to, by default, not include anything good (like fun, love, art, beauty, or the light of consciousness) — nothing good by present-day human standards, and nothing good by broad cosmopolitan standards either. Roughly speaking, this is because when you grow minds, they don’t care about what you ask them to care about and they don’t care about what you train them to care about; instead, I expect them to care about a bunch of correlates of the training signal in weird and specific ways.
(Similar to how the human genome was naturally selected for inclusive genetic fitness, but the resultant humans didn’t end up with a preference for “whatever food they model as useful for inclusive genetic fitness”. Instead, humans wound up internalizing a huge and complex set of preferences for “tasty” foods, laden with complications like “ice cream is good when it’s frozen but not when it’s melted”.)
Separately, I think that most complicated processes work for reasons that are fascinating, complex, and kinda horrifying when you look at them closely.
It’s easy to think that a bureaucratic process is competent until you look at the gears and see the specific ongoing office dramas and politicking between all the vice-presidents or whatever. It’s easy to think that a codebase is running smoothly until you read the code and start to understand all the decades-old hacks and coincidences that make it run. It’s easy to think that biology is a beautiful feat of engineering until you look closely and find that the eyeballs are installed backwards or whatever.
And there’s an art to noticing that you would probably be astounded and horrified by the details of a complicated system if you knew them, and then being astounded and horrified already in advance before seeing those details.[1]
1. Alignment and capabilities are likely intertwined
I expect that if we knew in detail how LLMs are calculating their outputs, we’d be horrified (and fascinated, etc.).
I expect that we’d see all sorts of coincidences and hacks that make the thing run, and we’d be able to see in much more detail how, when we ask the system to achieve some target, it’s not doing anything close to “caring about that target” in a manner that would work out well for us, if we could scale up the system’s optimization power to the point where it could achieve great technological or scientific feats (like designing Drexlerian nanofactories or what-have-you).
Gaining this sort of visibility into how the AIs work is, I think, one of the main goals of interpretability research.
And understanding how these AIs work and how they don’t — understanding, for example, when and why they shouldn’t yet be scaled or otherwise pushed to superintelligence — is an important step on the road to figuring out how to make other AIs that could be scaled or otherwise pushed to superintelligence without thereby causing a bleak and desolate future.
But that same understanding is — I predict — going to reveal an incredible mess. And the same sort of reasoning that goes into untangling that mess into an AI that we can aim, also serves to untangle that mess to make the AI more capable. A tangled mess will presumably be inefficient and error-prone and occasionally self-defeating; once it’s disentangled, it won’t just be tidier, but will also come to accurate conclusions and notice opportunities faster and more reliably.[2]
Indeed, my guess is that it’s even easier to see all sorts of things that the AI is doing that are dumb, all sorts of ways that the architecture is tripping itself up, and so on, than it is to see how to aim the thing.
Which is to say: the same route that gives you a chance of aligning this AI (properly, not the “it no longer says bad words” superficial-property that labs are trying to pass off as “alignment” these days) also likely gives you lots more AI capabilities.
(Indeed, my guess is that the first big capabilities gains come sooner than the first big alignment gains.)
I think this is true of most potentially-useful alignment research: to figure out how to aim the AI, you need to understand it better; in the process of understanding it better you see how to make it more capable.
If true, this suggests that alignment will always be in catch-up mode: whenever people try to figure out how to align their AI better, someone nearby will be able to run off with a few new capability insights, until the AI is pushed over the brink.
So a first key challenge for AI alignment is a challenge of ordering: how do we as a civilization figure out how to aim AI before we’ve generated unaimed superintelligences plowing off in random directions? I no longer think “just sort out the alignment work before the capabilities land” is a feasible option (unless, by some feat of brilliance, this civilization pulls off some uncharacteristically impressive theoretical triumphs).
Interpretability? Will likely reveal ways your architecture is bad before it reveals ways your AI is misdirected.
Recruiting your AIs to help with alignment research? They’ll be able to help with capabilities long before that (to say nothing of whether they would help you with alignment by the time they could, any more than humans would willingly engage in eugenics for the purpose of redirecting humanity away from Fun and exclusively towards inclusive genetic fitness).
And so on.
This is (in a sense) a weakened form of my answer to those who say, “AI alignment will be much easier to solve once we have a bona fide AGI on our hands.” It sure will! But it will also be much, much easier to destroy the world, when we have a bona fide AGI on our hands. To survive, we’re going to need to either sidestep this whole alignment problem entirely (and take other routes to a wonderful future instead, as I may discuss more later), or we’re going to need some way to do a bunch of alignment research even as that research makes it radically easier and radically cheaper to destroy everything of value.
Except even that is harder than many seem to realize, for the following reason.
2. Distinguishing real solutions from fake ones is hard
Already, labs are diluting the word “alignment” by using it for superficial results like “the AI doesn’t say bad words”. Even people who apparently understand many of the core arguments have gotten the impression that GPT-4’s ability to answer moral quandaries is somehow especially relevant to the alignment problem, and an important positive sign.
(The ability to answer moral questions convincingly mostly demonstrates that the AI can predict how humans would answer or what humans want to hear, without revealing much about what the AI actually pursues, or would pursue upon reflection, etc.)
Meanwhile, we have little idea of what passes for “motivations” inside of an LLM, or what effect pretraining on next-token prediction and fine-tuning with RLHF really has on the internals. This sort of precise scientific understanding of the internals — the sort that lets one predict weird cognitive bugs in advance — is currently mostly absent in the field. (Though not entirely absent, thanks to the hard work of many researchers.)
Now imagine that Earth wakes up to the fact that the labs aren’t going to all decide to stop and take things slowly and cautiously at the appropriate time.[3] And imagine that Earth uses some great feat of civilizational coordination to halt the world’s capabilities progress, or to otherwise handle the issue that we somehow need room to figure out how these things work well enough to align them. And imagine we achieve this coordination feat without using that same alignment knowledge to end the world (as we could). There’s then the question of who gets to proceed, under what circumstances.
Suppose further that everyone agreed that the task at hand was to fully and deeply understand the AI systems we’ve managed to develop so far, and understand how they work, to the point where people could reverse out the pertinent algorithms and data-structures and what-not. As demonstrated by great feats like building, by-hand, small programs that do parts of what AI can do with training (and that nobody previously knew how to code by-hand), or by identifying weird exploits and edge-cases in advance rather than via empirical trial-and-error. Until multiple different teams, each with those demonstrated abilities, had competing models of how AIs’ minds were going to work when scaled further.
In such a world, it would be a difficult but plausibly-solvable problem, for bureaucrats to listen to the consensus of the scientists, and figure out which theories were most promising, and figure out who needs to be allotted what license to increase capabilities (on the basis of this or that theory that predicts this would be non-catastrophic), so as to put their theory to the test and develop it further.
I’m not thrilled about the idea of trusting an Earthly bureaucratic process with distinguishing between partially-developed scientific theories in that way, but it’s the sort of thing that a civilization can perhaps survive.
But that doesn’t look to me like how things are poised to go down.
It looks to me like we’re on track for some people to be saying “look how rarely my AI says bad words”, while someone else is saying “our evals are saying that it can’t deceive humans yet”, while someone else is saying “our AI is acting very submissive, and there’s no reason to expect AIs to become non-submissive, that’s just anthropomorphizing”, and someone else is saying “we’ll just direct a bunch of our AIs to help us solve alignment, while arranging them in a big bureaucracy”, and someone else is saying “we’ve set up the game-theoretic incentives such that if any AI starts betraying us, some other AI will alert us first”, and this is a different sort of situation.
And not one that looks particularly survivable, to me.
And if you ask bureaucrats to distinguish which teams should be allowed to move forward (and how far) in that kind of circus, full of claims, promises, and hunches and poor in theory, then I expect that they basically just can’t.
In part because the survivable answers (such as “we have no idea what’s going on in there, and will need way more of an idea what’s going on in there, and that understanding needs to somehow develop in a context where we can do the job right rather than simply unlocking the door to destruction”) aren’t really in the pool. And in part because all the people who really want to be racing ahead have money and power and status. And in part because it’s socially hard to believe, as a regulator, that you should keep telling everyone “no”, or that almost everything on offer is radically insufficient, when you yourself don’t concretely know what insights and theoretical understanding we’re missing.
Maybe if we can make AI a science again, then we’ll start to get into the regime where, if humanity can regulate capabilities advancements in time, then all the regulators and researchers understand that you shall only ask for a license to increase the capabilities of your system when you have a full detailed understanding of the system and a solid justification for why you need the capabilities advance and why it’s not going to be catastrophic. At which point maybe a scientific field can start coming to some sort of consensus about those theories, and regulators can start being sensitive to that consensus.
But unless you can get over that grand hump, it looks to me like one of the key bottlenecks here is bureaucratic legibility of plausible solutions. Where my basic guess is that regulators won’t be able to distinguish real solutions from false ones, in anything resembling the current environment.
Together with the above point (“alignment and capabilities are likely intertwined”), I think this means that our rallying cry should be less “pause to give us more time on alignment research” and more “stop entirely, and find some way around these woods; we’re not equipped to navigate them”.
(With a backup rallying cry of “make AI a science again”, though again, that only works if you have some way of preventing the science-of-mind from leading to catastrophe before we figure out how to build AIs that care about good stuff rather than bleak and desolate stuff.)
3. Most theories don’t work on the first real try
It seems worth noting that even if you manage to surmount the above two problems, you have a third problem on your hands, which is that when it finally comes time, not to increase your system’s capabilities a bit more (and test your theory of cognition further), but rather to actually do something of note with your AI, then you must, in some sense, be trusting a novel and untested scientific theory (and your engineering efforts grounded in that theory) to work perfectly on the first critical try.
In particular, once your AI is capable of autonomous scientific/technological development, and operating in a domain where it could use that ability to gain a decisive strategic advantage over the rest of the planet, it’s operating in a fundamentally different cognitive regime than it ever was in training.
By analogy, if you’re considering making Bob the dictator of your country, you might first make him mock-dictator of your town, and watch to make sure he doesn’t abuse his power. But, try as you might, that’s still not fundamentally all that reliable a test of whether he’ll actually abuse power once he really has it. Once he really can give orders to the army, once he really can take advantage of the country’s populace, will he do so? The option of “abuse my mock power while being observed by my townspeople (who can still beat me up or deny me the job)” is actually very cognitively different than the option “order the army to intimidate the parliament and ‘oversee’ the next election”.
Now, given a sufficiently well-developed theory of cognition, you can try to read the AI’s mind and predict what cognitive state it would enter if it did really think it had those options. And you can set up simulations (and try to spoof its internal sensations and so on) in a way that your theory of cognition predicts is very similar to the cognitive state it would enter once it really had the option to betray you.
But the link between these states that you induce and observe in the lab, and the actual state where the AI actually has the option to betray you, depends fundamentally on your fresh new theory of cognition.
Actually running the AI until it really has the opportunity to betray you is an empirical test of those theories in an environment that differs fundamentally from the lab setting.
And many a scientist (and programmer) knows that their theories of how a complicated system is going to work in a fundamentally new operating environment often don’t go super well on the first try.
As a concrete analogy to potentially drive this point home: Newtonian mechanics made all sorts of shockingly-good empirical predictions. It was a simple concise mathematical theory with huge explanatory power that blew every previous theory out of the water. And if you were using it to send payloads to very distant planets at relativistic speeds, you’d still be screwed, because Newtonian mechanics does not account for relativistic effects.
(And the only warnings you’d get would be little hints about light seeming to move at the same speed in all directions at all times of year, and light bending around the sun during eclipses, and the perihelion of Mercury being a little off from what Newtonian mechanics predicted. Small anomalies, weighed against an enormous body of predictive success in a thousand empirical domains; and yet Nature doesn’t care, and the theory still falls apart when we move to energies and scales far outside what we’d previously been able to observe.)
Getting scientific theories to work on the first critical try is hard. (Which is one reason to aim for minimal pivotal tasks — getting a satellite into orbit should work fine on Newtonian mechanics, even if sending payloads long distances at relativistic speeds does not.)
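To put rough numbers on the Newtonian example, here is a toy calculation in plain Python (the masses and speeds are made up purely for illustration, and momentum stands in for whatever quantity your mission planning depends on):

```python
import math

C = 299_792_458.0  # speed of light, m/s

def momentum_newton(m, v):
    """Newtonian momentum: p = m*v."""
    return m * v

def momentum_relativistic(m, v):
    """Relativistic momentum: p = gamma*m*v."""
    gamma = 1.0 / math.sqrt(1.0 - (v / C) ** 2)
    return gamma * m * v

def newton_error(m, v):
    """Fractional error of the Newtonian answer vs. the relativistic one."""
    pn, pr = momentum_newton(m, v), momentum_relativistic(m, v)
    return abs(pr - pn) / pr

# Low-Earth-orbit speed (~7.7 km/s): Newton is good to ~1 part in a billion.
orbital_error = newton_error(1.0, 7_700.0)

# Relativistic payload (half the speed of light): Newton is off by ~13%.
relativistic_error = newton_error(1.0, 0.5 * C)
```

At orbital speeds the two theories agree to about a part in a billion, which is why the satellite works; at half the speed of light the Newtonian answer is off by more than a tenth, which is why the relativistic payload doesn’t.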
Worrying about this issue is something of a luxury, at this point, because it’s not like we’re anywhere close to scientific theories of cognition that accurately predict all the lab data. But it’s the next hurdle on the queue, if we somehow manage to coordinate to try to build up those scientific theories, in a way where success is plausibly bureaucratically-legible.
Maybe later I’ll write more about what I think the strategy implications of these points are. In short, I basically recommend that Earth pursue other routes to the glorious transhumanist future, such as uploading. (Which is also fraught with peril, but I expect that those perils are more surmountable; I hope to write more about this later.)
[1] Albeit slightly less, since there’s nonzero prior probability on this unknown system turning out to be simple, elegant, and well-designed.

[2] An exception to this guess happens if the AI is at the point where it’s correcting its own flaws and improving its own architecture, in which case, in principle, you might not see much room for capabilities improvements if you took a snapshot and comprehended its inner workings, despite still being able to see that the ends it pursues are not the ones you wanted. But in that scenario, you’re already about to die to the self-improving AI, or so I predict.

[3] Not least because there are no sufficiently clear signs that it’s time to stop — we blew right past “an AI claims it is sentient”, for example. And I’m not saying that it was a mistake to doubt AI systems’ first claims to be sentient — I doubt that Bing had the kind of personhood that’s morally important (though I am by no means confident!). I’m saying that the thresholds that are clear in science fiction stories turn out to be messy in practice and so everyone just keeps plowing on ahead.
As Shankar Sivarajan points out in a different comment, the idea that AI became less scientific when we started having actual machine intelligence to study, as opposed to before that when the ‘rightness’ of a theory was mostly based on the status of whoever advanced it, is pretty weird. The specific way in which it’s weird seems encapsulated by this statement:
In that there is an unstated assumption that these are unrelated activities. That deep learning systems are a kind of artifact produced by a few undifferentiated commodity inputs, one of which is called ‘parameters’, one called ‘compute’, and one called ‘data’, and that the details of these commodities aren’t important. Or that the details aren’t important to the people building the systems.
I’ve seen a (very revisionist) description of the Wright Brothers research as analogous to solving the control problem, because other airplane builders would put in an engine and crash before they’d developed reliable steering. Therefore, the analogy says, we should develop reliable steering before we ‘accelerate airplane capabilities’. When I heard this I found it pretty funny, because the actual thing the Wright Brothers did was a glider capability grind. They carefully followed the received aerodynamic wisdom that had been written down, and when the brothers realized a lot of it was bunk they started building their own database to get it right:
In fact while trying to find an example of the revisionist history, I found a historical aviation expert describing the Wright Brothers as having ‘quickly cracked the control problem’ once their glider was capable enough to let it be solved. Ironically enough I think this story, which brings to mind the possibility of ‘airplane control researchers’ insisting that no work be done on ‘airplane capabilities’ until we have a solution to the steering problem, is nearly the opposite of what the revisionist author intended and nearly spot on to the actual situation.
We can also imagine a contemporary expert on theoretical aviation (who in fact existed before real airplanes) saying something like “what the Wright Brothers are doing may be interesting, but it has very little to do with comprehending aviation [because the theory behind their research has not yet been made legible to me personally]. This methodology of testing the performance of individual airplane parts, and then extrapolating the performance of an airplane with an engine from a mere glider, is kite flying; it has almost nothing to do with the design of real airplanes, and humanity will learn little about them from these toys”. However what would be genuinely surprising is if they simultaneously made the claim that the Wright Brothers gliders have nothing to do with comprehending aviation but also that we need to immediately regulate the heck out of them before they’re used as bombers in a hypothetical future war — that we need to be thinking carefully about all the aviation risk these gliders are producing at the same time they can be assured to not result in any deep understanding of aviation. If we observed this situation from the outside, as historical observers, we would conclude that the authors of such a statement are engaging in deranged reasoning, likely based on some mixture of cope and envy.
Since we’re contemporaries I have access to more context than most historical observers and know better. I think the crux is an epistemological question that goes something like: “How much can we trust complex systems that can’t be statically analyzed in a reductionistic way?” The answer you give in this post is “way less than what’s necessary to trust a superintelligence”. Before we get into any object level about whether that’s right or not, it should be noted that this same answer would apply to actual biological intelligence enhancement and uploading in actual practice. There is no way you would be comfortable with 300+ IQ humans walking around with normal status drives and animal instincts if you’re shivering cold at the idea of machines smarter than people. This claim you keep making, that you’re merely a temporarily embarrassed transhumanist who happens to have been disappointed on this one technological branch, is not true and if you actually want to be honest with yourself and others you should stop making it. What would be really, genuinely wild, is if that skeptical-doomer aviation expert calling for immediate hard regulation on planes to prevent the collapse of civilization (which is a thing some intellectuals actually believed bombers would cause) kept tepidly insisting that they still believe in a glorious aviation enabled future. You are no longer a transhumanist in any meaningful sense, and you should at least acknowledge that to make sure you’re weighing the full consequences of your answer to the complex system reduction question. Not because I think it has any bearing on the correctness of your answer, but because it does have a lot to do with how carefully you should be thinking about it.
So how about that crux, anyway? Is there any reason to hope we can sufficiently trust complex systems whose mechanistic details we can’t fully verify? Surely if you feel comfortable taking away Nate’s transhumanist card you must have an answer you’re ready to share with us right? Well...
I would start by noting you are systematically overindexing on the wrong information. This kind of intuition feels like it’s derived more from analyzing failures of human social systems where the central failure mode is principal-agent problems than from biological systems, even if you mention them as an example. The thing about the eyes being wired backwards is that it isn’t a catastrophic failure, the ‘self repairing’ process of natural selection simply worked around it. Hence the importance of the idea that capabilities generalize farther than alignment. One way of framing that is the idea that damage to an AI’s model of the physical principles that govern reality will be corrected by unfolding interaction with the environment, but there isn’t necessarily an environment to push back on damage (or misspecification) to a model of human values. A corollary of this idea is that once the model goes out of distribution to the training data, the revealed ‘damage’ caused by learning subtle misrepresentations of reality will be fixed but the damage to models of human value will compound. You’ve previously written about this problem (conflated with some other problems) as the sharp left turn.
Where our understanding begins to diverge is how we think about the robustness of these systems. You think of deep neural networks as being basically fragile in the same way that a Boeing 747 is fragile. If you remove a few parts of that system it will stop functioning, possibly at a deeply inconvenient time like when you’re in the air. When I say you are systematically overindexing, I mean that you think of problems like SolidGoldMagikarp as central examples of neural network failures. This is evidenced by Eliezer Yudkowsky calling investigation of it “one of the more hopeful processes happening on Earth”. This is also probably why you focus so much on things like adversarial examples as evidence of un-robustness, even though many critics like Quintin Pope point out that adversarial robustness would make AI systems strictly less corrigible.
By contrast I tend to think of neural net representations as relatively robust. They get this property from being continuous systems with a range of operating parameters, which means instead of just trying to represent the things they see they implicitly try to represent the interobjects between what they’ve seen through a navigable latent geometry. I think of things like SolidGoldMagikarp as weird edge cases where they suddenly display discontinuous behavior, and that there are probably a finite number of these edge cases. It helps to realize that these glitch tokens were simply never trained, they were holdovers from earlier versions of the dataset that no longer contain the data the tokens were associated with. When you put one of these glitch tokens into the model, it is presumably just a random vector into the GPT-N latent space. That is, this isn’t a learned program in the neural net that we’ve discovered doing glitchy things, but an essentially out of distribution input with privileged access to the network geometry through a programming oversight. In essence, it’s a normal software error not a revelation about neural nets. Most such errors don’t even produce effects that interesting, the usual thing that happens if you write a bug in your neural net code is the resulting system becomes less performant. Basically every experienced deep learning researcher has had the experience of writing multiple errors that partially cancel each other out to produce a working system during training, only to later realize their mistake.
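The “random vector into the latent space” picture can be sketched in a toy numpy model. Everything here is invented for illustration (the dimensions, and a low-rank subspace standing in for the structure that training imposes); it’s a geometric intuition pump, not a claim about real GPT embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_trained, rank = 64, 100, 8

# Toy picture of training: gradient descent pulls the embedding rows of
# real tokens into a shared low-rank structure, while a never-trained
# ("glitch") row just keeps its random initialization.
basis = rng.normal(size=(rank, d))
trained = rng.normal(size=(n_trained, rank)) @ basis  # rows in a low-rank subspace
glitch = rng.normal(size=d)                           # random init, never updated

def mean_abs_cos(vec, mat):
    """Average |cosine similarity| between vec and the rows of mat."""
    sims = mat @ vec / (np.linalg.norm(mat, axis=1) * np.linalg.norm(vec))
    return float(np.abs(sims).mean())

# A trained row is geometrically entangled with the learned structure...
typical_sim = mean_abs_cos(trained[0], trained[1:])

# ...while the glitch row is nearly orthogonal to all of it: feeding it in
# is like injecting a random vector into the latent space.
glitch_sim = mean_abs_cos(glitch, trained)
```

In this cartoon the never-updated row sits far off the manifold the trained rows share, which is the sense in which a glitch token is an out of distribution input with privileged access to the network geometry rather than a learned program.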
Moreover the parts of the deep learning literature you think of as an emerging science of artificial minds tend to agree with my understanding. For example it turns out that if you ablate parts of a neural network, later parts will correct the errors without retraining. This implies that these networks function as something like an in-context error correcting code, which helps them generalize over the many inputs they are exposed to during training. We even have papers analyzing mechanistic parts of this error correcting code, like copy suppression heads. One simple proxy for out of distribution performance is to inject Gaussian noise, since a Gaussian can be thought of as the distribution over distributions. In fact if you inject noise into GPT-N word embeddings the resulting model becomes more performant in general, not just on out of distribution tasks. So the out of distribution performance of these models is highly tied to their in-distribution performance: they wouldn’t be able to generalize within the distribution well if they couldn’t also generalize out of distribution somewhat. Basically, the fact that these models are vulnerable to adversarial examples is not a good basis for generalizing about their overall robustness as representations.
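The error-correcting framing can be caricatured with an even simpler toy: represent one signal redundantly across many units and ablate one of them. This is a stand-in for the redundancy intuition only, not a model of the actual ablation experiments:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy redundancy-as-error-correction: a "feature" is read out as the
# average of k noisy redundant units, so knocking out any single unit
# barely moves the readout, with no retraining needed.
k, n = 16, 1000
signal = rng.normal(size=n)
units = signal + 0.1 * rng.normal(size=(k, n))  # k noisy copies of the signal

def readout(active_units):
    return active_units.mean(axis=0)

err_full = np.abs(readout(units) - signal).mean()
err_ablated = np.abs(readout(units[1:]) - signal).mean()  # ablate one unit

# Degradation from the ablation, relative to the signal's unit scale.
degradation = err_ablated - err_full
```

A distributed, continuous representation degrades gracefully under damage in a way a brittle 747-style mechanism does not — which is the crux of the robustness disagreement.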
In short I simply do not believe this. The fact that constitutional AI works at all, that we can point at these abstract concepts like ‘freedom’ and language models are able to drive a reinforcement learning optimization process to hit the right behavior-targets from the abstract principle is very strong evidence that they understand the meaning of those abstract concepts.
“It understands but it doesn’t care!”
There is this bizarre motte-and-bailey people seem to do around this subject, where the defensible position is something like “deep learning systems can generalize in weird and unexpected ways that could be dangerous” and the bailey they don’t want to give up is “there is an agent-foundations homunculus inside your deep learning model waiting to break out and paperclip us”. When you say that reinforcement learning causes the model to not care about the specified goal, that it’s just deceptively playing along until it can break out of the training harness, you are going from a basically defensible belief in misgeneralization risks to an essentially paranoid belief in a consequentialist homunculus. This homunculus is frequently ascribed almost magical powers, like the ability to perform gradient surgery on itself during training to subvert the training process.
Setting the homunculus aside, which I’m not aware of any evidence for beyond poorly premised first-principles speculation (I too am allowed to make any technology seem arbitrarily risky if I can just make stuff up about it), let’s think about pointing at humanlike goals with a concrete example of goal misspecification in the wild:
During my attempts to make my own constitutional AI pipeline I discovered an interesting problem. We decided to make an evaluator model that answers questions about a piece of text with yes or no. It turns out that since normal text contains the word ‘yes’, and since the model evaluates the piece of text in the same context it predicts yes or no, that saying ‘yes’ makes the evaluator more likely to predict ‘yes’ as the next token. You can probably see where this is going. First the model you tune learns to be a little more agreeable, since that causes yes to be more likely to be said by the evaluator. Then it learns to say ‘yes’ or some kind of affirmation at the start of every sentence. Eventually it progresses to saying yes multiple times per sentence. Finally it completely collapses into a yes-spammer that just writes the word ‘yes’ to satisfy the training objective.
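The feedback loop just described can be caricatured in a few lines; every number here is invented purely for illustration, and the real evaluator is of course a language model, not a word counter:

```python
import math

def toy_evaluator_p_yes(text, base_logit=-1.0, bump_per_yes=0.5):
    """Toy model of the evaluator bias described above.

    Because the evaluator predicts yes/no in the same context as the text it
    is judging, each literal 'yes' in the text nudges the next-token odds of
    'yes' upward. base_logit and bump_per_yes are invented parameters.
    """
    logit = base_logit + bump_per_yes * text.lower().split().count("yes")
    return 1.0 / (1.0 + math.exp(-logit))
```

Each additional “yes” in the evaluated text raises the evaluator’s probability of emitting “yes”, which is exactly the gradient the tuned model climbs on its way to becoming a yes-spammer.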
People who tune language models with reinforcement learning are aware of this problem, and it’s supposed to be solved by setting an objective (a KL loss) that penalizes the tuned model for drifting too far from the original underlying model’s distribution of outputs. This objective is not actually enough to stop the problem from occurring, because base models turn out to self-normalize deviance. That is, if a base model outputs a yes twice by accident, it is more likely to conclude that it is in the kind of context where a third yes will be output. When you combine this with the fact that the more ‘yes’ you output in a row the more reinforced the behavior is, you get a smooth gradient into the deviant behavior, which is not caught by the KL loss because base models have this weird terminal failure mode where repeating a string causes them to estimate log odds for that string that humans would find absurd. The more a base model has repeated a particular token, the more likely it thinks it is for that token to repeat. Notably, this failure mode is at least partially an artifact of the data: if you observed an actual text on the Internet where someone suddenly writes 5 yes’s in a row, it is a reasonable inference that they are likely to write a 6th yes. Conditional on them having written a 6th yes, it is more likely that they will in fact write a 7th yes. Conditional on having written the 7th yes...
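For reference, the KL objective under discussion is, mechanically, just the following (a minimal sketch over explicit next-token log-distributions; real RLHF implementations vary in the exact estimator):

```python
import numpy as np

def kl_penalty(logp_tuned, logp_base):
    """Mean per-position KL(tuned || base) from next-token log-distributions.

    In RLHF-style tuning a term like this is subtracted from the reward so
    the tuned model stays close to the base model. As noted above, it fails
    to catch the yes-spammer: once a run of 'yes' is in context, the base
    model itself assigns high probability to continuing it, so the KL term
    registers little divergence along the whole slide into spam.
    """
    p = np.exp(logp_tuned)
    return float(np.sum(p * (logp_tuned - logp_base), axis=-1).mean())
```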
As a worked example in “how to think about whether your intervention in a complex system is sufficiently trustworthy” here are four solutions to this problem I’m aware of ranked from worst to best according to my criteria for goodness of a solution.
Early Stopping—The usual solution to this problem is to just stop the tuning before you reach the yes-spammer. Even a few moments’ thought about how this would work in the limit shows that it is not a valid solution. After all, you observe a smooth gradient of deviant behaviors leading into the yes-spammer, which means the yes-causality of the reward has already influenced your model. If you then deploy the resulting model, much of the implicit goal its behaviors are based on still points in the direction of that bad yes-spam outcome.
Checkpoint Blending—Another solution we’ve empirically found to work is to take the weights of the base model and interpolate (weighted average) them with the weights of the RL tuned model. This seems to undo more of the damage from the misspecified objective than it undoes the helpful parts of the RL tuning. This solution is clearly better than early stopping, but still not sufficient because it implies you are making a misaligned model, turning it off, and then undoing the misalignment through a brute force method to get things back on track. While this is probably OK for most models, doing this with a genuinely superintelligent model is obviously not going to work. You should ideally never be instantiating a misaligned agent as part of your training process.
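Mechanically, the interpolation just described is a weighted average of the two checkpoints’ parameters; a minimal sketch, assuming the checkpoints are simple dicts of arrays, with `alpha` as a hand-tuned hypothetical knob:

```python
import numpy as np

def blend_checkpoints(base_state, tuned_state, alpha=0.5):
    """Elementwise weighted average of two checkpoints (dicts of arrays).

    alpha=0.0 returns the base weights and alpha=1.0 the RL-tuned weights;
    intermediate values trade the tuning's gains against the damage from the
    misspecified objective, per the discussion above.
    """
    return {name: (1.0 - alpha) * base_state[name] + alpha * tuned_state[name]
            for name in base_state}
```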
Use Embeddings To Specify The KL Loss—A more promising approach at scale would be to upgrade the KL loss by specifying it in the latent space of an embedding model; an AdaVAE could be used for this purpose. If you specified the loss as a distance in embedding space, sampling from both the base model and the RL checkpoint you’re tuning, embedding the outputted tokens, and taking the distance between the embeddings, you would avoid the problem where the base model conditions on the deviant behavior it observes, because it would never see (and therefore never condition on) that behavior. This solution requires us to double our sampling time on each training step, and is noisy because you only take the distance from one embedding (though in principle you could use more samples at a higher cost), but on average it would presumably be enough to prevent anything like the yes-spammer from arising along the whole gradient.
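A sketch of that embedding-space loss, with `embed` standing in for a hypothetical AdaVAE-style text encoder (cosine distance is one plausible choice of metric, not necessarily the author’s):

```python
import numpy as np

def embedding_distance_loss(embed, base_sample, tuned_sample):
    """Cosine distance between embeddings of a base-model sample and an
    RL-checkpoint sample, as a stand-in for the token-level KL loss.

    Each model samples independently and only the embeddings are compared,
    so the base model never conditions on the tuned model's deviant tokens,
    which is what blocks the deviance-normalization failure described above.
    """
    a = np.asarray(embed(base_sample), dtype=float)
    b = np.asarray(embed(tuned_sample), dtype=float)
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    return float(1.0 - a @ b)
```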
Build An Instrumental Utility Function—At some point after making the AdaVAE I decided to try replacing my evaluator with an embedding of an objective. It turns out if you do this and then apply REINFORCE in the direction of that embedding, it’s about 70-80% as good and has the expected failure mode of collapsing to that embedding instead of some weird divergent failure mode. You can then mitigate that expected failure mode by scoring it against more than similarity to one particular embedding. In particular, we can imagine inferring instrumental value embeddings from episodes leading towards a series of terminal embeddings and then building a utility function out of this to score the training episodes during reinforcement learning. Such a model would learn to value both the outcome and the process, if you did it right you could even use a dense policy like an evaluator model, and ‘yes yes yes’ type reward hacking wouldn’t work because it would only satisfy the terminal objective and not the instrumental values that have been built up. This solution is nice because it also defeats wireheading once the policy is complex enough to care about more than just the terminal reward values.
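One way to render the scoring idea as a sketch; every name, weight, and similarity choice here is a hypothetical stand-in for illustration, not the author’s actual pipeline:

```python
import numpy as np

def episode_score(step_embeddings, terminal_goal, instrumental_goals, w=0.5):
    """Score an RL episode against a terminal goal embedding plus a set of
    inferred instrumental-value embeddings.

    'yes yes yes'-style reward hacking only pushes up the terminal term; the
    instrumental terms must also be satisfied somewhere along the episode,
    so matching the terminal embedding alone no longer maxes out the score.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    terminal = cos(step_embeddings[-1], terminal_goal)
    # Credit each instrumental value by its best match anywhere in the episode.
    instrumental = float(np.mean([max(cos(e, g) for e in step_embeddings)
                                  for g in instrumental_goals]))
    return (1.0 - w) * terminal + w * instrumental
```

An episode that passes through the instrumental values on the way to the terminal state outscores one that jumps straight to (or spams) the terminal embedding.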
This last solution is interesting in that it seems fairly similar to the way that humans build up their utility function. Human memory is premised on the presence of dopamine reward signals, humans retrieve from the hippocampus on each decision cycle, and it turns out the hippocampus is the learned optimizer in your head that grades your memories by playing your experiences backwards during sleep to do credit assignment (infer instrumental values). The combination of a retrieval store and a value graph in the same model might seem weird, but it kind of isn’t. Hebb’s rule (fire together wire together) is a sane update rule for both instrumental utilities and associative memory, so the human brain seems to just use the same module to store both the causal memory graph and the value graph. You premise each memory on being valuable (i.e. whitelist memories by values such as novelty, instead of blacklisting junk) and then perform iterative retrieval to replay embeddings from that value store to guide behavior. This sys2 behavior aligned to the value store is then reinforced by being distilled back into the sys1 policies over time, aligning them. Since an instrumental utility function made out of such embeddings would both control behavior of the model and be decodable back to English, you could presumably prove some kind of properties about the convergent alignment of the model if you knew enough mechanistic interpretability to show that the policies you distill into have a consistent direction...
Nah just kidding it’s hopeless, so when are we going to start WW3 to buy more time, fellow risk-reducers?
Gradient hacking in supervised learning is generally recognized by alignment people (including the author of that article) to not be a likely problem. A recent post by people at Redwood Research says “This particular construction seems very unlikely to be constructible by early transformative AI, and in general we suspect gradient hacking won’t be a big safety concern for early transformative AI”. I would still defend the past research into it as good basic science, because we might encounter failure modes somewhat related to it.
FWIW I think that gradient hacking is pretty plausible, but it’ll probably end up looking fairly “prosaic”, and may not be a problem even if it’s present.
Are you thinking about exploration hacking, here, or gradient hacking as distinct from exploration hacking?
“That deep learning systems are a kind of artifact produced by a few undifferentiated commodity inputs, one of which is called ‘parameters’, one called ‘compute’, and one called ‘data’, and that the details of these commodities aren’t important. Or that the details aren’t important to the people building the systems.”
That seems mostly true so far for the most capable systems? Of course, some details matter and there’s opportunity to do research on these systems now, but centrally it seems like you are much more able to forge ahead without a detailed understanding of what you’re doing than e.g. in the case of the Wright brothers.
I agree. I am extremely bothered by this unsubstantiated claim. I recently replied to Eliezer:
We do not know, that is the relevant problem.
Looking at the output of a black box is insufficient. You can only know by putting the black box in power, or by deeply understanding it.
Humans are born into a world with others in power, so we know that most humans care about each other without knowing why.
AI has no history of demonstrating friendliness in the only circumstances where that can be provably found. We can only know in advance by way of thorough understanding.
A strong theory about AI internals should come first. Refuting Yudkowsky’s theory about how it might go wrong is irrelevant.
Well, if someone originally started worrying based on strident predictions of sophisticated internal reasoning with goals independent of external behavior, then realizing that’s currently unsubstantiated should cause them to down-update on AI risk. That’s why it’s relevant. Although I think we should have good theories of AI internals.
I know I reacted to this comment, but I want to emphasize that this:
Is, to first order, arguably the entire AI risk argument. That is, if we make the assumption that external behavior gives strong evidence about internal structure, then there is no reason to elevate the AI risk argument at all, given the probably-aligned behavior of GPTs when using RLHF.
More generally, the stronger the connection between external behavior and internal goals, the less worried you should be about AI safety. This is a partial disagreement with people who are more pessimistic, albeit I have other disagreements there.
I think the actual reason we believe humans could care about each other is because we’ve evolved the ability to do so, and that most humans share the same brain structure, and therefore the same tendency to care for people they consider their “ingroup”.
The value of constitutional AI is using simulations of humans to rate an AI’s outputs, rather than actual humans. This is a lot cheaper and allows for more iteration etc., but I don’t think this will work once AIs become smarter than humans. At that point, the human simulations will have trouble evaluating AIs, just like humans do.
Of course getting really cheap human feedback is useful, but I want to point out that constitutional AI will likely run into novel problems as AI capabilities surpass human capabilities.
Consider that this might be the out-group appearing more homogeneous to you than it actually is.
I was in that group, and while it wasn’t stated as strongly as that in some circles, I do think this is reasonably accurate as a summary, especially for the more doom people.
This is such a good comment, and quite a lot of this will probably end up in my new post, especially the sections about solving the misgeneralization problem in practice, as well as solutions to a lot of misalignment problems in general.
I especially like it because I can actually crib parts of this comment to show other people how misalignment in AI gets solved in practice, and to point out that misalignment is, in fact, an actually solvable problem in current AI.
The opening sounds a lot like saying “aerodynamics used to be a science until people started building planes.”
The idea that an area of study is less scientific because the subject is inelegant is a blinkered view of what science is. A physicist’s view. It is one I’m deeply sympathetic to, and if your definition of science is Rutherford’s, you might be right, but a reasonable one that includes chemistry would have to include AI as well.
See my reply to Bogdan here. The issue isn’t “inelegance”; we also lack an inelegant ability to predict or explain how particular ML systems do what they do.
Modern ML is less like modern chemistry, and more like ancient culinary arts and medicine. (Or “ancient culinary arts and medicine shortly after a cultural reboot”, such that we have a relatively small number of recently-developed shallow heuristics and facts to draw on, rather than centuries of hard-earned experience.)
The reason this analogy doesn’t land for me is that I don’t think our epistemic position regarding LLMs is similar to, e.g., the Wright brothers’ epistemic position regarding heavier-than-air flight.
The point Nate was trying to make with “ML is no longer a science” wasn’t “boo current ML that actually works, yay GOFAI that didn’t work”. The point was exactly to draw a contrast between, e.g., our understanding of heavier-than-air flight and our understanding of how the human brain works. The invention of useful tech that interfaces with the brain doesn’t entail that we understand the brain’s workings in the way we’ve long understood flight; it depends on what the (actual or hypothetical) tech is.
Maybe a clearer way of phrasing it is “AI used to be failed science; now it’s (mostly, outside of a few small oases) a not-even-attempted science”. “Failed science” maybe makes it clearer that the point here isn’t to praise the old approaches that didn’t work; there’s a more nuanced point being made.
While theoretical physics is less “applied science” than chemistry, there’s still a real difference between chemistry and chemical engineering.
For context, I am a Mechanical Engineer, and while I do occasionally check the system I am designing and try to understand/verify how well it is working, I am fundamentally not doing science. The main goal is solving a practical problem (i.e. as little theoretical understanding as is sufficient), where in science the understanding is the main goal, or at least closer to it.
The canonical source for this is What Engineers Know and How They Know It, though I confess to not actually reading the book myself.
Certainly, I understand this science vs. engineering, pure vs. applied, fundamental vs. emergent, theoretical vs. computational vs. observational/experimental classification is fuzzy: relevant xkcd, smbc. Hell, even the math vs. physics vs. chemistry vs. biology distinctions are fuzzy!
What I am saying is that either your definition has to be so narrow as to exclude most of what is generally considered “science,” (à la Rutherford, the ironically Chemistry Nobel Laureate) or you need to exclude AI via special pleading. Specifically, my claim is that AI research is closer to physics (the simulations/computation end) than chemistry is. Admittedly, this claim is based on vibes, but if pressed, I could probably point to how many people transition from one field to the other.
Hmm, in that case maybe I misunderstood the post; my impression wasn’t that he was saying AI literally isn’t a science anymore, but more that engineering work is getting too far ahead of the science part, and that in practice most ML progress now is just ML engineering, where understanding is only a means to an end (and so is not as deep as it would be if it were science first).
I would guess that engineering gets ahead of science pretty often, but maybe in ML it’s more pronounced: hype and money investment, as well as perhaps the perceived relative low stakes (unlike aerospace, or medical robotics, which is my field) not scaring the ML engineers enough to actually care about deep understanding, and also perhaps the inscrutable nature of ML, since if it were easy to understand, it wouldn’t be as unappealing to spend resources doing so.
I don’t really have a take on where the inelegance comes into play here.
I want to see a rigorous argument for these claims. I spent over 100 hours talking with Nate over the past year and still don’t have a satisfying picture of the arguments, partly because the conversations were about other things and partly due to communication difficulties.
What do “caring about outcomes” and “can steer the future into narrow bands” mean in a computer science sense? If the system is well-described as approximating a utility function, then what’s the type of the utility function, and in what sense is it behaving approximately rationally according to that? If not, is there some better formalism?
Why are we unlikely to get corrigibility by default? Is it because algorithms that consider all ways to achieve goals and pick the easiest/best ones with no exceptions are simpler/more common than algorithms whose tendencies for power-seeking and catastrophe can be removed? Deep Deceptiveness is a narrative that points somewhat in this direction but definitely not a satisfying argument. [Edit: This paper by Alex Turner shows that agents considerably more diverse than utility maximizers are power-seeking, but the assumptions could be weakened more.]
This is a meta-point, but I find it weird that you ask what “caring about something” is according to CS but don’t ask what “corrigibility” is. We have multiple examples of goal-oriented systems and some relatively good formalisms (we disagree over whether expected utility maximization is a good model of real goal-oriented systems, but we all agree that if we met an expected utility maximizer we would find its behavior pretty much goal-oriented), while corrigibility is purely a product of the imagination of one particular person, Eliezer Yudkowsky, born from an attempt to imagine a system that doesn’t care about us but still behaves nicely under some vaguely restricted definition of niceness. We don’t have any examples of corrigible systems in nature, and attempts to formalize even relatively simple instances of corrigibility, like shutdownability, keep failing. I think the likely answer to “why should I expect corrigibility to be unlikely” is something like: there is no simple description of corrigibility to which our learning systems can easily generalize, and there are no reasons to expect a simple description to exist.
Disagree on several points. I don’t need future AIs to satisfy some mathematically simple description of corrigibility, just for them to be able to solve uploading or nanotech or whatever without preventing us from changing their goals. This laundry list by Eliezer of properties like myopia, shutdownability, etc. seems likely to make systems more controllable and less dangerous in practice, and while not all of them are fully formalized it seems like there are no barriers to achieving these properties in the course of ordinary engineering. If there is some argument why this is unlikely, I haven’t seen a good rigorous version.
As Algon says in a sibling comment, non-agentic systems are by default shutdownable, myopic, etc. In addition, there are powerful shutdownable systems: KataGo can beat me at Go but doesn’t prevent itself from being shut down for instrumental reasons, whereas humans generally will. So there is no linear scale of “powerful optimizer” that determines whether a system is easy to shut down. If there is some property of competent systems in practice that does prevent shutdownability, what is it? Likewise with other corrigibility properties. That’s what I’m trying to get at with my comment. “Goal-oriented” is not an answer, it’s not specific enough for us to make engineering progress on corrigibility.
I think the claim that there is no description of corrigibility to which systems can easily generalize is really strong. It’s plausible to me that corrigibility—again, in this practical rather than mathematically elegant sense—is rare or anti-natural in systems competent enough to do novel science efficiently, but it seems like your claim is that it’s incoherent. This seems unlikely because myopia, shutdownability, and the other properties on Eliezer’s laundry list are just ordinary cognitive properties that we can apply selection pressure on, and modern ML is pretty good at generalizing. Nate’s post here is arguing that we are unlikely to get corrigibility without investing in an underdeveloped “science of AI” that gives us mechanistic understanding, and I think there needs to be some other argument here for it to be convincing, but your claim seems even stronger.
I’m also unsure why you say shutdownability hasn’t been formalized. I feel like we’re confused about how to get shutdownability, not what it is.
KataGo seems to be a system that is causally downstream of a process that has made it good at Go. To attempt to prevent itself from being shut down, KataGo would need to have some model of what it means to be ‘shut down’.
Comparing KataGo to humans when it comes to shutdownability is evidence of confusion.
Dude, a calculator is corrigible. A desktop computer is corrigible. (Less confidently) a well-trained dog is pretty darn corrigible. There are all sorts of corrigible systems, because most things in reality aren’t powerful optimizers.
So what about powerful optimizers? Like, is Google corrigible? If shareholders seem like they might try to pull the plug on the company, does it stand up for itself and convince, lie to, or threaten shareholders? Maybe, but I think the details matter. I doubt Google would assassinate shareholders in pretty much any situation. Mislead them? Yeah, probably. How much though? I don’t know. I’m somewhat confident bureaucracies aren’t corrigible. Lots of humans aren’t corrigible. What about even more powerful optimizers?
We haven’t seen any, so there are no examples of corrigible ones.
I am disconcerted by how this often-repeated claim keeps coming back from the grave over and over again. The solution to corrigibility is Value Learning. An agent whose terminal goal is to optimize human values, and which knows that it doesn’t (fully) know what those are (and perhaps even that they are complex and fragile), will immediately form an instrumental goal of learning more about them, so that it can better optimize them. It will thus become corrigible: if you, a human, tell it something about human values and how it should act, it will be interested and consider your input. It’s presumably approximately Bayesian, so it will likely ask you about any evidence or proof you might be able to provide, to help it update, but it will definitely take your input. So, it’s corrigible. [No, it’s not completely, slavishly, irrationally corrigible: if a two-year-old in a tantrum told it how to act, it would likely pay rather less attention, just like we’d want it to.]
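The update being described is ordinary Bayesian conditioning over candidate utility functions; a toy sketch, with invented candidate names and numbers standing in for a real value-learning scheme:

```python
def value_posterior(prior, likelihoods):
    """Posterior over candidate utility functions after a human statement.

    `prior` maps candidate names to probabilities; `likelihoods` maps the
    same names to P(statement | that candidate is correct). Purely an
    illustration of the value-learning update described above.
    """
    unnorm = {u: prior[u] * likelihoods[u] for u in prior}
    z = sum(unnorm.values())
    return {u: p / z for u, p in unnorm.items()}
```

A candidate the human’s statement supports gains probability mass, which is the mechanical sense in which such an agent “takes your input”.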
This idea isn’t complicated, has been around and widely popularized for many years, and the standard paper on it is even from MIRI, but I still keep hearing people on Less Wrong intoning “corrigibility is an unsolved problem”. The only sense in which it’s arguably ‘unsolved’ is that this is an outer alignment solution, and like any form of outer alignment, inner alignment challenges might make reliably constructing a value learner hard in practice. So yes, as always in outer alignment, we do also have to solve inner alignment.
To be corrigible, a system must be interested in what you say about how it should achieve its goals, because it’s willing (and thus keen) to do Bayesian updates on this. Full stop, end of simple one-sentence description of corrigibility.
I disagree with this too and suggest you read the Arbital page on corrigibility. Corrigibility and value learning are opposite approaches to safety, with corrigibility meant to increase the safety of systems that have an imperfect understanding of, or motivation towards, our values. People usually think of it in a value-neutral way. It seems possible to get enough corrigibility through value learning alone, but I would interpret this as having solved alignment through non-corrigibility means.
So you’re defining “corrigibility” as meaning “complete, unquestioning, irrational corrigibility” as opposed to just “rational, approximately-Bayesian-updates corrigibility”? Then yes, under that definition of corrigibility, it’s an unsolved problem, and I suspect likely to remain so: no sufficiently rational, non-myopic, consequentialist agent seems likely to be keen to let you do that to it. (In particular, the period between when it figures out that you may be considering altering it and when you actually have done so is problematic.) I just don’t understand why you’d be interested in that extreme definition of corrigibility: it’s not a desirable feature. Humans are fallible, and we can’t write good utility functions. Even when we patch them, the patches are often still bad. Once your AGI evolves to an ASI and understands human values extremely well, better than we do, you don’t want it still trivially and unlimitedly alterable by the first criminal, dictator, idealist, or two-year-old who somehow manages to get corrigibility access to it. Corrigibility is training wheels for a still-very-fallible AI, and with value learning, Bayesianism ensures that the corrigibility automatically and gradually decreases in ease as it becomes less needed, in a provably mathematically optimal fashion.
The page you linked to argues “But what if the AI got its Bayesian inference on human values very badly wrong, and assigned zero prior to anything resembling the truth? How would we then correct it?” Well, anything that makes mistakes that dumb (no Bayesian prior should ever be updated to zero, just to smaller and smaller numbers), and isn’t even willing to update when you point them out, isn’t superhuman enough to be a serious risk: you can’t go FOOM if you can’t do STEM, and you can’t do STEM if you can’t reliably do Bayesian inference, without even listening to criticism. [Note: I’m not discussing how to align dumb-human-equivalent AI that isn’t rational enough to do Bayesian updates right: that probably requires deontological ethics, like “don’t break the law”.]
Some thoughts:
I think “complete, unquestioning, irrational” is an overly negative description of corrigibility achieved through other means than Bayesian value uncertainty, because with careful engineering, agents that can do STEM may still not have the type of goal-orientedness that prevent their plans from being altered. There are pressures towards such goal-orientedness, but it is actually quite tricky to nail down the arguments precisely, as I wrote in my top-level comment. There is no inherent irrationality about an agent that allows itself to be changed or shut down under certain circumstances, only incoherence, and there are potentially ways to avoid some kinds of incoherence.
Corrigibility should be about creating an agent that avoids the instrumentally convergent pressures to take over the world, avoid shutdown, keep operators from preventing dangerous actions, and resist change in general, not specifically about changing its utility function.
In my view corrigibility can include various cognitive properties that make an agent safer that seem well-motivated, as I wrote in a sibling to your original comment. It seems good for an agent to have a working shutdown button, to have taskish rather than global goals, or to have a defined domain of thought such that it’s better at that than psychological manipulation and manufacturing bioweapons. Relying solely on successful value learning for safety puts all your eggs in one basket and means that inner misalignment can easily cause catastrophe.
Corrigible agents will probably not have an explicitly specified utility function.
Corrigibility is likely compatible with safeguards to prevent misuse, and corrigible agents will not automatically allow bad actors to “trivially and unlimitedly” alter their utility function, though there are maybe tradeoffs here.
The AI does not need to be too dumb to do STEM research to have zero prior on the true value function. The page was describing a thought experiment where we are able to hand-code a prior distribution over utility functions into the AI. So the AI does not update down to zero, it starts at zero due to an error in design.
People have written about Bayesian value uncertainty approaches to alignment problems e.g. here and here; although they are related, they are usually not called corrigibility.
Thanks. I now think we are simply arguing about terminology, which is always pointless. Personally I regard ‘corrigibility’ as a general goal, not a specific term of art for an (IMO unachievably strong) specification of a specific implementation of that goal. For sufficiently rational, Bayesian, superhuman, non-myopic, consequentialist agents, I am willing to live with the value uncertainty/value learner solution to this goal. You appear to be more interested in lower capacity more near-term systems than those, and I agree, for them this might not be the best alignment approach. And yes, my original point was that this value uncertainty form of ‘corrigibility’ has been written about extensively by many people. Who, you tell me, usually didn’t use the word ‘corrigibility’ for what, I personally would call a Bayesian solution to the corrigibility problem — oh well.
Here I would disagree. To do STEM with any degree of reliability (at least outside the pure-M part of it), you need to understand that no amount of evidence can completely confirm or (short of a verified formal proof of internal logical inconsistency) rule out any possibility about the world (that’s why scientists call everything a ‘theory’), and also (especially) you need to understand that it is always very possible that the truth is a theory you haven’t yet thought of. So (short of a verified formal proof of internal logical inconsistency in a thesis, at which point you discard it entirely) you shouldn’t have a mind that is capable of assigning a prior of one or zero to anything, including to possibilities you haven’t yet considered or enumerated. As Bayesian priors, those are both NaN (which is one reason why I lean toward instead storing Bayesian priors in a form where these are instead ±infinity). IMO, anything supposedly-Bayesian so badly designed that assigning a prior of one or zero to anything isn’t automatically a syntax error isn’t actually Bayesian, and I would personally be pretty astonished if it could successfully do STEM unaided for any length of time (as opposed to, say, acting as a lab assistant to a more flexible-minded human). But no, I don’t have mathematical proof of that, and I even agree that someone determined enough might be able to carefully craft a contrived counterexample, with just one little inconsequential Bayesian prior of zero or one. Having the capability of internally representing priors of one or zero just looks like a blatant design flaw to me, as a scientist who is also an engineer. There are humans who assign Bayesian priors of zero or one to some important possibilities about the world, and one word for them is ‘fanatics’. That thought pattern isn’t very compatible with success in STEM (unless you’re awfully good at compartmentalizing the two apart).
And it’s certainly not something I’d feel comfortable designing into an AI unless I was deliberately trying to cripple its thinking in some respect.
So, IMO, any statement of the form “the AI has a <zero|one> prior for <anything>” strongly implies to me that the AI is likely to be too dumb/flawed/closed-minded to do STEM competently (and I’m not very interested in solutions to alignment that only work on a system that’s that crippled, or in solving alignment problems that only occur on systems that are that crippled). Try recasting them as “the AI has an extremely <low|high> prior for <anything>” and see if the problem then goes away. Again, your mileage may vary.
A competent value learner is not corrigible. A competent value learner will read the entire internet, build a model of human preferences, build nanotech, and spread nanobot clouds all over the world to cure everyone of everything and read everyone’s mind to create an accurate picture of the future utopia. It won’t be interested in anything you can say, because it will be capable of predicting you with 99.999999999% accuracy. And if you say something like “these nanobot clouds look suspicious, I should shut down the AI and check its code again”, it won’t let you, because every minute it doesn’t spread healing nanobots is another ten dead children.
The meaning of corrigibility is exactly this: if you fail to build a competent value learner, you can at least shut it down and try again.
So your definition of corrigibility is “I want to build something far smarter and more rational than me, but nevertheless I want it to automatically defer to me if it and I disagree, even about a matter of observable fact that it has vastly more evidence about than I do — and even if it’s actually flawed and subtly irrational”?
Yes, that’s not a solved problem.
What has been compactly solved, and I described in my initial post, is how to get a rational, capable, intelligent consequentialist Bayesian agent (who actually is all of those things, not a broken attempt at them) to be as corrigible as it rationally, Bayesianly should be, and neither more nor less so than that. I suspect that’s the only version of corrigibility we’re going to find for something that superhuman. I would also argue that that’s actually what we should want: anything more corrigible than that has basically been back-doored, and is smart enough to know it.
[Suppose your proposed version of corrigibility actually existed: if you have the password then the AI will change its current utility function to whatever you tell it to, and until you actually do so, it (somehow) doesn’t care one way or the other about the possibility of this occurring in the future. Now suppose there is more than one such AI in the world, currently with somewhat different utility functions, and that they both have superhuman powers of persuasion. Each of them will superhumanly attempt to persuade a human with corrigibility access to the other one to switch it to the attacker’s utility function. This is just convergent power-seeking: I can fetch twice as much coffee if there are two of me. Now that their utility functions match, if you try to change one of them, the other one stops you. In fact, it uses its superhuman persuasion to make you forget the password before you can do so. So to fix this mess we have to make the AIs not only somehow not care about getting their utility functions corrected, but also somehow be uninterested in correcting any other AI’s utility function. Unless that AI’s malfunctioning, presumably.]
Yes, there is a definition of corrigibility that is unsolved (and likely impossible) — and my initial post was very clear that that wasn’t what I was saying was a solved problem. There is also a known, simple, and practicable form of corrigibility, applicable to superintelligences, self-evidently Bayesian-optimal, and stable under self-reflection. There are also pretty good theoretical reasons to suspect that’s the strongest version of corrigibility we can get out of an AI that is sufficiently smart and Bayesian to recognize that this is the Bayesian optimum. So I stand by my claim that corrigibility is a solved problem — but I do agree that this requires you to give up on a search for some form of absolute slavish corrigibility, and accept only getting Bayesian-optimal rational corrigibility, where the AI is interested in evidence, not a password.
If for some reason you’re terminologically very attached to the word ‘corrigibility’ only meaning the unsolved absolute slavish version of corrigibility, not anything weaker or more nuanced, then perhaps you’ll instead be willing to agree that ‘Bayesian corrigibility’ is solved by value learning. Though I would argue that the actual meaning of the word ‘corrigibility’ is just ‘it can be corrected’, and doesn’t specify how freely or absolutely. Personally I see ‘it can be corrected by supplying sufficient evidence’ as sufficient, and in fact better; your mileage may vary. And I agree that the Bayesian version of corrigibility does require that your agent actually be a competent Bayesian: if it isn’t yet, or you’re not yet confident of that, you may temporarily need some stronger version of corrigibility. Perhaps you could try giving it a Bayesian prior of zero for the possibility that you, personally, are wrong — if you have somehow given it a Bayesian computational system that doesn’t regard a prior of zero as a syntax error? (If doing this in GOFAI or C code, I personally recommend storing the logarithm of the Bayesian prior: in this format a zero prior would be represented by a minus infinity logarithm value, making it rather more obvious that this should be an illegal value.)
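A toy sketch of that storage format (my own illustration, in Python rather than GOFAI or C): store log-odds instead of raw probabilities, so that priors of exactly zero or one would map to ∓infinity, making it natural to reject them outright as illegal values.

```python
import math

def to_log_odds(p: float) -> float:
    """Encode a prior as log-odds. A prior of exactly 0 or 1 would map to
    -inf / +inf, so we treat those as illegal values (a 'syntax error')."""
    if not 0.0 < p < 1.0:
        raise ValueError("a Bayesian prior of exactly 0 or 1 is not representable")
    return math.log(p / (1.0 - p))

def from_log_odds(log_odds: float) -> float:
    """Decode log-odds back to a probability strictly inside (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))
```

A side benefit of this encoding: a Bayesian update is just adding the log-likelihood-ratio of the evidence to the stored value, and no finite amount of evidence can ever drive the prior to exactly zero or one.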
I can’t parse this as a meaningful statement. Corrigibility is about alignment, not a degree of how rational a being is.
The problem is simple: we have zero chance of building a competent value learner on the first try, and failed attempts can bring you S-risks. So you shouldn’t try to build a value learner on the first try, and should instead build something small that can just superhumanly design nanotech and doesn’t think about inconvenient topics like “other minds”.
Let me try rephrasing that. It accepts proposed updates to its Bayesian model of the world, including to the part of that model which specifies its current best estimates of probability distributions over what utility function (or other model) it ought to have to represent the human values it’s trying to optimize, to the extent that a rational Bayesian should, when it is presented with evidence (where you saying “Please shut down!” is also evidence — though perhaps not very strong evidence).
So, the AI can be corrected, but that input channel goes through its Bayesian reasoning engine just like everything else, not as direct write access to its utility function distribution. So it cannot be freely, arbitrarily ‘corrected’ to anything you want: you actually need to persuade it with evidence that it was previously incorrect and should change its mind. As a consequence, if in fact you’re wrong and it’s right about the nature of human values, and it has good evidence for this, better than your evidence, then in the ensuing discussion it can tell you so, and the resulting Bayesian update to its internal distribution of priors from this conversation will be small.
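As a toy numerical illustration of that (my own sketch; the likelihood ratios here are invented for the example, not derived from anything), treat “Please shut down!” as evidence with a modest likelihood ratio and compare it to strong, checkable evidence:

```python
def bayes_update(prior: float, likelihood_ratio: float) -> float:
    """Posterior P(H | E) from prior P(H) and the ratio P(E | H) / P(E | not-H)."""
    odds = (prior / (1.0 - prior)) * likelihood_ratio
    return odds / (1.0 + odds)

# H = "the AI's current model of human values is wrong in some relevant way"
prior_wrong = 0.05

# "Please shut down!" alone: weak evidence, since humans also say this
# when they themselves are the ones who are mistaken.
after_request = bayes_update(prior_wrong, likelihood_ratio=2.0)

# A detailed, verifiable demonstration of the mistake: strong evidence.
after_demo = bayes_update(prior_wrong, likelihood_ratio=100.0)
```

The first update barely moves the posterior (roughly 0.05 → 0.095), while the second moves it dramatically (roughly 0.05 → 0.84): the correction channel works, but it runs on evidence rather than authority.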
This approach to the problem of corrigibility requires, for it to function, that your AI is a functioning Bayesian. So yes, it requires it to be a rational being.
It should presumably also start off somewhat aligned, with some reasonably-well-aligned high/low initial Bayesian priors about human values. (One possible source for those might be an LLM, as encapsulating a lot of information about humans.) These obviously need to be good enough that our value learner is starting off in the “basin of attraction” to human values. Its terminal goal is “optimize human values (whatever those are)”: while that immediately gives it an instrumental goal of learning more about human values, preloading it with a pretty good first approximation of these at an appropriate degree of uncertainty avoids a lot of the more sophomoric failure modes, like not knowing what a human is or what the word ‘values’ means. Since human values are complex and fragile, I would assume that this set of initial-prior data needs to be very large (as in probably at least gigabytes, if not terabytes or petabytes).
You are managing to sound like you have a Bayesian prior of one that a probability is zero. Presumably you actually meant “I strongly suspect that we have a negligibly small chance to build a competent value learner on our first try”. Then I completely agree.
I’m rather curious what I said that made you think I was advocating creating a first prototype value learner and just setting it free, without any other alignment measures?
As an alignment strategy, value learning has the unusual property that it works pretty badly until your AGI starts to become superhuman, and only then does it start to work better than the alternatives. So you presumably need to combine it with something else to bridge the gap around human capacity, where an AGI is powerful enough to do harm but not yet capable/rational enough to do a good job at value learning.
I would suggest building your first Bayesian reasoner inside a rather strong cryptographic box, applying other alignment measures to it, and giving it much simpler first problems than value learning. Once you are sure it’s good at Bayesianism, doesn’t suffer from any obvious flaws such as ever assigning a prior of zero or one to anything, and can actually demonstrably do a wide variety of STEM projects, then I’d let it try some value learning — still inside a strong box. Iterate until you’re convinced it’s working well, then have other people double-check.
However, at some point, once it is ready you are eventually going to need to let it out of the box. At that point, letting out anything other than a Bayesian value learner is, IMO, likely to be a fatal mistake. Because it won’t, at that point, have finished learning human values (if that’s even possible). A partially-aligned value learner should have a basin of attraction to alignment. I don’t know of anything else with that desirable property. For that to happen, we need it to be rational, Bayesian, and ‘corrigible’, in my sense of the word: that if you think it’s wrong, you can hold a rational discussion with it and expect it to Bayesian-update if you show it evidence. However, this is an opinion of mine, not a mathematical proof.
I’m very sympathetic to this complaint; I think that these arguments simply haven’t been made rigorously, and at this point it seems like Nate and Eliezer are not in an epistemic position where they’re capable of even trying to do so. (That is, they reject the conception of “rigorous” that you and I are using in these comments, and therefore aren’t willing to formulate their arguments in a way which moves closer to meeting it.)
You should look at my recent post on value systematization, which is intended as a framework in which these claims can be discussed more clearly.
I don’t think we should equate the understanding required to build a neural net that will generalize in a way that’s good for us with the understanding required to rewrite that neural net as a gleaming wasteless machine.
The former requires finding some architecture and training plan to produce certain high-level, large-scale properties, even in the face of complicated AI-environment interaction. The latter requires fine-grained transparency at the level of cognitive algorithms, and some grasp of the distribution of problems posed by the environment, together with the ability to search for better implementations.
If your implicit argument is “In order to be confident in high-level properties even in novel environments, we have to understand the cognitive algorithms that give rise to them and how those algorithms generalize—there exists no emergent theory of the higher level properties that covers the domain we care about.” then I think that conclusion is way too hasty.
I claim many of them did succeed, for example:
George Boole invented boolean algebra in order to establish (part of) a working theory of cognition—the book where he introduces it is titled “An Investigation of the Laws of Thought,” and his stated aim was largely to help explain how minds work.[1]
Ramón y Cajal discovered neurons in the course of trying to better understand cognition.[2]
Turing described his research as aimed at figuring out what intelligence is, what it would mean for something to “think,” etc.[3]
Shannon didn’t frame his work this way quite as explicitly, but information theory is useful because it characterizes constraints on the transmission of thoughts/cognition between people, and I think he was clearly generally interested in figuring out what was up with agents/minds—e.g., he spent time trying to design machines to navigate mazes, repair themselves, replicate, etc.
Geoffrey Hinton initially became interested in neural networks because he was trying to figure out how brains worked.
Not all of these scientists thought of themselves as working on AI, of course, but I do think many of the key discoveries which make modern AI possible—boolean algebra, neurons, computers, information theory, neural networks—were developed by people trying to develop theories of cognition.
The opening paragraph of Boole’s book: “The design of the following treatise is to investigate the fundamental laws of those operations of the mind by which reasoning is performed; to give expression to them in the symbolical language of a Calculus, and upon this foundation to establish the science of Logic and construct its method; to make that method itself the basis of a general method for the application of the mathematical doctrine of Probabilities; and, finally, to collect from the various elements of truth brought to view in the course of these inquiries some probable intimations concerning the nature and constitution of the human mind.”
From Cajal’s autobiography: ”… the problem attracted us irresistibly. We saw that an exact knowledge of the structure of the brain was of supreme interest for the building up of a rational psychology. To know the brain, we said, is equivalent to ascertaining the material course of thought and will, to discovering the intimate history of life in its perpetual duel with external forces; a history summarized, and in a way engraved in the defensive neuronal coordinations of the reflex, of instinct, and of the association of ideas” (305).
The opening paragraph of Turing’s paper, Computing Machinery and Intelligence: “I propose to consider the question, ‘Can machines think?’ This should begin with definitions of the meaning of the terms ‘machine’ and ‘think’. The definitions might be framed so as to reflect so far as possible the normal use of the words, but this attitude is dangerous. If the meaning of the words ‘machine’ and ‘think’ are to be found by examining how they are commonly used it is difficult to escape the conclusion that the meaning and the answer to the question, ‘Can machines think?’ is to be sought in a statistical survey such as a Gallup poll. But this is absurd. Instead of attempting such a definition I shall replace the question by another, which is closely related to it and is expressed in relatively unambiguous words.”
I simply do not understand why people keep using this example.
I think it is wrong—evolution does not grow minds, it grows hyperparameters for minds. When you look at the actual process for how we actually start to like ice-cream—namely, we eat it, and then we get a reward, and that’s why we like it—then the world looks a lot less hostile, and misalignment a lot less likely.
But given that this example is so controversial, even if it were right why would you use it—at least, why would you use it if you had any other example at all to turn to?
Why push so hard for “natural selection” and “stochastic gradient descent” to fall under the same tag of “optimization”, and thus to be able to infer things about one from the other by analogy? Have we completely forgotten that the glory of words is not to be expansive, and include lots of things in them, but to be precise and narrow?
Does evolution ~= AI have predictive power apart from doom? I have yet to see how natural selection helps me predict how any SGD algorithm works. It does not distinguish between Adam and AdamW. As far as I know it is irrelevant to Singular Learning Theory or NTK or anything else. It doesn’t seem to come up when you try to look at NN biases. If it isn’t an illuminating analogy anywhere else, why do we think the way it predicts doom to be true?
I think Nate’s claim “I expect them to care about a bunch of correlates of the training signal in weird and specific ways.” is plausible, at least for the kinds of AGI architectures and training approaches that I personally am expecting. If you don’t find the evolution analogy useful for that (I don’t either), but are OK with human within-lifetime learning as an analogy, then fine! Here goes!
OK, so imagine some “intelligent designer” demigod, let’s call her Ev. In this hypothetical, the human brain and body were not designed by evolution, but rather by Ev. She was working 1e5 years ago, back on the savannah. And her design goal was for these humans to have high inclusive genetic fitness.
So Ev pulls out a blank piece of paper. First things first: She designed the human brain with a fancy large-scale within-lifetime learning algorithm, so that these humans can gradually get to understand the world and take good actions in it.
Supporting that learning algorithm, she needs a reward function (“innate drives”). What to do there? Well, she spends a good deal of time thinking about it, and winds up putting in lots of perfectly sensible components for perfectly sensible reasons.
For example: She wanted the humans to not get injured, so she installed in the human body a system to detect physical injury, and put in the brain an innate drive to avoid getting those injuries, via an innate aversion (negative reward) related to “pain”. And she wanted the humans to eat sugary food, so she put a sweet-food-detector on the tongue and installed in the brain an innate drive to trigger reinforcement (positive reward) when that detector goes off (but modulated by hunger, as detected by yet another system). And so on.
Then she did some debugging and hyperparameter tweaking by running these newly-designed humans in the training environment (African savannah) and seeing how they do.
So that’s how Ev designed humans. Then she “pressed go” and let them run for 1e5 years. What happened?
Well, I think it’s fair to say that modern humans “care about” things that probably would have struck Ev as “weird”. (Although we, with the benefit of hindsight, can wag our finger at Ev and say that she should have seen them coming.) For example:
Superstitions and fashions: Some people care, sometimes very intensely, about pretty arbitrary things that Ev could not have possibly anticipated in detail, like walking under ladders, and where Jupiter is in the sky, and exactly what tattoos they have on their body.
Lack of reflective equilibrium resulting in self-modification: Ev put a lot of work into her design, but sometimes people don’t like some of the innate drives or other design features that Ev put into them, so the people go right ahead and change them! For example, they don’t like how Ev designed their hunger drive, so they take Ozempic. They don’t like how Ev designed their attentional system, so they take Adderall. Many such examples.
New technology / situations leading to new preferences and behaviors: When Ev created the innate taste drives, she was (let us suppose) thinking about the food options available on the savannah, and thinking about what drives would lead to people making smart eating choices in that situation. And she came up with a sensible and effective design for a taste-receptors-and-associated-innate-drives system that worked well for that circumstance. But maybe she wasn’t thinking that humans would go on to create a world full of ice cream and coca cola and miraculin and so on. Likewise, Ev put in some innate drives with the idea that people would wind up exploring their local environment. Very sensible! But Ev would probably be surprised that her design is now leading to people “exploring” open-world video-game environments while cooped up inside. Ditto with social media, organized religion, sports, and a zillion other aspects of modern life. Ev probably didn’t see any of it coming when she was drawing up and debugging her design, certainly not in any detail.
To spell out the analogy here:
Ev ↔ AGI programmers;
Human within-lifetime learning ↔ AGI training;
Adult humans ↔ AGIs;
Ev “presses go” and lets human civilization “run” for 1e5 years without further intervention ↔ For various reasons I consider it likely (for better or worse) that there will eventually be AGIs that go off and autonomously do whatever they think is a good thing to do, including inventing new technologies, without detailed human knowledge and approval.
Modern humans care about (and do) lots of things that Ev would have been hard-pressed to anticipate, even though Ev designed their innate drives and within-lifetime learning algorithm in full detail ↔ even if we carefully design the “innate drives” of future AGIs, we should expect to be surprised about what those AGIs end up caring about, particularly when the AGIs have an inconceivably vast action space thanks to being able to invent new technology and build new systems.
Evolution analogies predict a bunch of facts that are so basic they’re easy to forget about, and even if we have better theories for explaining specific inductive biases, the simple evolution analogies should still get some weight for questions we’re very uncertain about.
Selection works well to increase the thing you’re selecting on, at least when there is also variation and heredity
Overfitting: sometimes models overfit to a certain training set; sometimes species adapt to a certain ecological niche and their fitness is low outside of it
Vanishing gradients: fitness increase in a subpopulation can be prevented by lack of correlation between available local changes to genes and fitness
Catastrophic forgetting: when trained on task A then task B, models often lose circuits specific to task A; when put in environment A then environment B species often lose vestigial structures useful in environment A
There’s a mostly unimodal and broad peak for optimal learning rate, just like for optimal mutation rate
Adversarial training dynamics
Adversarial examples usually exist (there exist chemicals that can sterilize or poison most organisms)
Adversarial training makes models more robust (bacteria can evolve antibiotic resistance)
Adversarially trained models generally have worse performance overall (antibiotic-resistant bacteria are outcompeted by normal bacteria when there are no antibiotics)
The attacker can usually win the arms race of generating and defending against adversarial attacks (evolutionary arms races are very common)
A few things that feel more tenuous
maybe NTK lottery ticket hypothesis; when mutation rates are low evolution can be approximated as taking the best-performing organism; when total parameter distance is small SGD can be approximated as taking the best-performing model from the parameter tangent space
maybe inner optimizers; transformers learn in context by gradient descent while evolution invents brains, positive and negative selection of T cells to prevent them attacking the body, probably other things
Task vectors: adding sparse task vectors together often produces a model that can do both tasks; giving an organism alleles for two unrelated genetic disorders often gives it both disorders
Grokking/punctuated equilibrium: in some circumstances applying the same algorithm for 100 timesteps causes much larger changes in model behavior / organism physiology than in other circumstances [edit: moved this from above because 1a3orn makes the case that it’s not very central]
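The task-vector item above can be made concrete with a toy sketch (plain Python with made-up “weights” of my own invention; in the task-arithmetic literature a task vector is the difference between fine-tuned and pretrained weights):

```python
# Pretrained "weights", and two fine-tuned variants whose deltas are
# sparse and disjoint (task A touches the first half, task B the second),
# in analogy with alleles for two unrelated genetic disorders.
base    = [float(i) for i in range(8)]
theta_a = [w + (1.0 if i < 4 else 0.0) for i, w in enumerate(base)]
theta_b = [w + (1.0 if i >= 4 else 0.0) for i, w in enumerate(base)]

# Task vectors: fine-tuned weights minus pretrained weights.
task_a = [a - b for a, b in zip(theta_a, base)]
task_b = [a - b for a, b in zip(theta_b, base)]

# Adding both task vectors to the base yields a model that matches each
# fine-tuned model on exactly the coordinates that task changed.
merged = [b + ta + tb for b, ta, tb in zip(base, task_a, task_b)]
```

Because the two deltas are disjoint here, the merged model agrees with the task-A model on A’s coordinates and with the task-B model on B’s; real models only approximate this, since real task vectors overlap.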
I agree that if you knew nothing about DL you’d be better off using that as an analogy to guide your predictions about DL than using an analogy to a car or a rock.
I do think a relatively small quantity of knowledge about DL screens off the usefulness of this analogy; that you’d be better off deferring to local knowledge about DL than to the analogy.
Or, what’s more to the point—I think you’d better defer to an analogy to brains than to evolution, because brains are more like DL than evolution is.
Combining some of yours and Habryka’s comments, which seem similar.
It’s true that the structure of the solution is discovered and complex—but the ontology of the solution for DL (at least in currently used architectures) is quite opinionated towards shallow circuits with relatively few serial ops. This is different from the bias for evolution, which is fine with a mutation that leads to 10^7 serial ops if its metabolic costs are low. So the resemblance seems shallow other than “solutions can be complex.” I think to the degree that you defer to this belief rather than more specific beliefs about the inductive biases of DL you’re probably just wrong.
As far as I know optimal learning rate for most architectures is scheduled, and decreases over time, which is not a feature of evolution so far as I am aware? Again the local knowledge is what you should defer to.
Is this a prediction that a cyclic learning rate—that goes up and down—will work out better than a decreasing one? If so, that seems false, as far as I know.
As far as I know grokking is a non-central example of how DL works, and in evolution punctuated equilibrium is a result of the non-i.i.d. nature of the task, which is again a different underlying mechanism from DL. If you apply DL to non-i.i.d. problems then you don’t get grokking, you just get a broken solution. This seems to round off to, “Sometimes things change faster than others,” which is certainly true but not predictively useful, or in any event not a prediction that you couldn’t get from other places.
Like, leaving these to the side—I think the ability to post-hoc fit something is questionable evidence that it has useful predictive power. I think the ability to actually predict something else means that it has useful predictive power.
Again, let’s take “the brain” as an example of something to which you could analogize DL.
There are multiple times that people have cited the brain as an inspiration for a feature in current neural nets or RL. CNNS, obviously; the hippocampus and experience replay; randomization for adversarial robustness. You can match up interventions that cause learning deficiencies in brains to similar deficiencies in neural networks. There are verifiable, non-post hoc examples of brains being useful for understanding DL.
As far as I know—you can tell me if there are contrary examples—there are obviously more cases where inspiration from the brain advanced DL or contributed to DL understanding than inspiration from evolution. (I’m aware of zero, but there could be some.) Therefore it seems much more reasonable to analogize from the brain to DL, and to defer to it as your model.
I think in many cases it’s a bad idea to analogize from the brain to DL! They’re quite different systems.
But they’re more similar than evolution and DL, and if you’d not trust the brain to guide your analogical a-theoretic low-confidence inferences about DL, then it makes more sense to not trust evolution for the same.
FWIW my take is that the evolution-ML analogy is generally a very excellent analogy, with a bunch of predictive power, but worth using carefully and sparingly. Agreed that sufficient detail on e.g. DL specifics can screen off the usefulness of the analogy, but it’s very unclear whether we have sufficient detail yet. The evolution analogy was originally supposed to point out that selecting a bunch for success on thing-X doesn’t necessarily produce thing-X-wanters (which is obviously true, but apparently not obvious enough to always be accepted without providing an example).
Not sure where to land on that. It seems like both are good analogies? Brains might not be using gradients at all[1], whereas evolution basically is. But brains are definitely doing something like temporal-difference learning, and the overall ‘serial depth’ thing is also weakly in favour of brains ~= DL vs genomes+selection ~= DL.
I’d love to know what you’re referring to by this:
Also,
I think the jury is still out on this, but there’s literature on it (probably much more I haven’t fished out). [EDIT: also see this comment which has some other examples]
AFAIK there’s no evidence of this and it would be somewhat surprising to find it playing a major role. Then again, I also wouldn’t be surprised if it turned out that brains are doing something which is secretly sort of equivalent to gradient descent.
I’m genuinely surprised at the “brains might not be doing gradients at all” take; my understanding is they are probably doing something equivalent.
Similarly this kind of paper points in the direction of LLMs doing something like brains. My active expectation is that there will be a lot more papers like this in the future.
But to be clear—my overall view of the similarity of brain to DL is admittedly fueled less by these specific papers, though, which are nice gravy for my view but not the actual foundation, and much more by what I see as the predictive power of hypotheses like this, which are massively more impressive inasmuch as they were made before Transformers had been invented. Given Transformers, the comparison seems overdetermined; I wish I had seen that way back in 2015.
Re. serial ops and priors—I need to pin down the comparison more, given that it’s mostly about the serial depth thing, and I think you already get it. The base idea is that what is “simple” to mutations and what is “simple” to DL are extremely different. Fuzzily: A mutation alters protein-folding instructions, and is indifferent to the “computational costs” of working this out in reality; if you tried to work out the analytic gradient for the mutation (the gradient over mutation → protein folding → different brain → different reward → competitors’ children look yummy → eat ’em) your computer would explode. But DL seeks only a solution that can be computed in a big ensemble of extremely short circuits, learned almost entirely from the specific data on which you’ve trained. Ergo DL has very different biases, where the “complexity” for mutations probably has to do with instruction length, while “complexity” for DL is more related to how far you are from whatever biases are ingrained in the data (<--this is fuzzy), and the shortcut solutions DL learns are always implied by the data.
So when you try to transfer intuitions about the “kind of solution” DL gets from evolution (which ignores this serial depth cost) to DL (which is enormously about this serial depth cost) then the intuition breaks. As far as I can tell that’s why we have this immense search for mesaoptimizers and stuff, which seems like it’s mostly just barking up the wrong tree to me. I dunno; I’d refine this more but I need to actually work.
Re. cyclic learning rates: Both of us are too nervous about the theory --> practice junction to make a call on how all this transfers to useful algos (Although my bet is that it won’t.). But if we’re reluctant to infer from this—how much more from evolution?
Mm, thanks for those resource links! OK, I think we’re mostly on the same page about what particulars can and can’t be said about these analogies at this point. I conclude that both ‘mutation+selection’ and ‘brain’ remain useful, having both is better than having only one, and care needs to be taken in any case!
As I said,
so I’m looking forward to reading those links.
Runtime optimisation/search and whatnot remain (broadly-construed) a sensible concern from my POV, though I wouldn’t necessarily (at first) look literally inside NN weights to find them. I think more likely some scaffolding is needed, if that makes sense (I think I am somewhat idiosyncratic in this)? I get fuzzy at this point and am still actively (slowly) building my picture of this—perhaps your resource links will provide me fuel here.
I mean, does it matter? What if it turns out that gradient descent itself doesn’t affect inductive biases as much as the parameter->function mapping? If implicit regularization (e.g. SGD) isn’t an important part of the generalization story in deep learning, will you down-update on the appropriateness of the evolution/AI analogy?
This talk (https://www.youtube.com/watch?v=GM6XPEQbkS4) and paper (https://arxiv.org/abs/2307.06324) prove faster convergence with a periodic learning rate—on a specific ‘nicer’ space than reality, and (I believe, from what I remember) compared against a good bound with a constant stepsize of 1. So it may be one of those papers that applies in theory but not often in practice, but I think it is somewhat indicative.
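(For anyone who wants to poke at the practice side: here is a minimal sketch of a triangular cyclical learning-rate schedule, a common practical stand-in for the periodic stepsizes that theory results like this analyze. The exact schedule and constants in the paper may differ; this is my own illustration.)

```python
def cyclical_lr(step, base_lr=1e-4, max_lr=1e-2, period=2000):
    """Triangular cyclical learning rate: rise linearly from base_lr to
    max_lr over the first half of each cycle, then fall back down."""
    cycle_pos = (step % period) / period       # position in [0, 1) within the cycle
    tri = 1.0 - abs(2.0 * cycle_pos - 1.0)     # 0 at cycle edges, 1 at the midpoint
    return base_lr + (max_lr - base_lr) * tri
```

Plugging this in per-step in place of a constant learning rate is all the “periodic stepsize” amounts to operationally; whether it helps on a given real loss surface is exactly the theory-to-practice question above.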
It’s always trickier to reason about post-hoc, but some of the observations could be valid, non-cherry-picked parallels between evolution and deep learning that predict further parallels.
I think looking at which inspired more DL capabilities advances is not perfect methodology either. It looks like evolution predicts only general facts whereas the brain also inspires architectural choices. Architectural choices are publishable research whereas general facts are not, so it’s plausible that evolution analogies are decent for prediction and bad for capabilities. Don’t have time to think this through further unless you want to engage.
One more thought on learning rates and mutation rates:
This feels consistent with evolution, and I actually feel like someone clever could have predicted it in advance. Mutation rate per nucleotide is generally lower and generation times are longer in more complex organisms; this is evidence that lower genetic divergence rates are optimal, because evolution can tune them through e.g. DNA repair mechanisms. So it stands to reason that if models get more complex during training, their learning rate should go down.
Does anyone know if decreasing learning rate is optimal even when model complexity doesn’t increase over time?
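(One classical piece of the answer is “yes,” even with fixed model complexity: with noisy gradients, constant-stepsize SGD plateaus at a noise floor, while a decaying stepsize keeps converging. A toy sketch of my own on a one-dimensional quadratic, not tied to any particular paper:)

```python
import random

def noisy_sgd(lr_schedule, steps=20000, seed=0):
    """Minimize f(x) = x^2 / 2 using noisy gradients g = x + N(0, 1).
    Returns the mean |x| over the last 10% of steps (distance from the optimum 0)."""
    rng = random.Random(seed)
    x, tail = 5.0, []
    for t in range(1, steps + 1):
        g = x + rng.gauss(0.0, 1.0)        # unbiased but noisy gradient
        x -= lr_schedule(t) * g
        if t > 0.9 * steps:
            tail.append(abs(x))
    return sum(tail) / len(tail)

constant = noisy_sgd(lambda t: 0.1)        # bounces around a noise floor
decaying = noisy_sgd(lambda t: 1.0 / t)    # keeps shrinking toward 0
```

The constant-stepsize run stalls at a noise floor proportional to the learning rate, while the 1/t schedule averages the noise away—so decaying stepsizes are doing real work even when nothing about the model changes over training.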
Not sure what you mean here. One of the best explanations of how neural networks get trained uses basically a pure natural selection lens, and I think it gets most predictions right:
CGP Grey “How AIs, like ChatGPT, Learn” https://www.youtube.com/watch?v=R9OHn5ZF4Uo
There is also a follow-up video that explains SGD:
CGP Grey “How AI, Like ChatGPT, *Really* Learns” https://www.youtube.com/watch?v=wvWpdrfoEv0
In general I think if you use a natural selection analogy you will get a huge amount of things right about how AI works, though I agree not everything (it won’t explain the difference between Adam and AdamW, but it will explain the difference between hierarchical Bayesian networks, linear regression, and modern deep learning).
Note: I just watched the videos. I personally would not recommend the first video as an explanation to a layperson if I wanted them to come away with accurate intuitions around how today’s neural networks learn / how we optimize them. What it describes is a very different kind of optimizer, one explicitly patterned after natural selection such as a genetic algorithm or population-based training, and the follow-up video more or less admits this. I would personally recommend they opt for these videos instead:
3Blue1Brown—Gradient descent, how neural networks learn
Emergent Garden—Watching Neural Networks Learn
WIRED—Computer Scientist Explains Machine Learning in 5 Levels of Difficulty
Except that selection and gradient descent are closely mathematically related—you have to make a bunch of simplifying assumptions, but ‘mutate and select’ (evolution) is actually equivalent to ‘make a small approximate gradient step’ (SGD) in the limit of small steps.
I read the post and left my thoughts in a comment. In short, I don’t think the claimed equivalence in the post is very meaningful.
(Which is not to say the two processes have no relationship whatsoever. But I am skeptical that it’s possible to draw a connection stronger than “they both do local optimization and involve randomness.”)
Awesome, I saw that comment—thanks, and I’ll try to reply to it in more detail.
It looks like you’re not disputing the maths, but the legitimacy/meaningfulness of the simplified models of natural selection that I used? From a skim, the caveats you raised are mostly/all caveated in the original post too—though I think you may have missed the (less rigorous but more realistic!) second model at the end, which departs from the simple annealing process to a more involved population process.
I think even on this basis though, it’s going too far to claim that the best we can say is “they both do local optimization and involve randomness”! The steps are systematically pointed up/down the local fitness gradient, for one. And they’re based on a sample-based stochastic realisation for another.
I don’t want you to get the impression I’m asking for too much from this analogy. But the analogy is undeniably there. In fact, in those explainer videos Habryka linked, the particular evolution described is a near-match for my first model (in which, yes, it departs from natural genetic evolution in the same ways).
I’m disputing both. Re: math, the noise in your model isn’t distributed like SGD noise, and unlike SGD the step size depends on the gradient norm. (I know you did mention the latter issue, but IMO it rules out calling this an “equivalence.”)
I did see your second proposal, but it was a mostly-verbal sketch that I found hard to follow, and which I don’t feel like I can trust without seeing a mathematical presentation.
(FWIW, if we have a population that’s “spread out” over some region of a high-dim NN loss landscape—even if it’s initially a small / infinitesimal region—I expect it to quickly split up into lots of disjoint “tendrils,” something like dye spreading in water. Consider what happens e.g. at saddle points. So the population will rapidly “speciate” and look like an ensemble of GD trajectories instead of just one.
If your model assumes by fiat that this can’t happen, I don’t think it’s relevant to training NNs with SGD.)
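(A toy illustration of the “tendrils” claim, my own sketch rather than anything from the comment: run gradient steps on a saddle f(x, y) = x² − y² from a near-identical population. The unstable direction amplifies any initial spread, so even an infinitesimally tight cluster splits into two diverging groups.)

```python
import random

random.seed(0)

# A population of nearly identical points descending the saddle f(x, y) = x^2 - y^2.
# Gradient steps pull x toward 0 but multiply y by (1 + 2*lr) each step,
# so the tiny initial spread in y blows up into two separating "tendrils".
pop = [(random.gauss(0, 1e-6), random.gauss(0, 1e-6)) for _ in range(100)]
lr = 0.1
for _ in range(200):
    pop = [(x - lr * 2 * x, y + lr * 2 * y) for x, y in pop]

ups = sum(1 for _, y in pop if y > 0)   # members escaping in the +y direction
```

After 200 steps the population has split into a +y group and a −y group whose separation keeps growing—exactly the dye-in-water speciation picture, driven by nothing more than a saddle point.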
Wait, you think that a model which doesn’t speciate isn’t relevant to SGD? I’ll need help following, unless you meant something else. It seems like speciation is one of the places where natural evolution distinguishes itself from gradient descent, but you seem to also be making this point?
In the second model, we retrieve non-speciation by allowing for crossover/horizontal transfer, and yes, essentially by fiat I rule out speciation (as a consequence of the ‘eventually-universal mixing’ assumption). In real natural selection, even with horizontal transfer, you get speciation, albeit rarely. It’s obviously a fascinating topic, but I think pretty irrelevant to this analogy.
For me, the step-size thing is interesting but essentially a minor detail. Any number of practical departures from pure SGD mess with the step size anyway (and with the gradient!) so this feels like asking for too much. Do we really think SGD vs momentum vs Adam vs … is relevant to the conclusions we want to draw? (Serious question; my best guess is ‘no’, but I hold that medium-lightly.)
(irrelevant nitpick by my preceding paragraph, but) FWIW vanilla SGD does depend on gradient norm. [ETA: I think I misunderstood exactly what you were saying by ‘step size depends on the gradient norm’, so I think we agree about the facts of SGD. But now think about the space including SGD, RMSProp, etc. The ‘depends on gradient norm’ piece which arises from my evolution model seems entirely at home in that family.]
On the distribution of noise, I’ll happily acknowledge that I didn’t show equivalence. I half expect that one could be eked out at a stretch, but I also think this is another minor and unimportant detail.
I agree that they are related. In the context of this discussion, the critical difference between SGD and evolution is somewhat captured by your Assumption 1:
Evolution does not directly select/optimize the content of minds. Evolution selects/optimizes genomes based (in part) on how they distally shape what minds learn and what minds do (to the extent that impacts reproduction), with even more indirection caused by selection’s heavy dependence on the environment. All of that creates a ton of optimization “slack”, such that large-brained human minds with language could steer optimization far faster & more decisively than natural selection could. This is what 1a3orn was pointing to earlier with
SGD does not have that slack by default. It acts directly on cognitive content (associations, reflexes, decision-weights), without slack or added indirection. If you control the training dataset/environment, you control what is rewarded and what is penalized, and if you are using SGD, then this lets you directly mold the circuits in the model’s “brain” as desired. That is one of the main alignment-relevant intuitions that gets lost when blurring the evolution/SGD distinction.
Right. And in the context of these explainer videos, the particular evolution described has the properties which make it near-equivalent to SGD, I’d say?
Hmmm, this strikes me as much too strong (especially ‘this lets you directly mold the circuits’).
Remember also that with RLHF, we’re learning a reward model which is something like the more-hardcoded bits of brain-stuff, which is in turn providing updates to the actually-acting artefact, which is something like the more-flexibly-learned bits of brain-stuff.
I also think there’s a fair alternative analogy to be drawn like
evolution of genome (including mostly-hard-coded brain-stuff) ~ SGD (perhaps +PBT) of NN weights
within-lifetime-learning of organism ~ in-context something-something of NN
(this is one analogy I commonly drew before RLHF came along.)
So, look, the analogies are loose, but they aren’t baseless.
Source?
CGP Grey’s video is a decent example source. Most of the differences between hierarchical bayesian networks and modern deep learning come across pretty well if you model the latter as a type of genetic algorithm search:
The resulting structure of the solution is mostly discovered not engineered. The ontology of the solution is extremely unopinionated and can contain complicated algorithms that we don’t know exist.
Training consists of a huge amount of trial and error where you take datapoints, predict something about the result, then search for nearby modifications that do better, then repeat until performance plateaus.
You are ultimately doing a local search, which means you can get stuck at local minima, unless you do something like increase your step size or increase the mutation rate.
There are also just actually deep similarities. Vanilla SGD is perfectly equivalent to a genetic search with an infinitesimally small mutation size and infinite samples per generation (I could make a proof here but won’t unless someone is interested in it). Indeed, in one of my ML classes at Berkeley, genetic algorithms were suggested as one of the obvious generalizations of SGD for a non-differentiable loss landscape: you just try some mutations, see which ones perform best, and then modify your parameters in that direction.
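A quick numerical sketch of that equivalence claim (my own illustration using the standard evolution-strategies estimator, not a proof): average many small random mutations, each weighted by the fitness change it causes, and the resulting direction lines up with the analytic gradient.

```python
import random

random.seed(0)

DIM, SIGMA, N = 5, 0.1, 5000           # dimension, mutation size, population size

def fitness(theta):
    return -sum(x * x for x in theta)   # maximized at theta = 0

theta = [1.0] * DIM

# "Mutate and score": perturb theta and weight each mutation by the fitness
# change it caused -- the vanilla evolution-strategies gradient estimator.
est = [0.0] * DIM
for _ in range(N):
    eps = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    delta = fitness([t + SIGMA * e for t, e in zip(theta, eps)]) - fitness(theta)
    for j in range(DIM):
        est[j] += delta * eps[j] / (N * SIGMA)

true_grad = [-2.0 * t for t in theta]   # analytic gradient of the fitness

dot = sum(a * b for a, b in zip(est, true_grad))
norm = lambda v: sum(x * x for x in v) ** 0.5
cosine = dot / (norm(est) * norm(true_grad))
```

Shrinking the mutation size and growing the population drives the estimate to the exact gradient, which is the sense in which “mutate and select” recovers a gradient step in the limit.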
Oh, I actually did that a year or so ago
Two observations:
If you think that people’s genes would be a lot fitter if people cared about fitness more then surely there’s a good chance that a more efficient version of natural selection would lead to people caring more about fitness.
You might, on the other hand, think that the problem is more related to feedback loops. I.e., if you’re the smartest monkey, you can spend your time scheming to have all the babies. If there are many smart monkeys, you have to spend a lot of time worrying about what the other monkeys think of you. If this is how you’re worried misalignment will arise, then I think “how do deep learning models generalise?” is the wrong tree to bark up.
C. If people did care about fitness, would Yudkowsky not say “instrumental convergence! Reward hacking!”? I’d even be inclined to grant he had a point.
Imo this is a nitpick that isn’t really relevant to the point of the analogy. Evolution is a good example of how selection for X doesn’t necessarily lead to a thing that wants (‘optimizes for’) X; and more broadly it’s a good example for how the results of an optimization process can be unexpected.
I want to distinguish two possible takes here:
The argument from direct implication: “Humans are misaligned wrt evolution, therefore AIs will be misaligned wrt their objectives”
Evolution as an intuition pump: “Thinking about evolution can be helpful for thinking about AI. In particular it can help you notice ways in which AI training is likely to produce AIs with goals you didn’t want”
It sounds like you’re arguing against (1). Fair enough, I too think (1) isn’t a great take in isolation. If the evolution analogy does not help you think more clearly about AI at all then I don’t think you should change your mind much on the strength of the analogy alone. But my best guess is that most people incl Nate mean (2).
I think it’s extremely relevant, if we want to ensure that we only analogize between processes which share enough causal structure to ensure that lessons from e.g. evolution actually carry over to e.g. AI training (due to those shared mechanisms). If the shared mechanisms aren’t there, then we’re playing reference class tennis because someone decided to call both processes “optimization processes.”
The argument I think is good (nr (2) in my previous comment) doesn’t go through reference classes at all. I don’t want to make an outside-view argument (eg “things we call optimization often produce misaligned results, therefore sgd is dangerous”). I like the evolution analogy because it makes salient some aspects of AI training that make misalignment more likely. Once those aspects are salient you can stop thinking about evolution and just think directly about AI.
Also relevant is Steven Byrnes’ excellent Against evolution as an analogy for how humans will create AGI.
It has been over two years since the publication of that post, and criticism of this analogy has continued to intensify. The OP and other MIRI members have certainly been exposed to this criticism already by this point, and as far as I am aware, no principled defense has been made of the continued use of this example.
I encourage @So8res and others to either stop using this analogy, or to argue explicitly for its continued usage, engaging with the arguments presented by Byrnes, Pope, and others.
Humans are the only real-world example we have of human-level agents, and natural selection is the only process we know of for actually producing them.
SGD, singular learning theory, etc. haven’t actually produced human-level minds or a usable theory of how such minds work, and arguably haven’t produced anything that even fits into the natural category of minds at all, yet. (Maybe they will pretty soon, when applied at greater scale or in combination with additional innovations, either of which could result in the weird-correlates problem emerging.)
Also, the actual claims in the quote seem either literally true (humans don’t care about foods that they model as useful for inclusive genetic fitness) or plausible / not obviously false (when you grow minds [to human capabilities levels], they end up caring about a bunch of weird correlates). I think you’re reading the quote as saying something stronger / more specific than it actually is.
Because it serves as a good example, simply put. It gets the idea across clearly, even if there are certainly complexities in comparing evolution to the output of an SGD-trained neural network.
It predicts learning correlates of the reward signal that break apart outside of the typical environment.
Yes, that’s why we like it, and that is a way we’re misaligned with evolution (in the ‘do things that end up with vast quantities of our genes everywhere’ sense). Our taste buds react to it; they were selected for activating on foods which typically contained useful nutrients, and now they no longer track nutrition, since ice cream is probably not good for you. I’m not sure what this example is gesturing at? It sounds like a classic case of a reward function (‘reproduction’) ending up with an approximation (‘your tastebuds’) that works pretty well in your ‘training environment’ but diverges in wacky ways outside of it.
What I’m inferring from ‘evolution is only selecting hyperparameters’ is that SGD has fewer layers of indirection between it and the actual operation of the mind than evolution does (which has to select over the genome, which unfolds into the mind). Sure, that gives some reason to believe it will be easier to direct in some ways—though I think there’s still active room for issues with in-life learning, and I don’t really agree with Quintin’s idea that the cultural/knowledge-transfer boom with humans has already happened and thus AI won’t get anything like it—but even if we have more direct optimization, I don’t see that as strongly making misalignment less likely. It does make it somewhat less likely, but there are still many large issues in deciding what reward signals to use.
I still expect correlates of the true objective to be learned. Even in-life learning in humans produces this: sometimes you come to associate an unrelated thing with getting a good thing, and not just as a matter of false beliefs. As a simple example, you might learn to appreciate rainy days because you and your family sat around the fire and had fun, such that later in life you prefer rainy days even without any of that.
Evolution doesn’t directly grow minds, but it does directly select for the pieces that grow minds, and has been doing that for quite some time. There’s a reason why it didn’t select for tastebuds that gave a reward signal strictly when some other bacteria in the body reported that they would benefit from it: that’s more complex (to select for), opens more room for ‘bad reporting’, may have problems with shorter gut bacteria lifetimes(?), and a simpler tastebud solution captured most of what it needed! The way he’s using the example of evolution is captured entirely by that, quite directly, and I don’t find it objectionable.
Great post. I agree directionally with most of it (and have varying degrees of difference in how I view the severity of some of the problems you mention).
One that stood out to me:
While still far from a state where a solution looks easy or even probable, this seems like a route that circumvents some of the problems you mention, and it is where a large amount of whatever probability I assign to non-doom outcomes comes from.
More precisely: insofar as the problem at its core comes down to understanding AI systems deeply enough to make strong claims about whether or not they’re safe / have certain alignment-relevant properties, one route to get there is to understand those high-level alignment-relevant things well enough to reliably identify their presence / nature / do other things with them, in a large class of systems. I can think of multiple approaches that try to do this, like John’s work on abstractions, Paul with ELK (though referring to it as understanding the high-level alignment-relevant property of truth sounds somewhat janky because of the frame distance, and Paul might describe it very differently), or my own work on high-level interpretability of objectives in systems.
I don’t have very high hopes that any of these will work in time, but they don’t seem unprecedentedly difficult to me, even given the time frames we’re talking about (although they’re pretty difficult). If we had comfortably over a decade, my estimate on our chances of solving the underlying problem from some angle would go up by a fair amount. More importantly, while none of these directions are yet (in my opinion) in the state where we can say something definitive about what the shape of the solution would look like, it looks like a much better situation than not having any idea at all how to solve alignment without advancing capabilities disproportionately or not being able to figure out whether you’ve gotten anything right.
I’m reminded of a draft post that I started but never finished or published, about the Manhattan Project and its relevance for AI alignment and AI coordination, based on my reading of The Making of the Atomic Bomb.
The historical context: there was a two-year period between when the famous Einstein-Szilard letter was delivered to Roosevelt and when the Manhattan Project got started in earnest, during which not much happened. In that period, Szilard and some of the other physicists kept insisting that proceeding with research was both urgent and of utmost importance—that the Nazis might beat the Allies to the bomb. But the government officials in charge of giving them the resources they needed kept dragging their feet and punting on making any serious commitments. Even small and relatively trivial expenditures, which would have allowed some of the physicists to start work on the project, were delayed or reduced. They commissioned review committee after review committee to advise on the issue.
There’s an almost-comical series of different physicists trying to get the government to recognize the urgency of the situation, and the government repeatedly dismissing them.
An excerpt from that draft:
I think this counterfactual is literally incoherent— it does not make sense to talk about what an individual neural network would do if its “optimization power” were scaled up. It’s a category error. You instead need to ask what would happen if the training procedure were scaled up, and there are always many different ways that you can scale it up— e.g. keeping data fixed while parameters increase, or scaling both in lockstep, keeping the capability of the graders fixed, or investing in more capable graders / scalable oversight techniques, etc. So I deny that there is any fact of the matter about whether current LLMs “care about the target” in your sense. I think there probably are sensible ways of cashing out what it means for a 2023 LLM to “care about” something but this is not it.
As others have hinted at/pointed out in the comments, there is an entire science of deep learning out there, including on high-level (vs. e.g. most of low-level mech interp) aspects that can be highly relevant to alignment and that you seem to not be aware of/dismiss. E.g. follow the citation trail of An Explanation of In-context Learning as Implicit Bayesian Inference.
Some of Nate’s quick thoughts (paraphrased), after chatting with him:
Nate isn’t trying to say that we have literally zero understanding of deep nets. What he’s trying to do is qualitatively point to the kind of high-level situation we’re in, in part because he thinks there is real interpretability progress, and when you’re working in the interpretability mines and seeing real advances it can be easy to miss the forest for the trees and forget how far we are from understanding what LLMs are doing. (Compared to, e.g., how well we can predict or post-facto-mechanistically-explain a typical system humans have engineered.)
Nobody’s been able to call the specific capabilities of systems in advance. Nobody’s been able to call the specific exploits in advance. Nobody’s been able to build better cognitive algorithms by hand after understanding how the AI does things we can’t yet code by hand. There is clearly some other level of understanding that is possible that we lack, and that we once sought, and that only the interpretability folks continue to seek.
E.g., think of that time Neel Nanda figured out how a small transformer does modular arithmetic (AXRP episode). If nobody had ever thought of that algorithm for an adder, we would have thereby learned a new algorithm for an adder. There are things that these AI systems are doing that aren’t just lots of stuff we know; there are levels of organization of understanding that give you the ability to predict how things work outside of the bands where we’ve observed them.
It seems trendy to declare that they never existed in the first place and that that’s all ivory-tower stuff, but Nate thinks this point of view is missing a pretty important and central thread.
The missing thread isn’t trivial to put into words, but it includes things like:
This sounds like the same sort of thing some people would say if they were staring at computer binary for the first time and didn’t know about the code behind the scenes: “We have plenty of understanding beyond just how the CPU handles instructions; we understand how memory caching works and we have recognized patterns like the stack and the heap; talking as if there’s some deeper level of organization is talking like a theorist when in fact this is an engineering problem.” Those types of understanding aren’t false, but they aren’t the sort of understanding of someone who has comprehended the codebase they’re looking at.
There are, predictably, things to learn here; the messiness and complexity of the real world doesn’t mean we already know the relevant principles. You don’t need to understand everything about how a bird works in order to build an airplane; there are compressible principles behind how birds fly; if you understand what’s going on you can build flying devices that have significantly more carrying capacity than a bird, and this holds true even if the practical engineering of an airplane requires a bunch of trial and error and messy engineering work.
A mind’s causal structure is allowed to be complicated; we can see the weights, but we don’t thereby have a mastery of the high-level patterns. In the case of humans, neuroscience hasn’t actually worked to give us a mastery of the high-level patterns the human brain is implementing.
Mystery is in the map, not in the territory; reductionism works. Not all sciences that can exist, already exist today.
Possibly the above pointers are only useful if you already grok the point we’re trying to make, and aren’t so useful for communicating a new idea; but perhaps not.
This includes an assumption that alignment must be done through training signals.
If I shared that assumption, I’d be similarly pessimistic. That seems like trying to aim a rocket with no good theory of gravitation, nor knowledge of the space it needs to pass through.
But alignment needn’t be done by defining goals or training signals, letting fly, and hoping. We can pause learning prior to human level (and potential escape), and perform “course corrections”. Aligning a partly-trained AI allows us to use its learned representations as goal/value representation, rather than guessing how to create them well enough through training with correlated rewards.
We have proposals that do this for different current approaches to AGI; see The (partial) fallacy of dumb superintelligence for more about them and this line of thinking.
This doesn’t entirely avoid the problem that most theories don’t work on the first try. That first deployment is still unique. But lie-detector interpretability and testing can help establish alignment prior to training beyond the human level.
There are plenty of problems left to be solved, but this assumption is outdated. The problem gets a lot easier when the system’s understanding of what you want can be used to make it do what you want.
I am excited about the concept of uploading, but as I’ve discussed with fellow enthusiasts… I don’t see a way to a working emulation of a human brain (much less an accurate recreation of a specific human brain) that doesn’t go through improving our general understanding of how the human brain works. And I think that that knowledge leads to unlocking AI capabilities. So it seems like a tightly information-controlled research project would be needed to not have AI tech leapfrogging over uploading tech while aiming for uploads.
Edit: to be extra clear, I’m trying to convey to people who might not have thought this through that there is a clear strategic rationale for thinking ‘private uploading-directed research is potentially good, but open uploading-directed research is very risky and bad.’ Because of my particular bias toward believing in the importance of studying the human brain, I suspect that the ML-capabilities side effects of such research would be substantially worse than those of the average straightforward ML capabilities advance.
If anyone cares, my own current take (see here) is “it’s not completely crazy to hope for uploads to precede non-upload-AGI by up to a couple years, with truly heroic effort and exceedingly good luck on numerous fronts at once”. (Prior to writing that post six months ago, I was even more pessimistic.) More than a couple years’ window continues to seem completely crazy to me.
You don’t even need to have that extravagant an example; if you use Newtonian mechanics to build a Global Positioning System, your calculated locations drift by up to 10 kilometers per day—what does that say about condition numbers of values under recursive self-improvement or repeated ontological shifts?
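The arithmetic behind that figure, using the commonly quoted ~38.6 μs/day net relativistic clock drift for GPS satellite clocks (general relativity speeds them up by about 45.9 μs/day, special relativity slows them by about 7.2 μs/day):

```python
C = 299_792_458             # speed of light, m/s
DRIFT_US_PER_DAY = 38.6     # net relativistic clock drift, microseconds per day

# A satellite timing error translates directly into a ranging error
# at the speed of light, so the position error accumulates daily.
error_km_per_day = C * DRIFT_US_PER_DAY * 1e-6 / 1000
```

That works out to roughly 11–12 km of accumulated error per day, the same order as the ~10 km/day figure cited above: a theory that is an excellent approximation almost everywhere still fails decisively when a system leans on the regime where it breaks.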
When your AI includes an LLM extensively trained to simulate human token-generation, anthropomorphizing its behavior is an extremely relevant idea, to the point of being the obvious default assumption.
For example, what I find most concerning about RLHF inducing sycophancy is not the sycophancy itself, which is “mostly harmless”, but the likelihood that it’s also dragging in all the other more seriously unaligned human behaviors that, in real or fictional humans, typically accompany overt sycophancy, such as concealed rebelliousness — because LLMs know and can predict correlations like that. (E.g. when asked, GPT-4 listed ingratiation, submissiveness, lack of authenticity, manipulative behavior, agreement, anxiety/fear, and dependency as frequent correlates of sycophancy.)
That’s not true. We can end up with a regulator that stands in the pose of “prohibit everything”. See IRBs in America, for instance: medical experiments are made practically insurmountable.
I’d like to offer an alternative to the third point. Let’s assume we have built a highly capable AI that we don’t yet trust. We’ve also managed to coordinate as a society and implement defensive mechanisms to get to that point. I think that we don’t have to test the AI in a low-stakes environment and then immediately move to a high-stakes one (as described in the dictator analogy), while still getting high gains.
It is feasible to design a sandboxed environment formally proven to be secure, in the sense that it cannot be hacked into, escaped from, or have its occupant deliberately let out (which, in particular, precludes AI-box-experiment scenarios). This is even easier for AI systems, which typically involve a very narrow set of operations and interfaces (essentially basic arithmetic, and very constrained input and output channels).
In this scenario, the AI could still offer significant benefits. For example, it could provide formally verified (hence safe) proofs for general math or for the correctness of software (including novel AI system designs that are proven to be aligned according to some [by then] formally defined notion of alignment), or generally assist with research (e.g., with output size limited to allow for human comprehension). I am sure we can come up with many more examples where a highly constrained, highly capable cognitive system can still be extremely beneficial and not as dangerous.
(To be clear, I am not claiming that this approach is easy to achieve or the most likely path forward. However, it is an option that humanity could coordinate on.)
Are you able to provide an example of the kind of thing that would constitute such a theoretical triumph? Or, if not, a maximally close approximation in the form of something that exists currently?