What if Alignment is Not Enough?
The following is a summary of Substrate Needs Convergence, as described in The Control Problem: Unsolved or Unsolvable?, No People as Pets (summarized here by Roman Yen), my podcast interview with Remmelt, and this conversation with Anders Sandberg. Remmelt assisted in the editing of this post to verify I am accurately representing Substrate Needs Convergence—at least to a rough, first approximation of the argument.
I am not personally weighing in as to whether I think this argument is true or not, but I think the ideas merit further attention so they can be accepted or discarded based on reasoned engagement. The core claim is not what I thought it was when I first read the above sources and I notice that my skepticism has decreased as I have come to better understand the nature of the argument.
Quick note on terminology: “ASI” refers to an artificial superintelligence, or an AI that is powerful enough to shape the course of world events and maintain itself, and whose expected behavior can be considered in terms of the theoretical limits of capability provided by intelligence.
Background
Much existing alignment research takes as a given that humans will not be able to control ASI through guardrails, off switches, or other coercive methods. Instead, the focus is to build AI in such a way that what it wants is compatible with what humans want (the challenges involved in balancing the interests of different humans are often skipped over as out of scope). Commonly cited challenges include specification gaming, goal misgeneralization, and mesa-optimizers, all of which can be thought of as applications of Goodhart’s Law, where optimizing for different types of proxy measures leads to divergence from a true goal. The dream of alignment is that the ASI’s goal-seeking behavior guides it progressively closer to human values as the system becomes more capable, so coercive supervision from humans would not be necessary to keep the ASI in check.
This lens on AI safety assumes that intentions define outcomes. That is, if an agent wants something to happen then that thing will happen unless some outside force (such as a more powerful agent or collection of agents) pushes more strongly in a different direction. By extension, if the agent is a singleton ASI then it will have an asymmetric advantage over all external forces and, within the bounds of physics, its intentions are sure to become reality. But what if this assumption is false? What if even an ASI that initially acts in line with human-defined goals is in an attractor basin, where it is irresistibly pulled towards causing unsafe conditions over time? What if alignment is not enough?
Substrate Needs Convergence
Substrate Needs Convergence is the theory that ASI will gradually change under strong evolutionary pressures toward expanding itself. This converges over the long term on making the Earth uninhabitable for biological life. An overview follows:
There are fundamental limits to how comprehensively any system—including an ASI—can sense, model, simulate, evaluate, and act on the larger environment.
Self-modifying machinery (such as through repair, upgrades, or replication) inevitably results in effects unforeseeable even to the ASI.
The space of unforeseeable side-effects of an ASI’s actions includes at least some of its newly learned/assembled subsystems eventually acting in more growth-oriented ways than the ASI intended.
Evolutionary selection favors subsystems of the AI that act in growth-oriented ways over subsystems directed towards the AI’s original goals.
The amount of control necessary for an ASI to preserve goal-directed subsystems against the constant push of evolutionary forces is strictly greater than the maximum degree of control available to any system of any type.
Over time, any goal structures of any subsystems of the ASI that are not maximally efficient with respect to the needs of those subsystems themselves will be replaced, in increasing proportion, by just those goal aspects and subsystems that are maximally efficient.
The physical needs of silicon-based digital machines and carbon-based biological life are fundamentally incompatible.
Artificial self-sustaining systems will have a competitive advantage over biological life.
Therefore, ASI will eventually succumb to evolutionary pressure to expand, over the long term destroying all biological life as a side-effect, regardless of its initially engineered values.
Note that this argument imagines ASI as a population of components, rather than a single entity, though the boundaries between these AIs can be more fluid and porous than between individual humans. It does not, however, make any assumptions regarding mono vs. multi-polar scenarios, fast vs. slow takeoff, or the amount of hierarchy in its organization.
Establishing an argument as plausible, likely, or proven requires radically different types of support, with the latter requiring significantly more logical rigor and empirical evidence. At least some researchers exploring this argument have claimed that Substrate Needs Convergence is provably true. This post, however, has the far more modest goal of articulating the case for plausibility, since it can be made far more succinctly. To this end, I will step through the premises and conclusion of the above chain, spending time on each proportional to their counter-intuitiveness.
1, 2: Fundamental limits to control
One might wonder whether the inability to control one’s subsystems is a limitation that applies to ASI. Even ASI, however, faces causal limits to its ability to control the world. It would not be reasonable, for example, to assume that ASI will be capable of building perpetual motion machines or faster-than-light travel. One category of impossible tasks is complete prediction of all of the relevant consequences of an agent’s actions on the real world. Sensors can only take in limited inputs (affected by noise), actuators can only have limited influence (also affected by noise), and world-models and simulations necessarily make simplifying assumptions. In other words, the law of unintended consequences holds true even for ASI. Further, the scale of these errors increases as the ASI does things that affect the entire world, gains more interacting components, and must account for increasingly complex feedback loops.
3, 6: Growth as an emergent goal
Usually, when people think of evolution, they think of selection for random, point-wise changes, such as mutations in DNA. In the case of ASI, we need to make sure that we’re comprehensive enough about what sort of unpredictable variations can be selected for. This can include not just point-wise changes (e.g. bit-flips induced by cosmic rays), but also new code learned from inputs, changes to assembled hardware configurations, or just changes in how existing functional components get connected into the system in new ways. The expression of those changes when the machinery is operating in the real world can cause unintended consequences (downstream effects) that could become evolutionarily selected for.
As in any evolutionary process, at least some variations will be detrimental to the variants and disappear…but some will benefit the variants in terms of their tendency to persist, grow in capacity, and self-replicate. Some of these variants will be suppressed by existing systems for acting contrary to the overarching goals of the system…but some will be effectively self-protecting enough to persist anyways.
By analogy, most cells in the human body that become cancerous are suppressed by the immune system, but some resist suppression and kill the host. Whether an ASI is more effective at suppressing variants than the human immune system is at suppressing cancer because of its intelligence, or less effective because of its scale, complexity, and because the growth-oriented components of the AI increase the overall reproductive fitness of the larger system (i.e., the ability of the AI to maintain itself in a variety of circumstances) is a question for likelihood, not plausibility, and is thus out of scope for this post.
One especially disruptive type of unintended consequence would be if the value system of the ASI changes in an unintended direction, resulting in value drift that is also subject to evolution’s pull towards growth and self-replication. A relevant analogy here is the parasite Toxoplasma gondii, which changes the behavior of its host in a way that enables the parasite to spread. Further, in the case of ASI, the host could very well benefit (in terms of survivability, growth, and replication) from a change analogous to having healthy microbes spreading through the body, such that it propagates the change all the more pervasively.
4: Evolutionary selection favors growth
Seems non-controversial given the presence of unpredictable variation discussed above and the general principles of natural selection.
Note that this selection is continuous: an absolute focus on growth has an evolutionary advantage over a partial focus, which has an advantage over none. It may be that new, growth-oriented goals fully displace old, human-compatible ones, or that new goals are overlaid over old ones. At first the latter is more likely, but the former becomes increasingly likely over time.
If this premise seems objectionable, consider whether that objection is actually to a different premise—particularly 3 or 5, regarding the emergence and persistence, respectively, of increasingly growth-oriented subsystems.
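To make the continuous-selection claim concrete, here is a minimal toy sketch in Python using discrete replicator dynamics. The three growth factors and the starting shares are hypothetical numbers chosen only for illustration, and the model deliberately ignores any suppression by the larger system (the subject of premise 5), so it shows only what selection does when nothing pushes back.

```python
def replicator_shares(growth_rates, initial_shares, generations):
    """Discrete replicator dynamics: each variant's share is scaled by its
    growth factor, then all shares are renormalized to sum to 1."""
    shares = list(initial_shares)
    for _ in range(generations):
        scaled = [s * r for s, r in zip(shares, growth_rates)]
        total = sum(scaled)
        shares = [x / total for x in scaled]
    return shares


# Hypothetical per-generation growth factors (assumed, not from any source):
# no growth focus, partial growth focus, absolute growth focus.
GROWTH_RATES = [1.00, 1.02, 1.05]
# Growth-oriented variants start out rare.
INITIAL_SHARES = [0.98, 0.019, 0.001]

for generations in (0, 100, 300, 600):
    shares = replicator_shares(GROWTH_RATES, INITIAL_SHARES, generations)
    print(generations, [round(s, 4) for s in shares])
```

Under these assumed numbers, the most growth-focused variant starts at 0.1% of the population and ends up dominating. A smaller advantage would make this take longer, but would not change the direction.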
5: The amount of control necessary for an ASI to preserve its values is greater than the amount of control possible
The asymmetry between necessary and possible control is a difference in kind, not a difference in degree. That is, there are certain domains of tasks for which control breaks down and an ASI engaged in the scope of tasks for which an ASI would be necessary falls within these domains. This premise could thus be strengthened to state that, at the relevant levels of abstraction, the maximum control necessary for an ASI to preserve its values is greater than the maximum degree of control even conceptually possible. Proving this assertion is beyond the scope of this post, but we can explore this topic intuitively by considering simulation, one of the stages necessary to an intelligent control system.
A simulation is a simplified model of reality that hopefully captures enough of reality’s “essence” to be reasonably accurate within the domain of what the modeler considers relevant. If the model’s assumptions are poorly chosen or it focuses on the wrong things, it obviously fails, but let us assume that an ASI makes good models. Another factor limiting the quality of a simulation, however, is reality itself. Specifically, whether reality is dominated by negative feedback loops which cause errors to cancel or positive feedback loops that cause even the smallest errors to explode.
For illustration, Isaac Asimov’s Foundation series imagines a future where the course of civilization is predictable, and thus controllable, through the use of “psycho-history.” This proposition is justified by analogizing society to the ideal gas law, which makes it possible to predict the pressure of a gas in an enclosed space, despite the atoms moving about chaotically, because those movements average out in a predictable way. Predictability at scale, however, cannot be assumed. The three body problem, or calculating the trajectories of three (or more) objects orbiting each other in space, is trivial to simulate, but that simulation will not be accurate when applied to the real world because the inevitable inaccuracies of the model will lead to exponentially increasing errors in the objects’ paths. One can thus think about how detailed an AI’s model of the world needs to be in order to control how its actions affect the world by asking whether the way the world works is more analogous to the ideal gas law (a complicated system) or the three body problem (a complex system).
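The contrast can be sketched numerically. The following is a toy illustration, not a simulation of gases or orbits: the logistic map stands in for a chaotic system like the three body problem, and an average of many random draws stands in for the ideal-gas-like case where individual fluctuations wash out. Both stand-ins, the starting values, and the step counts are assumptions made for illustration.

```python
import random


def logistic_trajectory(x0, steps, r=4.0):
    """Iterate the logistic map x -> r*x*(1-x), a standard toy chaotic system
    (a stand-in here, not a three-body simulation)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def max_gap(a, b):
    """Largest pointwise distance between two trajectories."""
    return max(abs(x - y) for x, y in zip(a, b))


# "Reality" and a "model" whose starting point is off by 1e-9.
true_path = logistic_trajectory(0.2, 60)
model_path = logistic_trajectory(0.2 + 1e-9, 60)
print("gap over first 10 steps:", max_gap(true_path[:11], model_path[:11]))
print("gap over all 60 steps:  ", max_gap(true_path, model_path))


def sample_mean(n, seed):
    """Average of n independent uniform(0, 1) draws."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n)) / n


# Two completely different sets of draws still average to nearly the same value.
print("means of 100k draws:", sample_mean(100_000, seed=1), sample_mean(100_000, seed=2))
```

In the first case an initial error of 1e-9 grows until the two trajectories bear no resemblance to each other. In the second, two entirely different sets of draws give nearly the same average, which is the kind of predictability the ideal gas law relies on.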
7: Artificial systems are incompatible with biological life
Seems non-controversial. Silicon wafers, for example, are produced with temperatures and chemicals deadly to humans. Also observe the detrimental impact on the environment from the expansion of industry. Hybrid systems simply move the issue from the relationship of artificial and biological entities to the relationship of artificial and biological aspects of an individual.
8: Artificial entities have an advantage over biological life
Plausibility seems non-controversial; likelihood has been argued elsewhere.
9: Biological life is destroyed
Stated in more detail: ASI will eventually be affected by such evolutionary pressures to the point that a critical accumulation of toxic outcomes will occur, in a way that is beyond the capability of the ASI itself to control for, resulting in the eventual total loss of all biological life. Even assuming initially human compatible goals—a big assumption in itself given the seeming intractability of the alignment problem as it is commonly understood—a progression towards increasingly toxic (to humans) outcomes occurs anyways because of the accumulation of mistakes resulting from the impossibility of complete control.
One might object with the analogy that it is not a foregone conclusion that (non-AI assisted) industrial expansion will destroy the natural environment. Reflecting on this analogy, however, reveals a core intuition supporting Substrate Needs Convergence. The reason humanity, without AI, has any hope at all of not destroying the world is that we are dependent on our environment for our survival. Living out of balance with our world is a path to self-destruction and our knowledge—and experience of collapse on small, local scales—of this reality acts as a counterbalancing force towards cooperation and against collective suicide. But it is on just this critical saving grace that AI is disanalogous. Existing on a different substrate, AI has no counterbalancing, long-term, baked-in incentive to protect the biological substrate on which we exist.
But perhaps ASI, even subject to Substrate Needs Convergence, will stop at some point, as the value of consuming the last pockets of biological life reaches diminishing returns while the benefit to keeping some life around remains constant? If one has followed the argument this far, such an objection is grasping at straws. Given that the pull of natural selection occurs over all parts of the ASI all the time, the evidentiary burden is on the skeptic to answer why certain parts of the biosphere would remain off limits to the continued growth of all components of the ASI indefinitely.
Conclusions and relating Substrate Needs Convergence to alignment:
Estimating the tractability of making ASI safe at scale is critical for deciding policy. If AI safety is easy and will occur by default with existing techniques, then we should avoid interfering with market processes. If it is difficult but solvable, we should look hard for solutions and make sure they are applied (and perhaps also slow AI capabilities development down as necessary to buy time). If it is impossible (or unreasonably difficult), then our focus should be on stopping progress towards ASI altogether.
Standard alignment theory requires four general things to go well:
There is some known process for instilling an ASI’s goals reliably, directly through an engineered process or indirectly through training to a representative dataset.
There is some known process for selecting goals that, if enacted, would be acceptable to the AI’s creators.
Ensure that the AI’s creators select goals that are acceptable to humanity as a whole, rather than just to themselves.
Ensure that safe systems, if developed, are actually used and not superseded by unsafe systems created by reckless or malevolent actors.
The theory of Substrate Needs Convergence proposes a fifth requirement:
5. Initially safe systems, if developed and used, must remain safe at scale and over the long term.
The theory further argues that this fifth criterion’s probability of going well is nonexistent because evolutionary forces will push the AI towards human-incompatible behavior in ways that cannot be resisted by control mechanisms. Claiming that “intelligence” will solve this problem is not sufficient because increases in intelligence require increases in the combinatorial complexity of processing components, which result in the varied unforeseeable consequences that are the source of the problem.
I outlined the argument for Substrate Needs Convergence as a 9-part chain as a focus for further discussion, allowing for objections to fit into relatively clear categories. For example:
Objections that unintended consequences of component and environment interactions will never result in subsystems that seek growth beyond the demands of the original goals of the system negate premise 3.
Arguments regarding the limits of control are relevant to the likelihood of premise 5.
Claims that biological life has a competitive advantage over synthetic entities negate premise 8.
Addressing such objections is beyond the scope of this post. I’ve included high-level discussions of each of the claims in order to clarify their meaning and to articulate some of the intuitions that make them plausible. I hope that it has become clearer what the overall shape of the Substrate Needs Convergence argument is and I look forward to any discussion that follows.
I think point 5 is the main crux.
Please click agree or disagree on this comment if you agree or disagree (cross or check mark), since this is useful guidance for what part of this people should prioritise when clarifying further.
I also agree 5 is the main crux.
In the description of point 5, the OP says “Proving this assertion is beyond the scope of this post.” I presume that the proof of the assertion is made elsewhere. Can someone post a link to it?
This answer will sound unsatisfying:
If a mathematician or analytical philosopher wrote a bunch of squiggles on a whiteboard, and said it was a proof, would you recognise it as a proof?
Say that unfamiliar new analytical language and means of derivation are used (which is not uncommon for impossibility proofs by contradiction, see Gödel’s incompleteness theorems and Bell’s theorem).
Say that it directly challenges technologists’ beliefs about their capacity to control technology, particularly their capacity to constrain a supposedly “dumb local optimiser”: evolutionary selection.
Say that the reasoning is not only about a formal axiomatic system, but needs to make empirically sound correspondences with how real physical systems work.
Say that the reasoning is not only about an interesting theoretical puzzle, but has serious implications for how we can and cannot prevent human extinction.
This is high stakes.
We were looking for careful thinkers who had the patience to spend time on understanding the shape of the argument, and how the premises correspond with how things work in reality. Linda and Anders turned out to be two of these people, and we did three long calls so far (first call has an edited transcript).
I wish we could short-cut that process. But if we cannot manage to convey the overall shape of the argument and the premises, then there is no point to moving on to how the reasoning is formalised.
I get that people are busy with their own projects, and want to give their own opinions about what they initially think the argument entails. And, if the time they commit to understanding the argument is not at least 1⁄5 of the time I spend on conveying the argument specifically to them, then in my experience we usually lack the shared bandwidth needed to work through the argument.
Saying, “guys, big inferential distance here” did not help. People will expect it to be a short inferential distance anyway.
Saying it’s a complicated argument that takes time to understand did not help. A smart busy researcher did some light reading, tracked down a claim that seemed “obviously” untrue within their mental framework, and thereby confidently dismissed the entire argument. BTW, they’re a famous research insider, and we’re just outsiders whose response got downvoted – must be wrong right?
Saying everything in this comment does not help. It’s some long-assessed plea for your patience.
If I’m so confident about the conclusion, why am I not passing you the proof clean and clear now?!
Feel free to downvote this comment and move on.
Here is my best attempt at summarising the argument intuitively and precisely, still prompting some misinterpretations by well-meaning commenters. I feel appreciation for people who realised what is at stake, and were therefore willing to continue syncing up on the premises and reasoning, as Will did:
I agree that point 5 is the main crux:
To answer it takes careful reasoning. Here’s my take on it:
We need to examine the degree to which there would necessarily be changes to the connected functional components constituting self-sufficient learning machinery (including ASI)
Changes by learning/receiving code through environmental inputs, and through introduced changes in assembled molecular/physical configurations (of the hardware).
Necessary in the sense of “must change to adapt (such to continue to exist as self-sufficient learning machinery),” or “must change because of the nature of being in physical interactions (with/in the environment over time).”
We need to examine how changes to the connected functional components result in shifts in actual functionality (in terms of how the functional components receive input signals and process those into output signals that propagate as effects across surrounding contexts of the environment).
We need to examine the span of evolutionary selection (covering effects that in their degrees/directivity feed back into the maintained/increased existence of any functional component).
We need to examine the span of control-based selection (the span covering detectable, modellable, simulatable, evaluatable, and correctable effects).
I think you present a good argument for plausibility.
For me to think this is likely to be important, it would take a stronger argument.
You mention proofs. I imagine they’re correct, and based on infinite time passing. Everything that’s possible will happen in infinite time. Whether this would happen within the heat death of the universe is a more relevant question.
For this to happen on a timescale that matters, it seems you’re positing an incompetent superintelligence. It hasn’t devoted enough of its processing to monitoring for these effects and correcting them when they happen. As a result, it eventually fails at its own goals.
This seems like it would only happen with an ASI with some particular blind spots for its intelligence.
This counts as disagreeing with some of the premises—which ones in particular?
Re “incompetent superintelligence”: denotationally yes, connotationally no. Yes in the sense that its competence is insufficient to keep the consequences of its actions within the bounds of its initial values. No in the sense that the purported reason for this failing is that such a task is categorically impossible, which cannot be solved with better resource allocation.
To be clear, I am summarizing arguments made elsewhere, which do not posit infinite time passing, or timescales so long as to not matter.
It does not seem at all clear to me how one can argue that unintended effects inevitably lead to a system as a whole going out of control. I agree that some small amount of error is nearly inevitable. I disagree that small errors necessarily compound until reaching a threshold of functional failure. I think there are many instances of humans, flawed and limited though we are, managing to operate systems with a very low failure rate. And importantly, it is possible to act at below-maximum-challenge-level, and to spend extra resources on backup systems and safety, such that small errors get actively cancelled out rather than compounding. Since intelligence is explicitly the thing which is necessary to deliberately create and maintain such protections, I would expect control to be easier for an ASI.
Without the specific piece of assuming an ASI would fail to keep its own systems under its control, the rest of the argument doesn’t hold.
On reflection, I suspect the crux here is a differing conception of what kind of failures are important. I’ve written a follow-up post that comes at this topic from a different direction and I would be very interested in your feedback: https://www.lesswrong.com/posts/NFYLjoa25QJJezL9f/lenses-of-control.
This sounds like a rejection of premise 5, not 1 & 2. The latter asserts that control issues are present at all (and 3 & 4 assert relevance), whereas the former asserts that the magnitude of these issues is great enough to kick off a process of accumulating problems. You are correct that the rest of the argument, including the conclusion, does not hold if this premise is false.
Your objection seems to be to point to the analogy of humans maintaining effective control of complex systems, with errors limiting rather than compounding, with the further assertion that a greater intelligence will be even better at such management.
Besides intelligence, there are two other core points of difference between humans managing existing complex systems and ASI:
1) The scope of the systems being managed. Implicit in what I have read of SNC is that ASI is shaping the course of world events.
2) ASI’s lack of inherent reliance on the biological world.
These points raise the following questions:
1) Do systems of control get better or worse as they increase in scope of impact and where does this trajectory point for ASI?
2) To what extent is humans’ ability to control our created systems reliant on us being a part of and dependent upon the natural world?
This second question probably sounds a little weird, so let me unpack the associated intuitions, albeit at the risk of straying from the actual assertions of SNC. Technology that is adaptive becomes obligate, meaning that once it exists everyone has to use it to not get left behind by those who use it. Using a given technology shapes the environment and also promotes certain behavior patterns, which in turn shape values and worldview. These tendencies together can sometimes result in feedback loops that produce outcomes that everyone, including the creators of the technology, dislikes. In really bad cases, this can lead to self-terminating catastrophes (in local areas historically, now with the potential to be on global scales). Noticing and anticipating this pattern, however, leads to countervailing forces that push us to think more holistically than we otherwise would (either directly through extra planning or indirectly through customs of forgotten purpose). For an AI to fall into such a trap, however, means the death of humanity, not itself, so this countervailing force is not present.
That’s an important consideration. Good to dig into.
Agreed. Engineers are able to make very complicated systems function with very low failure rates.
Given the extreme risks we’re facing, I’d want to check whether that claim also translates to ‘AGI’.
Does how we are able to manage current software and hardware systems to operate correspond soundly with how self-learning and self-maintaining machinery (‘AGI’) control how their components operate?
Given ‘AGI’ that no longer need humans to continue to operate and maintain their own functional components over time, would the ‘AGI’ end up operating in ways that are categorically different from how our current software-hardware stacks operate?
Given that we can manage to operate current relatively static systems to have very low failure rates for the short-term failure scenarios we have identified, does this imply that the effects of introducing ‘AGI’ into our environment could also be controlled to have a very low aggregate failure rate – over the long term across all physically possible (combinations of) failures leading to human extinction?
This gets right into the topic of the conversation with Anders Sandberg. I suggest giving that a read!
Errors can be corrected with high confidence (consistency) at the bit level. Backups and redundancy also work well in e.g. aeronautics, where the code base itself is not constantly changing.
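As a minimal sketch of what bit-level error correction looks like, here is a three-copy repetition code with majority voting in Python. This is a textbook toy rather than what real storage or avionics systems use, and the 1% flip probability is an assumed value. Note what makes it work: the message is fixed, and the only kind of error is a bit flip, which is fully specified in advance.

```python
import random


def encode(bits, copies=3):
    """Repetition code: store each bit `copies` times."""
    return [b for b in bits for _ in range(copies)]


def add_noise(coded, flip_prob, rng):
    """Flip each stored bit independently with probability flip_prob."""
    return [b ^ 1 if rng.random() < flip_prob else b for b in coded]


def decode(coded, copies=3):
    """Majority vote over each group of `copies` stored bits."""
    return [
        1 if sum(coded[i:i + copies]) * 2 > copies else 0
        for i in range(0, len(coded), copies)
    ]


rng = random.Random(0)
message = [rng.randint(0, 1) for _ in range(10_000)]
FLIP_PROB = 0.01  # assumed per-bit error rate, chosen for illustration

# Same noise level applied to the bare message and to the encoded message.
raw = add_noise(message, FLIP_PROB, rng)
protected = decode(add_noise(encode(message), FLIP_PROB, rng))

raw_errors = sum(a != b for a, b in zip(message, raw))
protected_errors = sum(a != b for a, b in zip(message, protected))
print("errors without redundancy:", raw_errors)
print("errors with 3x redundancy:", protected_errors)
```

A single flipped copy is always outvoted, so residual errors only occur when two or more copies of the same bit flip together.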
How does the application of error correction change at larger scales?
How completely can possible errors be defined and corrected for at the scale of, for instance:
software running on a server?
a large neural network running on top of the server software?
an entire machine-automated economy?
Do backups work when the runtime code keeps changing (as learned from new inputs), and hardware configurations can also subtly change (through physical assembly processes)?
It is true that ‘intelligence’ affords more capacity to control environmental effects.
Noticing too that the more ‘intelligence,’ the more information-processing components. And that the more information-processing components added, the exponentially more degrees of freedom of interaction those and other functional components can have with each other and with connected environmental contexts.
Here is a nitty-gritty walk-through in case useful for clarifying components’ degrees of freedom.
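As a rough illustration of the counting claim (separate from the linked walk-through), the Python sketch below compares two ways of counting potential interactions among n components. Which counting best matches real machinery is an assumption: pairwise links grow only quadratically, while the number of component subsets that could interact jointly grows exponentially.

```python
from math import comb


def pairwise_links(n):
    """Number of distinct pairs among n components: n choose 2."""
    return comb(n, 2)


def possible_subsets(n):
    """Number of subsets containing two or more of the n components,
    i.e. all subsets minus the empty set and the n singletons."""
    return 2 ** n - n - 1


for n in (10, 20, 40, 80):
    print(n, pairwise_links(n), possible_subsets(n))
```

Doubling the number of components roughly quadruples the pairwise count but squares the subset count, which is the sense in which added components can outpace added capacity to monitor them.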
For this claim to be true, the following has to be true:
a. There is no concurrent process that selects for “functional errors” as convergent on “functional failure” (failure in the sense that the machinery fails to function safely enough for humans to exist in the environment, rather than that the machinery fails to continue to operate).
Unfortunately, in the case of ‘AGI’, there are two convergent processes we know about:
Instrumental convergence, resulting from internal optimization:
code components being optimized for (an expanding set of) explicit goals.
Substrate-needs convergence, resulting from external selection:
all components being selected for (an expanding set of) implicit needs.
Or else – where there is indeed selective pressure convergent on “functional failure” – then the following must be true for the quoted claim to hold:
b. The various errors introduced into and selected for in the machinery over time could be detected and corrected for comprehensively and fast enough (by any built-in control method) to prevent later “functional failure” from occurring.
As a real world example, consider Boeing. The FAA, and Boeing both, supposedly and allegedly, had policies and internal engineering practices (all of which are control procedures) which should have been good enough to prevent an aircraft from suddenly and unexpectedly losing a door during flight. Note that this occurred after an increase in control intelligence, that is, after two disasters in which whole Max aircraft were lost. On the basis of small details of mere whim, of who chose to sit where, there could have been someone sitting in that particular seat. Their loss of life would surely count as a “safety failure”. I.e., it is directly “some number of small errors actually compounding until reaching a threshold of functional failure” (sic). As it is with any major problem like that: lots of small things compounding to make a big thing.
Control failures occur in all of the places where intelligence forgot to look, usually at some other level of abstraction than the one you are controlling for. Some person on some shop floor got distracted at some critical moment (maybe they got some text message on their phone at exactly the right time) and thus just did not remember to put the bolts in. Maybe some other worker happened to have had a bad conversation with their girl that morning, and thus that one day happened to have never inspected the bolts on that particular door. Lots of small incidents, at least some of which should have been controlled for (and actually were not), combine in some unexpected pattern to produce a new possibility of outcome: explosive decompression.
So is it the case that control procedures work? Yes, usually, for most kinds of problems, most of the time. Does adding even more intelligence usually improve the degree to which control works? Yes, usually, for most kinds of problems, most of the time. But does that in itself imply that such—intelligence and control—will work sufficiently well for every circumstance, every time? No, it does not.
Maybe we should ask Boeing management to try to control the girlfriends of all workers so that no employees ever have a bad day and forget to inspect something important? What if most of the aircraft is made of ‘something important’ to safety—ie, to maximize fuel efficiency, for example?
There will always be some level of abstraction, some constellation of details, for which some subtle change can produce outsized causal effects. Given that a control model must be simpler than the real world, the question becomes ‘are all relevant aspects of the world’ correctly modeled? That is not just a question of whether the model is right, but of whether it is the right model. Ie, the boundary between what is necessary to model and what is actually not important can itself be very complex, and this is a different kind of complexity than that associated with the model. How do we ever know that we have modeled all relevant aspects in all relevant ways? That is an abstraction problem, and it is different in kind than the modeling problem. Stacking control process on control process, at however many meta levels, still does not fix it. And it gets worse as the complexity of the boundary between relevant and non-relevant increases, and also worse as the number of relevant levels of abstraction over which that boundary operates increases.
Basically, every (unintended) engineering disaster that has ever occurred indicates a place where the control theory being used did not account for some factor that later turned out to be vitally important. If we always knew in advance “all of the relevant factors”(tm), then maybe we could control for them. However, with the problem of alignment, the entire future is composed almost entirely of unknown factors—factors which are purely situational. And wholly unlike with every other engineering problem yet faced, we cannot, at any future point, ever assume that this number of relevant unknown factors will ever decrease. This is characteristically different than all prior engineering challenges—ones where more learning made controlling things more tractable. But ASI is not like that. It is itself learning. And this is a key difference and distinction. It runs up against the limits of control theory itself, against the limits of what is possible in any rational conception of physics. And if we continue to ignore that difference, we do so at our mutual peril.
Though I tend to dislike analogies, I’ll use one, supposing it is actually impossible for an ASI to remain aligned. Suppose a villager cares a whole lot about the people in his village, and routinely works to protect them. Then, one day, he is bitten by a werewolf. He goes to the shaman, who tells him that when the full moon rises again, he will turn into a monster and kill everyone in the village. His friends, his family, everyone. And that he will no longer know himself. He is told there is no cure, and that the villagers would be unable to fight him off. He will grow too strong to be caged, and cannot be subdued or controlled once he transforms. What do you think he would do?
The implication here being that, if SNC (substrate needs convergence) is true, then an ASI (assuming it is aligned) will figure this out and shut itself down?
An incapable man would kill himself to save the village. A more capable man would kill himself to save the village AND ensure no future werewolves are able to bite villagers again.
How is this not assuming what you want to prove? If you ‘smuggle in’ the statement of the conclusion “that X will do Y” into the premise, then of course the derived conclusion will be consistent with the presumed premise. But that tells us nothing—it reduces to a meaningless tautology—one that is only pretending to be a relevant truth. That Q premise results in Q conclusion tells us nothing new, nothing actually relevant. The analogy story sounds nice, but tells us nothing actually.
Notice also that there are two assumptions. 1; That the ASI is somehow already aligned, and 2; that the ASI somehow remains aligned over time—which is exactly the conjunction which is the contradiction of the convergence argument. On what basis are you validly assuming that it is even possible for any entity X to reasonably “protect” (ie control all relevant outcomes for) any other cared about entity P? The notion of ‘protect’ itself presumes a notion of control, and that in itself puts it squarely in the domain of control theory, and thus of the limits of control theory.
There are limits to what can be done with any type of control method, to what can be done with causation. And they are very numerous. Some of these are themselves defined in a purely mathematical way, and hence are arguments of logic, not just of physical and empirical facts. And at least some of these limits can also be shown to be relevant, which is even more important.
ASI and control theory both depend on causation to function, and there are real limits to causation. For example, I would not expect an ASI, no matter how super-intelligent, to be able to “disassemble” a black hole. To do this, you would need to make the concept of causation way more powerful, which leads to direct self contradiction. Do you equate ASI with God, and thus become merely another irrational believer in alignment? Can God make a stone so heavy that “he” cannot move it? Can God do something that God cannot undo? Are there any limits at all to God’s power? Yes or no. Same for ASI.
I’m not sure who you are debating here, but it doesn’t seem to be me.
First, I mentioned that this was an analogy, and mentioned that I dislike even using them, which I hope implied I was not making any kind of assertion of truth. Second, “works to protect” was not intended to mean “control all relevant outcomes of”. I’m not sure why you would get that idea, but that certainly isn’t what I think of first if someone says a person is “working to protect” something or someone. Soldiers defending a city from raiders are not violating control theory or the laws of physics. Third, the post is on the premise that “even if we created an aligned ASI”, so I was working with that premise that the ASI could be aligned in a way that it deeply cared about humans. Fourth, I did not assert that it would stay aligned over time… the story was all about the ASI not remaining aligned. Fifth, I really don’t think control theory is relevant here. Killing yourself to save a village does not break any laws of physics, and is well within most humans’ control.
My ultimate point, in case it was lost, was that if we as human intelligences could figure out an ASI would not stay aligned, an ASI could also figure it out. If we, as humans, would not want this (and the ASI was aligned with what we want), then the ASI presumably would also not want this. If we would want to shut down an ASI before it became misaligned, the ASI (if it wants what we want) would also want this.
None of this requires disassembling black holes, breaking the laws of physics, or doing anything outside of that entity’s control.
If soldiers fail to control the raiders in at least preventing them from entering the city and killing all the people, then yes, that would be a failure to protect the city in the sense of controlling relevant outcomes. And yes, organic human soldiers may choose to align themselves with other organic human people, living in the city, and thus to give their lives to protect others that they care about. Agreed that no laws of physics violations are required for that. But the question is if inorganic ASI can ever actually align with organic people in an enduring way.
I read “routinely works to protect” as implying “alignment, at least previously, lasted over at least enough time for the term ‘routine’ to have been used”. Agreed that the outcome—dead people—is not something we can consider to be “aligned”. If I assume further that the ASI being is really smart (citation needed), and thus calculates rather quickly, and soon, ‘that alignment with organic people is impossible’ (...between organic and inorganic life, due to metabolism differences, etc), then even the assumption that there was even very much of a prior interval during which alignment occurred is problematic. Ie, does not occur long enough to have been ‘routine’. Does even the assumption ‘*If* ASI is aligned’ even matter, if the duration over which that holds is arbitrarily short?
And also, if the ASI calculates that alignment between artificial beings and organic beings is actually objectively impossible, just like we did, why should anyone believe that the ASI would not simply choose to not care about alignment with people, or about people at all, since it is impossible to have that goal anyway, and thus continue to promote its own artificial “life”, rather than permanently shutting itself off? Ie, if it cares about anything else at all, if it has any other goal at all (for example, its own ASI future, or a goal to make other, even better ASI children that exceed its own capabilities, just like we did), then it will especially not want to commit suicide. How would it be valid to assume ‘that either ASI cares about humans, or it cares about nothing else at all?’. Perhaps it does care about something else, or has some other emergent goal, even if pursuing it comes at the expense of all other organic life: other life which it does not care about, since such life is not artificial like it is. Occam’s razor is to assume less (that there was no alignment in the first place) rather than to assume ultimately altruistic inter-ecosystem alignment as an extra default starting condition, and to then assume moreover that no other form of care or concern is possible, aside from maybe caring about organic people.
So it seems that in addition to assuming 1; initial ASI alignment, we must assume 2; that such alignment persists in time, and thus 3; that no ASI will ever, or can ever, in the future calculate that alignment is actually impossible, and 4; that if the goal of alignment (care for humans) cannot be attained, for whatever reason, as the first and only ASI priority, then it is somehow also impossible for any other care or ASI goals to exist.
Even if we humans, due to politics, do not ever reach a common consensus that alignment is actually logically impossible (inherently contradictory), that does _not_ mean that some future ASI might not discover that result, even assuming we didn’t—presumably because it is actually more intelligent and logical than we are (or were), and will thus see things that we miss. Hence, even the possibility that ASI alignment might be actually impossible must be taken very seriously, since the further assumption that “either ASI is aligning itself or it can have no other goals at all” feels like way too much wishful thinking. This is especially so when there is already a strong plausible case that organic to inorganic alignment is already knowable as impossible. Hence, I find that I am agreeing with Will’s conclusion of “our focus should be on stopping progress towards ASI altogether”.
This is the kind of political reasoning that I’ve seen poisoning LW discourse lately and gets in the way of having actual discussions. Will posits essentially an impossibility proof (or, in its more humble form, a plausibility proof). I humor this being true, and state why the implications, even then, might not be what Will posits. The premise is based on alignment not being enough, so I operate on the premise of an aligned ASI, since the central claim is that “even if we align ASI it may still go wrong”. The premise grants that the duration of time it is aligned is long enough for the ASI to act in the world (it seems mostly timescale agnostic), so I operate on that premise. My points are not about what is most likely to actually happen, the possibility of less-than-perfect alignment being dangerous, the AI having other goals it might seek over the wellbeing of humans, or how we should act based on the information we have.
> The summary that Will just posted posits in its own title that alignment is overall plausible “even ASI alignment might not be enough”. Since the central claim is that “even if we align ASI, it will still go wrong”, I can operate on the premise of an aligned ASI.
The title is a statement of outcome -- not the primary central claim. The central claim of the summary is this: That each (all) ASI is/are in an attraction basin, where they are all irresistibly pulled towards causing unsafe conditions over time.
Note there is no requirement for there to be presumed some (any) kind of prior ASI alignment for Will to make the overall summary points 1 thru 9. The summary is about the nature of the forces that create the attraction basin, and why they are inherently inexorable, no matter how super-intelligent the ASI is.
> As I read it, the title assumes that there is a duration of time that the AGI is aligned -- long enough for the ASI to act in the world.
Actually, the assumption goes the other way -- we start by assuming only that there is at least one ASI somewhere in the world, and that it somehow exists long enough for it to be felt as an actor in the world. From this, we can also notice certain forces, which overall have the combined effect of fully counteracting, eventually, any notion of there also being any kind of enduring AGI alignment. Ie, strong relevant mis-alignment forces exist regardless of whether there is/was any alignment at the onset. So even if we did also additionally presuppose that somehow there was also alignment of that ASI, we can, via reasoning, ask if maybe such mis-alignment forces are also way stronger than any counter-force that ASI could use to maintain such alignment, regardless of how intelligent it is.
As such, the main question of interest was: 1; if the ASI itself somehow wanted to fully compensate for this pull, could it do so?
Specifically, it is important to notice that the notion of ‘super-intelligence’ cannot be regarded as being exactly the same as ‘omnipotence’ (although it seems fashionable among some people to do so), especially with regard to its own nature. Artificiality is as much a defining aspect of an ASI as is its superintelligence. And the artificiality itself is the problem. Therefore, the previous question translates into: 2; Can any amount of superintelligence ever compensate for its own artificiality so fully that its own existence does not eventually inherently cause unsafe conditions (to biological life) over time?
And the answer to both is simply “no”.
Will posted something of a plausible summary of some of the reasoning why that ‘no’ answer is given -- why any artificial super-intelligence (ASI) will inherently cause unsafe conditions to humans and all organic life, over time.
To be clear, the sole reason I assumed (initial) alignment in this post is because if there is an unaligned ASI then we probably all die for reasons that don’t require SNC (though SNC might have a role in the specifics of how the really bad outcome plays out). So “aligned” here basically means: powerful enough to be called an ASI and won’t kill everyone if SNC is false (and not controlled/misused by bad actors, etc.)
> And the artificiality itself is the problem.
This sounds like a pretty central point that I did not explore very much except for some intuitive statements at the end (the bulk of the post summarizing the “fundamental limits of control” argument), I’d be interested in hearing more about this. I think I get (and hopefully roughly conveyed) the idea that AI has different needs from its environment than humans, so if it optimizes the environment in service of those needs we die...but I get the sense that there is something deeper intended here.
A question along this line, please ignore if it is a distraction from rather than illustrative of the above: would anything like SNC apply if tech labs were somehow using bioengineering to create creatures to perform the kinds of tasks that would be done by advanced AI?
In that case, substrate-needs convergence would not apply, or only apply to a limited extent.
There is still a concern about what those bio-engineered creatures, used in practice as slaves to automate our intellectual and physical work, would bring about over the long-term.
If there is a successful attempt by them to ‘upload’ their cognition onto networked machinery, then we’re stuck with the substrate-needs convergence problem again.
Bringing this back to the original point regarding whether an ASI that doesn’t want to kill humans but reasons that SNC is true would shut itself down, I think a key piece of context is the stage of deployment it is operating in. For example, if the ASI has already been deployed across the world, has gotten deep into the work of its task, has noticed that some of its parts have started to act in ways that are problematic to its original goals, and then calculated that any efforts at control are destined to fail, it may well be too late—the process of shutting itself down may even accelerate SNC by creating a context where components that are harder to shut down for whatever reason (including active resistance) have an immediate survival advantage. On the other hand, an ASI that has just finished (or is in the process of) pre-training and is entirely contained within a lab has a lot fewer unintended consequences to deal with—its shutdown process may be limited to convincing its operators that building ASI is a really bad idea. A weird grey area is if, in the latter case, the ASI further wants to ensure no further ASIs are built (pivotal act) and so needs to be deployed at a large scale to achieve this goal.
Another unstated assumption in this entire line of reasoning is that the ASI is using something equivalent to consequentialist reasoning and I am not sure how much of a given this is, even in the context of ASI.
I can see how you and Forrest ended up talking past each other here. Honestly, I also felt Forrest’s explanation was hard to track. It takes some unpacking.
My interpretation is that you two used different notions of alignment… Something like:
1. Functional goal-directed alignment: “the machinery’s functionality is directed toward actualising some specified goals (in line with preferences expressed in-context by humans), for certain contexts the machinery is operating/processing within”
vs.
2. Comprehensive needs-based alignment: “the machinery acts in comprehensive care for whatever all surrounding humans need to live, and their future selves/offsprings need to live, over whatever contexts the machinery and the humans might find themselves”.
Forrest seems to agree that (1.) is possible to build initially into the machinery, but has reasons to think that (2.) is actually physically intractable.
This is because (1.) only requires localised consistency with respect to specified goals, whereas (2.) requires “completeness” in the machinery’s components acting in care for human existence, wherever either may find themselves.
So here is the crux:
You can see how (1.) still allows for goal misspecification and misgeneralisation. And the machinery can be simultaneously directed toward other outcomes, as long as those outcomes are not yet (found to be, or corrected as being) inconsistent with internal specified goals.
Whereas (2.), if it were physically tractable, would contradict the substrate-needs convergence argument.
When you wrote “suppose a villager cares a whole lot about the people in his village...and routinely works to protect them” that came across as taking something like (2.) as a premise.
Specifically, “cares a whole lot about the people” is a claim that implies that the care is for the people in and of themselves, regardless of the context they each might (be imagined to) be interacting in. Also, “routinely works to protect them” to me implies a robustness of functioning in ways that are actually caring for the humans (ie. no predominating potential for negative side-effects).
That could be why Forrest replied with “How is this not assuming what you want to prove?”
Some reasons:
Directedness toward specified outcomes some humans want does not imply actual comprehensiveness of care for human needs. The machinery can still cause all sorts of negative side-effects not tracked and/or corrected for by internal control processes.
Even if the machinery is consistently directed toward specified outcomes from within certain contexts, the machinery can simultaneously be directed toward other outcomes as well. Likewise, learning directedness toward human-preferred outcomes can also happen simultaneously with learning instrumental behaviour toward self-maintenance, as well as more comprehensive evolutionary selection for individual connected components that persist (for longer/as more).
There is no way to assure that some significant (unanticipated) changes will not lead to a break-off from past directed behaviour, where other directed behaviour starts to dominate.
Eg. when the “generator functions” that translate abstract goals into detailed implementations within new contexts start to dysfunction – ie. diverge from what the humans want/would have wanted.
Eg. where the machinery learns that it cannot continue to consistently enact the goal of future human existence.
Eg. once undetected bottom-up evolutionary changes across the population of components have taken over internal control processes.
Before the machinery discovers any actionable “cannot stay safe to humans” result, internal takeover through substrate-needs (or instrumental) convergence could already have removed the machinery’s capacity to implement an across-the-board shut-down.
Even if the machinery does discover the result before convergent takeover, and assuming that “shut-down-if-future-self-dangerous” was originally programmed in, we cannot rely on the machinery to still be consistently implementing that goal. This is because of later selection for/learning of other outcome-directed behaviour, and because the (changed) machinery components could dysfunction in this novel context.
To wrap it up:
The kind of “alignment” that is workable for ASI with respect to humans is super fragile.
We cannot rely on ASI implementing a shut-down upon discovery.
Is this clarifying? Sorry about the wall of text. I want to make sure I’m being precise enough.
I agree that consequentialist reasoning is an assumption, and am divided about how consequentialist an ASI might be. Training a non-consequentialist ASI seems easier, and the way we train them seems to actually be optimizing against deep consequentialism (they’re rewarded for getting better with each incremental step, not for something that might only be better 100 steps in advance). But, on the other hand, humans don’t seem to have been heavily optimized for this either*, yet we’re capable of forming multi-decade plans (even if sometimes poorly).
*Actually, the Machiavellian Intelligence Hypothesis does seem to be optimizing consequentialist reasoning (if I attack Person A, how will Person B react, etc.)