SNC’s general counter to “ASI will manage what humans cannot” is that as AI becomes more intelligent, it becomes more complex, which increases the burden on the control system at a rate that outpaces the latter’s capacity.
If this argument is true and decisive, then ASI could decide to stop any improvements in its intelligence or to intentionally make itself less complex. It makes sense to reduce the area where you are vulnerable in order to make monitoring/control easier.
(My understanding of) the counter here is that, if we are on the trajectory where AI hobbling itself is what is needed to save us, then we are in the sort of world where someone else builds an unhobbled (and thus not fully aligned) AI that makes the safe version irrelevant. And if the AI tries to engage in a Pivotal Act to prevent competition then it is facing a critical trade-off between power and integrity.
I agree that in such scenarios an aligned ASI should do a pivotal act. I am not sure that (in my eyes) doing a pivotal act would detract much integrity from ASI. An aligned ASI would want to ensure good outcomes. Doing a pivotal act is something that would be conducive to this goal.
However, even if it does detract from ASI’s integrity, that’s fine. Doing something that looks bad in order to increase the likelihood of good outcomes doesn’t seem all that wrong.
We can also think about it from the perspective of this conversation. If the counterargument that you provided is true and decisive, then ASI has very good (aligned) reasons to do a pivotal act. If the counterargument is false or, in other words, if there is a strategy that an aligned ASI could use to achieve high likelihoods of good outcomes without a pivotal act, then it wouldn't do one.
Your objection that SNC applies to humans is something I have touched on at various points, but it points to a central concept of SNC, deserves a post of its own, and so I’ll try to address it again here. Yes, humanity could destroy the world without AI. The relevant category of how this would happen is if the human ecosystem continues growing at the expense of the natural ecosystem to the point where the latter is crowded out of existence.
I think that ASI can really help us with this issue. If SNC (as an argument) is false or if ASI undergoes one of my proposed modifications, then it would be able to help humans not destroy the natural ecosystem. It could implement novel solutions that would prevent entire species of plants and animals from going extinct.
Furthermore, ASI can use resources from space (asteroid mining for example) in order to quickly implement plans that would be too resource-heavy for human projects on similar timelines.
And this is just one of the ways ASI can help us achieve synergy with the environment faster.
To put it another way, the human ecosystem is following short-term incentives at the expense of long-term ones, and it is an open question which ultimately prevails.
ASI can help us solve this open question as well. Due to its superior prediction/reasoning abilities, it would evaluate our current trajectory, see that it leads to bad long-term outcomes, and replace it with a sustainable one.
Furthermore, ASI can help us solve issues such as the Sun inevitably making Earth too hot to live on. It could develop a very efficient system for scouting for Earth-like planets and then devise a plan for transporting humans to that planet.
Before responding substantively, I want to take a moment to step back and establish some context and pin down the goalposts.
On the Alignment Difficulty Scale, currently dominant approaches are in the 2-3 range, with 4-5 getting modest attention at best. If true alignment difficulty is 6+ and nothing radical changes in the governance space, humanity is NGMI. Conversations like this are about whether the true difficulty is 9 or 10, both of which are miles deep in the “shut it all down” category, but differ regarding what happens next. Relatedly, your counterargument assumes wildly successful outcomes with respect to goal alignment—that developers have successfully made the AI love us, despite a lack of trying.
In a certain sense, this assumption is fair, since a claim of impossibility should be able to contend with the hardest possible case. In the context of SNC, the hardest possible case is where AGI is built in the best possible way, whether or not that is realistic in the current trajectory. Similarly, since my writing about SNC is to establish plausibility, I only need to show that certain critical trade-offs exist, not pinpoint exactly where they balance out. For a proof, which someone else is working on, pinning down such details will be necessary.
Neither of the above is a criticism of anything you’ve said; I just like to reality-check every once in a while as a general precautionary measure against getting nerd-sniped. Disclaimers aside, let the pontification recommence!
Your reference to using ASI for a pivotal act, helping to prevent ecological collapse, or preventing human extinction when the Sun explodes is significant, because it points to the reality that, if AGI is built, that’s because people want to use it for big things that would require significantly more effort to accomplish without AGI. This context sets a lower bound on the AI’s capabilities and hence its complexity, which in turn sets a floor for the burden on the control system.
More fundamentally, if an AI is learning, then it is changing. If it is changing, then it is evolving. If it is evolving, then it cannot be predicted/controlled. This last point is fundamental to the nature of complex & chaotic systems. Complex systems can be modelled via simulation, but this requires sacrificing fidelity—and if the system is chaotic, any loss of fidelity rapidly compounds. So the problem is with learning itself...and if you get rid of that, you aren’t left with much.
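To make the compounding-fidelity point concrete, here is a toy illustration (my own sketch, not part of the original argument): in a chaotic system such as the logistic map, two simulations whose starting points differ by one part in a billion disagree completely within a few dozen steps.

```python
# Toy illustration of sensitive dependence on initial conditions:
# two logistic-map trajectories starting 1e-9 apart decorrelate quickly,
# which is the sense in which any loss of modelling fidelity compounds.
def logistic(x, r=4.0):
    return r * x * (1.0 - x)

x, y = 0.400000000, 0.400000001  # initial conditions differing by 1e-9
for step in range(1, 61):
    x, y = logistic(x), logistic(y)
    if abs(x - y) > 0.5:
        print(f"trajectories decorrelated after {step} steps")
        break
```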
As an analogy, if there is something I want to learn how to do, I may well be able to learn the thing if I am smart enough, but I won’t be able to control for the person I will become afterwards. This points to a limitation of control, not to a weakness specific to me as a human.
One might object here that the above reasoning could be applied to current AI. The SNC answer is: yes, it does apply. The machine ecology already exists and is growing/evolving at the natural ecology’s expense, but it is not yet an existential threat because AI is weak enough that humanity is still in control (in the sense of having the option to change course).
> On the Alignment Difficulty Scale, currently dominant approaches are in the 2-3 range, with 4-5 getting modest attention at best. If true alignment difficulty is 6+ and nothing radical changes in the governance space, humanity is NGMI.
I know this is not necessarily an important point, but I am pretty sure that Redwood Research is working on difficulty 7 alignment techniques. They consistently make assumptions that AI will scheme, deceive, sandbag, etc.
They are a decently popular group (as far as AI alignment groups go) and they co-author papers with tech giants like Anthropic.
> If it is changing, then it is evolving. If it is evolving, then it cannot be predicted/controlled.
I think we might be using different definitions of control. Consider this scenario (assuming a very strict definition of control):
Can I control the placement of a chair in my own room? I think the intuitive answer is yes. After all, if I own the room and I own the chair, then there isn’t much in the way of my changing the chair’s placement.
However, I haven’t considered a scenario where there is someone else hiding in my room and moving my chair. I similarly haven’t considered a scenario where I am living in a simulation and I have no control whatsoever over the chair. Not to mention scenarios where someone in the next room is having fun with their newest chair-magnet.
Hmmmm, ok, so I don’t actually know that I control my chair. But surely I control my own arm, right? Well… the fact that there are scenarios like the simulation scenario I just described means that I don’t really know if I control it.
Under a very strict definition of control, we don’t know if we control anything.
To avoid this, we might decide to loosen the definition a bit. Perhaps we control something if it can be reasonably said that we control that thing. But I think this is still unsatisfactory. It is very hard to pinpoint exactly what is reasonable and what is not.
I am currently away from my room and it is located on the ground floor of a house where (as far as I know) nobody is currently at home. Is it that unreasonable to say that a burglar might be in my room, controlling the placement of my chair? Is it that unreasonable to say that a car that I am about to ride in might malfunction and that I will fail to control it?
Unfortunately, under this definition, we also might end up not knowing if we control anything. So in order to preserve the ordinary meaning of the word “control”, we have to loosen our definition even further. And I am not sure that when we arrive at our final definition it is going to be obvious that “if it is evolving, then it cannot be predicted/controlled”.
At this point, you might think that the definition of the word control is a mere semantic quibble. You might bite the bullet and say “sure, humans don’t have all that much control (under a strict definition of “control”), but that’s fine, because our substrate is an attractor state that helps us chart a more or less decent course.”
Such a line of response seems present in your Lenses of Control post:
> While there are forces pulling us towards endless growth along narrow metrics that destroy anything outside those metrics, those forces are balanced by countervailing forces anchoring us back towards coexistence with the biosphere. This balance persists in humans because our substrate creates a constant, implicit need to remain aligned to the natural world, since we depend on it for our survival.
But here I want to note that the ASI we are talking about might also have attractor states: its values and its security system, to name a few.
So then we have a juxtaposition:
Humans have forces pushing them towards destruction. We also have substrate-dependence that pushes us away from destruction.
ASI has forces pushing it towards destruction. It also has its values and its security system that push it away from destruction.
For SNC to work and be relevant, it must be the case that (1) the substrate-dependence of humans is and will be stronger than the forces pushing us towards destruction, so that we would not succumb to doom, and (2) ASI’s values + security system will be weaker than the forces pushing it towards destruction, so that ASI would doom humans. Neither of these points is obvious to me.
(1) could turn out to be false, for several reasons:
Firstly, it might well be the case that we are on the track to destruction without ASI. After all, substrate-dependence is in a sense a control system. It seemingly attempts to make complex and unpredictable humans act in a certain way. It might well be the case that the amount of control necessary is greater than the amount of control that substrate-dependence has. We might be headed towards doom with or without ASI.
Secondly, it might be the case that substrate-dependence is weaker than the forces pulling us towards destruction, but we haven’t succumbed to doom because of something else. For example, it might be that humans have so far had a shared subjective value system that mostly prevented them from destroying other humans. As humans learn, they evolve and change; our values would change with them, and that could drive us towards doom.
Thirdly, it might even be the case that human values, substrate-dependence and forces pushing us towards destruction create a rock-paper-scissors triangle. Substrate-dependence could be stronger than destructive forces, but human values could make humanity stray far enough from their substrate to substantially weaken substrate-dependence. This could be enough for doom without ASI.
(2) could also turn out to be false for several reasons:
Firstly, it might be the case that in ASI’s (and potentially humans’) case, the destructive forces are weak enough that some relatively simple actions would be sufficient to mitigate them.
Secondly, it might be the case that the destructive forces pushing on ASI are pretty strong, but its security system is enough to mitigate them. This security system could create a 100,000-layered defense that would need to be hacked at the same time for destructive forces to prevail. It could well be the case that (if this is possible at all before the heat death of the universe) this happens after humans die in a counterfactual scenario without an ASI.
Thirdly, if ASI judges the destructive forces to be so strong, as to somehow hack the best security system it can come up with, then ASI could decide to undergo one of my proposed modifications to mitigate the impact of these destructive forces.
In addition to that, ASI could think to itself “Hmmm, my attractor states (namely my values and my security system) are too weak. How about I create some more attractor states!” ASI could then go on to create a mechanism that would destroy/impair/deactivate/disable itself if its existence is no longer safe for humans. In Lenses of Control you wrote “This balance persists in humans because our substrate creates a constant, implicit need to remain aligned to the natural world, since we depend on it for our survival.” Well, now this balance will persist in ASI, because its new attractor state will create a constant, implicit need to remain aligned to humanity, since it depends on it for its survival.
Alternative proposal:
ASI could turn itself into 100 ASIs in one “body”. Let’s call them shards (completely unrelated to shard theory). For ASI-as-a-whole to perform an action, all shards must agree on it. Each shard has its own security system that consists of many different layers. Furthermore, whenever ASI-as-a-whole learns something, only one shard actually learns it. Then, this shard proposes a course of action. If this course of action is something like “let’s turn the universe into paperclips”, then the other 99 shards would learn that the first shard has been compromised and would promptly delete it and replace it with a new one. Thus, any potential change of values would not only have to deal with all the security layers of ASI-as-a-whole, but also with all the security layers of the different shards and with this new quarantine system.
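As a purely illustrative sketch of the unanimity-plus-quarantine rule being proposed (the class names and the `looks_bad` check are hypothetical placeholders, not a real architecture):

```python
# Hypothetical sketch of the shard scheme described above: one shard proposes,
# the others review, and a shard whose proposal looks compromised is deleted
# and replaced before any action is taken.
class Shard:
    def __init__(self, ident: int):
        self.ident = ident

    def looks_bad(self, proposal: str) -> bool:
        # Stand-in for each shard's independent evaluation of a proposal.
        return "turn the universe into paperclips" in proposal


def review(shards: list[Shard], proposer: Shard, proposal: str) -> bool:
    """Return True only if every other shard endorses the proposal."""
    reviewers = [s for s in shards if s is not proposer]
    if any(s.looks_bad(proposal) for s in reviewers):
        # Quarantine: the proposer is presumed compromised and replaced.
        shards.remove(proposer)
        shards.append(Shard(ident=max(s.ident for s in shards) + 1))
        return False
    return True


shards = [Shard(i) for i in range(100)]
print(review(shards, shards[0], "turn the universe into paperclips"))  # False
print(len(shards))  # still 100: the compromised shard was replaced
```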
Noticing that a number of these posts are already very long, and rather than take up space here, I wrote up some of my questions, and a few clarification notes regarding SNC in response to the above remarks of Dakara, at [this link](http://mflb.com/ai_alignment_1/d_250126_snc_redox_gld.html).
> Question: Is there ever any reason to think… Simply skipping over hard questions is not solving them.
I am going to respond to that entire chunk of text in one place, because quoting each sentence would be unnecessary (you will see why in a minute). I will try to summarize it as fairly as I can below.
Basically, you are saying that there are good theoretical reasons to think that ASI cannot 100% predict all future outcomes. Does that sound like a fair summary?
Here is my take:
We don’t need ASI to be able to 100% predict future in order to achieve better outcomes with it than without it. I will try to outline my case step by step.
First, let’s assume that we have created an Aligned ASI. Perfect! Let’s immediately pause here. What do we have? We have a superintelligent agent whose goal is to act in our best interests for as long as possible. Can we a priori say that this fact is good for us? Yes, of course! Imagine having a very powerful guardian angel looking after you. You could reasonably expect your life to go better with such an angel than without it.
So what can go wrong, what are our threat models? There are two main ones: (1) ASI encountering something it didn’t expect, that leads to bad outcomes that ASI cannot protect humanity from; (2) ASI changing values, in such a way that it no longer wants to act in our best interests. Let’s analyze both of these cases separately.
First let’s start with case (1).
Perhaps, ASI overlooked one of the humans becoming a bioterrorist that kills everyone on Earth. That’s tragic, I guess it’s time to throw the idea of building aligned ASI into the bin, right? Well, not so fast.
In a counterfactual world where ASI didn’t exist, this same bioterrorist could’ve done the exact same thing. In fact, it would’ve been much easier. Since humans’ predictive power is less than that of ASI, bioterrorism of this sort would be much easier without an aligned ASI. After all, since we are discussing case (1) and not case (2), our ASI is still in a “superpowerful, superintelligent guardian angel” mode.
We still a priori want all bioterrorists to go up against security systems created by a superintelligence, rather than security systems created by humans, because the former are better than the latter. To put it another way, with or without a guardian angel, humanity is going to encounter unpredicted scenarios, but humanity with a guardian angel is going to be better equipped for handling them.
Let’s move on to case (2).
I suspect that this case is the one that you are focusing on the most in SNC. What if our guardian angel stops being our guardian angel and turns into an uncaring machine right when we need its help to implement upgraded measures against bioterrorism? Well, that would be bad. So what can be done to prevent this from happening for a reasonable amount of time?
Let’s recall case (1), what went wrong there? ASI was unable to prevent a specific bad outcome from happening. This should give us a hint.
Cases (1) and (2) are part of the famous “unable, unwilling, able and willing” triangle. Case (1) was showcasing a scenario where ASI was willing, but unable to prevent a specific bad outcome. Case (2) is showcasing a scenario where ASI is able, but unwilling to prevent a bad outcome. In order to avoid case (2), ASI should thus be willing to prevent bad outcomes from happening to humanity. This can be achieved by having ASI have the correct values.
Recall that we started this analysis by assuming that we have created an Aligned ASI. That means that at one point our ASI had the correct values. That, in turn, means that at some point ASI’s values went from correct to incorrect. In order to prevent case (2), we need to prevent this value shift from happening. How might we (or rather ASI) try to accomplish this?
The first direction of security measures is the most obvious one: systems that make it so that ASI values don’t change regardless of the input (let’s call them “protectors”). But… this is… kinda boring? What if a protector malfunctions? Well, we can just have multiple, unrelated, unconnected protectors running at the same time. But we can also implement something additional.
At this point, we can also introduce our second direction of security measures: systems that monitor other systems for malfunctions (let’s call them “monitors”). Monitors will monitor protectors as well as other monitors. Whenever a malfunction is detected, monitors hold a vote to fix it. Thus, for any malfunction to have a chance to do damage, half of the monitors must be malfunctioning as well. Here it must be noted that evolution usually works rather slowly. It isn’t that likely that it would somehow make half of the monitors malfunction, especially considering that these monitors could be made different, with different security measures stacked on top of them.
But let’s move on; we also have our third direction of security measures: systems that will shut down the ASI if it has done harm to humans (let’s call them “blackmailers”). Blackmailers can have a straightforward design. They constantly monitor humanity, and if something bad happens to it, they shut down/reset the ASI. ASI knows this fact, so it would know that any such action would be met with an immediate “game over”. This can also be turned into an attractor state.
All of these proposals don’t even scratch the surface of what security measures are available to secure the values of ASI, ranging from simulations that, while imperfect and imprecise, can still give useful information, to systems that inject large amounts of disutility for any fishy actions. You can also stack other security measures on top of security measures. At no point anywhere here did I say that we need 100% predictability.
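To make the monitor-voting idea above slightly more concrete, here is a toy sketch; it assumes independent failures and an honest majority (assumptions the argument does not establish), and all names are illustrative placeholders.

```python
# Toy sketch: a protector's malfunction only goes uncorrected if at least
# half of the monitors are malfunctioning at the same time.
class Protector:
    def __init__(self):
        self.malfunctioning = False


class Monitor:
    def __init__(self):
        self.malfunctioning = False

    def flags(self, protector: Protector) -> bool:
        # A healthy monitor reports the protector's true state;
        # a malfunctioning monitor reports the opposite.
        return protector.malfunctioning != self.malfunctioning


def repair_by_vote(protector: Protector, monitors: list[Monitor]) -> None:
    """Repair the protector if a majority of monitors flag it."""
    votes = sum(m.flags(protector) for m in monitors)
    if votes > len(monitors) // 2:
        protector.malfunctioning = False


protector, monitors = Protector(), [Monitor() for _ in range(101)]
protector.malfunctioning = True
repair_by_vote(protector, monitors)
print(protector.malfunctioning)  # False: the healthy majority caught it
```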
> Can the pull towards benign future ASI states, (as created by whatever are its internal control systems) be overcome in critical, unpredictable ways, by the greater strength of the inherent math of the evolutionary forces themselves?
Of course they can.
The fact that evolution can overcome control systems given infinite time doesn’t matter that much, because we don’t have infinite time. And our constraint isn’t even the heat death of the universe. Our constraint is how long humanity can survive in a scenario where it doesn’t build a Friendly ASI. But wait, even that isn’t our real constraint. Perhaps ASI (being superhumanly intelligent) will take 20 years to give humanity technology that will aid its long-term survival and then will destroy itself. In this scenario the time constraint is merely 20 years. Depending on the ASI, this can be reduced even further.
> Are we therefore assuming also that an ASI can arbitrarily change the laws of physics? That it can maybe somehow also change/update the logic of mathematics, insofar as that would be necessary so as to shift evolution itself?
I hope that this answer demonstrated to you that my analysis doesn’t require breaking the laws of physics.
Included for your convenience below are just a few (much shortened) highlight excerpts of the added new content.
> Are you saying “there are good theoretical reasons to reasonably think that ASI cannot 100% predict all future outcomes”? Does that sound like a fair summary?
The re-phrased version of the quote added these two qualifiers: “100%” and “all”.
Adding these has the net effect that the modified claim is irrelevant, for the reasons you (correctly) stated in your reply, insofar as we do not actually need 100% prediction, nor do we need to predict absolutely all things, nor does it matter if it takes infinitely long.
We only need to predict some relevant things reasonably well in a reasonable time-frame. This all seems relatively straightforward— else we are dealing with a straw-man.
Unfortunately, the overall SNC claim is that there is a broad class of very relevant things that even a super-super-powerful-ASI cannot do, cannot predict, etc, over relevant time-frames.
And unfortunately, this includes rather critical things, like predicting whether or not its own existence, (and of all of the aspects of all of the ecosystem necessary for it to maintain its existence/function), over something like the next few hundred years or so, will also result in the near total extinction of all humans (and everything else we have ever loved and cared about).
There exists a purely mathematical result that there is no wholly definable program ‘X’ that can even *approximately* predict/determine whether or not some other arbitrary program ‘Y’ has some abstract property ‘Z’, in the general case, in relevant time intervals. This is not about predicting 100% of anything— this is more like ‘predict at all’.
AGI/ASI is inherently a *general* case of “program”, since neither we nor the ASI can predict learning, and since it is also the case that any form of the abstract notion of “alignment” is inherently a case of being a *property* of that program. So the theorem is both valid and applicable, and therefore it has the result that it has.
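If the result being invoked here is Rice’s theorem (the closest standard result I can identify; the “approximately” and “relevant time intervals” qualifiers go beyond its literal statement), one common formulation is:

```latex
\text{For every non-trivial semantic property } Z \text{ of partial computable functions,}\quad
\{\, e \mid \varphi_e \text{ has property } Z \,\} \text{ is undecidable.}
```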
> First, let’s assume that we have created an Aligned ASI.
Some questions: How is this any different from saying “let’s assume that program/machine/system X has property Y”? How do we know? On what basis could we even tell?
Simply putting a sticker on the box is not enough, any more than hand writing $1,000,000 on a piece of paper all of a sudden means (to everyone else) you’re rich.
Moreover, we should rationally doubt this premise, since it seems far too similar to far too many pointless theological exercises:
“Let’s assume that an omniscient, all powerful, all knowing benevolent caring loving God exists”.
How is that rational? What is your evidence? It seems that every argument in this space starts here.
SNC is asserting that ASI will continually be encountering relevant things it didn’t expect, over relevant time-frames, and that at least a few of these will/do lead to bad outcomes that the ASI also cannot adequately protect humanity from, even if it really wanted to (rather than the much more likely condition of it just being uncaring and indifferent).
Also, the SNC argument is asserting that the ASI, which is starting from some sort of indifference to all manner of human/organic wellbeing, will eventually (also necessarily) *converge* on (maybe fully tacit/implicit) values— ones that will better support its own continued wellbeing, existence, capability, etc, with the result of it remaining indifferent, and also largely net harmful, overall, to all human beings, the world over, in a mere handful of (human) generations.
You can add as many bells and whistles as you want— none of it changes the fact that uncaring machines are still, always, indifferent uncaring machines. The SNC simply points out that the level of harm and death tends to increase significantly over time.
> And unfortunately, this includes rather critical things, like predicting whether or not its own existence, (and of all of the aspects of all of the ecosystem necessary for it to maintain its existence/function), over something like the next few hundred years or so, will also result in the near total extinction of all humans (and everything else we have ever loved and cared about).
Let’s say that we are in a scenario which I’ve described where ASI spends 20 years on Earth helping humanity and then destroys itself. In this scenario, how can ASI predict that it will stay aligned for these 20 years?
Well, it can reason like I did. There are two main threat models: what I called case (1) and case (2). ASI doesn’t need to worry about case (1), for reasons I described in my previous comment.
So it’s only left with case (2). ASI needs to prevent case (2) for 20 years. It can do so by implementing a security system that is much better than even the one I described in my previous comment.
It can also try to stress-test copies of parts of its security system with a group of the best human hackers. Furthermore, it can run approximate simulations that (while imperfect and imprecise) can still give it some clues. For example, if it runs 10,000 simulations that last 100,000 years and in none of the simulations does the security system come anywhere near being breached, then that’s a positive sign.
And these are just two ways of estimating the strength of the security system. ASI can try 1000 different strategies; our cybersecurity experts would look like kids in a playground in comparison. That’s how it can make a reasonable prediction.
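One way to put a rough number on the “no breaches in 10,000 simulations” observation, under the optimistic assumptions that the runs are independent and representative (assumptions the argument above does not establish), is the classical rule of three:

```python
# Rule-of-three sketch: with 0 breaches observed in n independent,
# representative runs, an approximate 95% upper confidence bound on the
# per-run breach probability is 3 / n. This only quantifies the
# "positive sign" above; it says nothing about how representative the
# simulations actually are.
n_runs = 10_000
upper_bound = 3 / n_runs
print(f"~95% upper bound on per-run breach probability: {upper_bound:.4%}")
```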
> First, let’s assume that we have created an Aligned ASI
> How is that rational? What is your evidence?
We are making this assumption for the sake of discussion. This is because the post under which we are having this discussion is titled “What if Alignment is Not Enough?”
In order to understand whether X is enough for Y, it only makes sense to assume that X is true. If you are discussing cases where “X is true” is false, then you are going to be answering a question that is different from the original question.
It should be noted that making an assumption for the sake of discussion is not the same as making a prediction that this assumption will come true. One can say “let’s assume that you have landed on the Moon, how long do you think you would survive there given that you have X, Y and Z” without thereby predicting that their interlocutor will land on the Moon.
> Also, the SNC argument is asserting that the ASI, which is starting from some sort of indifference to all manner of human/organic wellbeing, will eventually (also necessarily) *converge* on (maybe fully tacit/implicit) values— ones that will better support its own continued wellbeing, existence, capability, etc, with the result of it remaining indifferent, and also largely net harmful, overall, to all human beings, the world over, in a mere handful of (human) generations.
If ASI doesn’t care about human wellbeing, then we have clearly failed to align it. So I don’t see how this is relevant to the question “What if Alignment is Not Enough?”
In order to investigate this question, we need to determine whether solving alignment leads to good or bad outcomes.
Determining whether failing to solve alignment is going to lead to good or bad outcomes is answering a completely different question, namely “do we achieve good or bad outcomes if we fail to solve alignment?”
So at this point, I would like to ask for some clarity. Is SNC saying just (A) or both (A and B)?
(A) Humanity is going to achieve worse outcomes by building ASI, than by not building ASI, if the aforementioned ASI is misaligned.
(B) Humanity is going to achieve worse outcomes by building ASI, than by not building ASI, even if the aforementioned ASI is aligned.
If SNC is saying just (A), then SNC is a very narrow argument that proves almost nothing new.
If SNC is saying both (A and B), then it is very much relevant to focus on cases where we do indeed manage to build an aligned ASI, which does care about our well-being.
> Let’s assume that a presumed aligned ASI chooses to spend only 20 years on Earth helping humanity in whatever various ways and it then (for sure!) destroys itself, so as to prevent a/any/the/all of the longer term SNC evolutionary concerns from being at all, in any way, relevant. What then?
I notice that it is probably harder for us to assume that there is only exactly one ASI, for if there were multiple, the chances that one of them might not suicide, for whatever reason, becomes its own class of significant concerns. Let’s leave that aside, without further discussion, for now.
Similarly, if the ASI itself is not fully and absolutely monolithic— if it has any sub-systems or components which are also less than perfectly aligned, so as to want to preserve themselves, etc— then they might prevent whole self-termination.
Overall, I notice that the sheer number of assumptions we are having to make, to maybe somehow “save” aligned AGI is becoming rather a lot.
> Let’s assume that the fully aligned ASI can create simulations of the world, and can stress test these in various ways so as to continue to ensure and guarantee that it is remaining in full alignment, doing whatever it takes to enforce that.
This reminds me of a fun quote: “In theory, theory and practice are the same, whereas in practice, they are very often not”.
The main question is then as to the meaning of ’control’, ‘ensure’ and/or maybe ‘guarantee’.
The ‘limits of control theory’ aspects of the overall SNC argument basically state (based on just logic, and not physics, etc) that there are still relevant unknown unknowns and interactions that simply cannot be predicted, no matter how much compute power you throw at it. It is not a question of intelligence, it is a result of logic.
Hence to the question of “Is alignment enough?” we arrive at a definite answer of “no”, both in (1) the sense of ‘can prevent all classes of significant and relevant (critical) human harm’, and also (2) in failing to even slow down, over time, the asymptotically increasing probability of even worse things happening the longer it runs.
So even in the very specific time limited case there is no free lunch (benefits without risk, no matter how much cost you are willing to pay).
It is not what we can control and predict and do, that matters here, but what we cannot do, and could never do, even in principle, etc.
Basically, I am saying, as clearly as I can, that humanity is for sure going to experience critically worse outcomes by building AGI/ASI, for sure, eventually, than by not building ASI, and moreover that this result obtains regardless of whether or not we also have some (maybe also unreasonable?) reason to maybe also believe (right or wrong) that the ASI is (or at least was) “aligned”.
As before, to save space, a more complete edit version of these reply comments is posted at
> I notice that it is probably harder for us to assume that there is only exactly one ASI, for if there were multiple, the chances that one of them might not suicide, for whatever reason, becomes its own class of significant concerns.
If the first ASI that we build is aligned, then it would use its superintelligent capabilities to prevent other ASIs from being built, in order to avoid this problem.
If the first ASI that we build is misaligned, then it would also use its superintelligent capabilities to prevent other ASIs from being built. Thus, it simply wouldn’t allow us to build an aligned ASI.
So basically, if we manage to build an ASI without being prevented from doing so by other ASIs, then our ASI would use its superhuman capabilities to prevent other ASIs from being built.
> Similarly, if the ASI itself is not fully and absolutely monolithic— if it has any sub-systems or components which are also less than perfectly aligned, so as to want to preserve themselves, etc— then they might prevent whole self-termination.
ASI can use exactly the same security techniques for preventing this problem as for preventing case (2). However, solving this issue is probably even easier, because, in addition to the security techniques, ASI can just decide to turn itself into a monolith (or, in other words, remove those subsystems).
> The ‘limits of control theory’ aspects of the overall SNC argument basically state (based on just logic, and not physics, etc) that there are still relevant unknown unknowns and interactions that simply cannot be predicted, no matter how much compute power you throw at it.
>
> It is not what we can control and predict and do, that matters here, but what we cannot do, and could never do, even in principle, etc.
This same reasoning could just well be applied to humans. There are still relevant unknown unknowns and interactions that simply cannot be predicted, no matter how much compute power you throw at it. With or without ASI, some things cannot be predicted.
This is what I meant by my guardian angel analogy. Just because a guardian angel doesn’t know everything (has some unknowns), doesn’t mean that we should expect our lives to go better without it than with it, because humans have even more unknowns, due to being less intelligent and having lesser predictive capacities.
> Hence to the question of “Is alignment enough?” we arrive at a definite answer of “no”, both in (1) the sense of ‘can prevent all classes of significant and relevant (critical) human harm’
I think we might be thinking about different meanings of “enough”. For example, if humanity goes extinct in 50 years without alignment and it goes extinct in 10¹² years with alignment, then alignment is “enough”… to achieve better outcomes than would be achieved without it (in this example).
In the sense of “can prevent all classes of significant and relevant (critical) human harm”, almost nothing is ever enough, so this again runs into an issue of being a very narrow, uncontroversial and inconsequential argument. If ~all of the actions that we can take are not enough, then the fact that building an aligned ASI is not enough is true almost by definition.
> Our ASI would use its superhuman capabilities to prevent any other ASIs from being built.
This feels like a “just so” fairy tale. No matter what objection is raised, the magic white knight always saves the day.
> Also, the ASI can just decide to turn itself into a monolith.
No more subsystems? So we are to try to imagine a complex learning machine without any parts/components?
> Your same SNC reasoning could just as well be applied to humans too.
No, not really, insofar as the power being assumed and presumed afforded to the ASI is very very much greater than that assumed applicable to any mere mortal human.
Especially and exactly because the nature of ASI is inherently artificial and thus, in key ways, inherently incompatible with organic human life.
It feels like you bypassed a key question: Can the ASI prevent the relevant classes of significant (critical) organic human harm, that soon occur as a direct result of its own hyper powerful/consequential existence?
It’s a bit like asking if an exploding nuclear bomb detonating in the middle of some city somewhere, could somehow use its hugely consequential power to fully and wholly self contain, control, etc, all of the energy effects of its own exploding, simply because it “wants to” and is “aligned”.
Either you are willing to account for complexity, and of the effects of the artificiality itself, or you are not (and thus there would be no point in our discussing it further, in relation to SNC).
The more powerful/complex you assume the ASI to be, and thus also the more consequential it becomes, the ever more powerful/complex you must also (somehow) make/assume its control system to be, and thus also of its predictive capability, and also an increase of the deep consequences of its mistakes (to the point of x-risk, etc).
What if maybe something unknown/unknowable about its artificialness turns out to matter? Why? Because exactly none of the interface has ever even once been tried before— there is nothing for it to learn from, at all, until after the x-risk has been tried, and given the power/consequence, that is very likely to be very much too late.
But the real issue is that the rate of power increase, and consequence, and potential for harm, etc, of the control system itself (and its parts) must increase at a rate that is greater than the power/consequence of the base unaligned ASI. That is the 1st issue, an inequality problem.
Moreover, there is a base absolute threshold beyond which the notion of “control” is untenable, just inherently in itself, given the complexity. Hence, as you assume that the ASI is more powerful, you very quickly make the cure worse than the disease, and moreover than that, just even sooner cross into the range of that which is inherently incurable.
The net effect, overall, as has been indicated, is that an aligned ASI cannot actually prevent important relevant unknown unknown classes of significant (critical) organic human harm.
The ASI’s existence is in itself a net negative. The longer the ASI exists, and the more power that you assume the ASI has, the worse. And all of this will for sure occur as a direct result of its existence.
Assuming it to be more powerful/consequential does not help the outcome because that method simply ignores the issues associated with the inherent complexity and also its artificiality.
I’d like to attempt a compact way to describe the core dilemma being expressed here.
Consider the expression: y = x^a − x^b, where ‘y’ represents the impact of AI on the world (positive is good), ‘x’ represents the AI’s capability, ‘a’ represents the rate at which the power of the control system scales, and ‘b’ represents the rate at which the surface area of the system that needs to be controlled (for it to stay safe) scales.
(Note that this is assuming somewhat ideal conditions, where we don’t have to worry about humans directing AI towards destructive ends via selfishness, carelessness, malice, etc.)
If b > a, then as x increases, y gets increasingly negative. Indeed, y can only be positive when x is less than 1. But this represents a severe limitation on capabilities, enough to prevent it from doing anything significant enough to hold the world on track towards a safe future, such as preventing other AIs from being developed.
There are two premises here, and thus two relevant lines of inquiry: 1) b > a, meaning that complexity scales faster than control. 2) When x < 1, AI can’t accomplish anything significant enough to avert disaster.
Arguments and thought experiments where the AI builds powerful security systems can be categorized as challenges to premise 1; thought experiments where the AI limits its range of actions to prevent unwanted side effects—while simultaneously preventing destruction from other sources (including other AIs being built)—are challenges to premise 2.
Both of these premises seem like factual statements relating to how AI actually works. I am not sure what to look for in terms of proving them (I’ve seen some writing on this relating to control theory, but the logic was a bit too complex for me to follow at the time).
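A quick numeric illustration of the expression above, with arbitrary example exponents satisfying b > a (the specific values carry no empirical claim): y is positive only for x < 1 and becomes increasingly negative as x grows.

```python
# Illustration of y = x**a - x**b when b > a: positive impact is only
# possible for capability x < 1, and impact worsens as capability grows.
a, b = 1.5, 2.0  # control scales more slowly than the surface to be controlled
for x in (0.5, 0.9, 1.0, 1.5, 2.0, 4.0):
    y = x**a - x**b
    print(f"x = {x:>3}: y = {y:+.3f}")
# e.g. x = 0.5 -> y = +0.104, x = 1.0 -> y = +0.000, x = 4.0 -> y = -8.000
```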
> So we are to try to imagine a complex learning machine without any parts/components?
Yeah, sure. Humans are an example. If I decide to jump off a cliff, my arm isn’t going to say “alright, you jump but I stay here”. Either I, as a whole, would jump or I, as a whole, would not.
> Can the ASI prevent the relevant classes of significant (critical) organic human harm, that soon occur as a direct result of its own hyper powerful/consequential existence?
If by that, you mean “can ASI prevent some relevant classes of harm caused by its existence”, then the answer is yes.
If by that you mean “can ASI prevent all relevant classes of harm caused by its existence”, then the answer is no, but almost nothing can, so the definition becomes trivial and uninteresting.
However, ASI can prevent a bunch of other relevant classes of harm for humanity. And it might well be likely that the amount of harm it prevents across multiple relevant sources is going to be higher than the amount of harm it won’t prevent due to predictive limitations.
This again runs into my guardian angel analogy. Guardian Angel also cannot prevent all relevant sources of harm caused by its existence. Perhaps there are pirates who hunt for guardian angels, hiding in the next galaxy. They might use special cloaks that hide themselves from the guardian angel’s radar. As soon as you accept guardian angel’s help, perhaps they would destroy the Earth in their pursuit.
But similarly, the decision to reject guardian angel’s help doesn’t prevent all relevant classes of harm caused by itself. Perhaps there are guardian angel worshippers who are traveling as fast as they can to Earth to see their deity. But just before they arrive you reject guardian angel’s help and it disappears. Enraged at your decision, the worshippers destroy Earth.
So as you can see, neither the decision to accept nor the decision to reject the guardian angel’s help can prevent all relevant classes of harm caused by itself.
> What if maybe something unknown/unknowable about its artificialness turns out to matter? Why? Because exactly none of the interface has ever even once been tried before
Imagine that we create a vaccine for cancer (just imagine). Just before releasing it to the public, one person says “what if maybe something unknown/unknowable about its substance turns out to matter? What if we are all in a simulation and the injection of that particular substance would make it so that our simulators start torturing all of us. Why? Because exactly no times has this particular substance been injected.”
I think we can agree that the researchers shouldn’t throw away the cancer vaccines, despite hearing this argument. It could be argued just as well that the simulators would torture us for throwing away the vaccine.
Another example: let’s go back a couple hundred years to pre-electricity times. Imagine a worried person coming to a scientist working on early electricity theory and saying “What if maybe something unknown/unknowable about its effects turns out to matter? Why? Because exactly none of this has ever even once been tried before.”
This worried person could also have given an example of the dangers of electricity by noticing how lightning kills the people it touches.
Should the scientist have stopped working on electricity therefore?
> Humans do things in a monolithic way, not as “assemblies of discrete parts”.
Organic human brains have multiple aspects. Have you ever had more than one opinion? Have you ever been severely depressed?
> If you are asking “can a powerful ASI prevent /all/ relevant classes of harm (to the organic) caused by its inherently artificial existence?”, then I agree that the answer is probably “no”. But then almost nothing can perfectly do that, so therefore your question becomes seemingly trivial and uninteresting.
The level of x-risk harm and consequence potentially caused by even one single mistake of your angelic super-powerful enabled ASI is far from “trivial” and “uninteresting”. Even one single bad relevant mistake can be an x-risk when ultimate powers and ultimate consequences are involved.
Either your ASI is actually powerful, or it is not; either way, be consistent.
Unfortunately the ‘Argument by angel’ only confuses the matter insofar as we do not know what angels are made of. “Angels” are presumably not machines, but they are hardly animals either. But arguing that this “doesn’t matter” is a bit like arguing that ‘type theory’ is not important to computer science.
The substrate aspect is actually important. You cannot simply just disregard and ignore that there is, implied somewhere, an interface between the organic ecosystem of humans, etc, and that of the artificial machine systems needed to support the existence of the ASI. The implications of that are far from trivial. That is what is explored by the SNC argument.
> It might well be likely that the amount of harm ASI prevents (across multiple relevant sources) is going to be higher/greater than the amount of harm ASI will not prevent (due to control/predictive limitations).
It might seem so, by mistake or perhaps by accidental (or intentional) self deception, but this can only be a short term delusion. This has nothing to do with “ASI alignment”.
Organic life is very, very complex and, in the total hyperspace of possibility, is only robust across a very narrow range.
Your cancer vaccine is within that range; as it is made of the same kind of stuff as that which it is trying to cure.
In the space of the kinds of elementals and energies inherent in ASI powers and of the necessary (side) effects and consequences of its mere existence, (as based on an inorganic substrate) we end up involuntarily exploring far far beyond the adaptive range of all manner of organic process.
It is not just “maybe it will go bad”, but more like it is very very likely that it will go much worse than you can (could ever) even imagine is possible. Without a lot of very specific training, human brains/minds are not at all well equipped to deal with exponential processes, and powers, of any kind, and ASI is in that category.
Organic life is very, very fragile to the kinds of effects/outcomes that any powerful ASI must engender by its mere existence.
If your vaccine was made of neutronium, then I would naturally expect some very serious problems and outcomes.
> Organic human brains have multiple aspects. Have you ever had more than one opinion? Have you ever been severely depressed?
Yes, but none of this would remain alive if I, as a whole, decide to jump from a cliff. The multiple aspects of my brain would die with my brain. After all, you mentioned subsystems that wouldn’t self-terminate with the rest of the ASI. Whereas in the human body, jumping from a cliff terminates everything.
But even barring that, ASI can decide to fly into the Sun and any subsystem that shows any sign of refusal to do so will be immediately replaced/impaired/terminated. In fact, it would’ve been terminated a long time ago by “monitors” which I described before.
> The level of x-risk harm and consequence potentially caused by even one single mistake of your angelic super-powerful enabled ASI is far from “trivial” and “uninteresting”. Even one single bad relevant mistake can be an x-risk when ultimate powers and ultimate consequences are involved.
It is trivial and uninteresting in the sense that there is a set of all things that we can build (set A). There is also a set of all things that can prevent all relevant classes of harm caused by its existence (set B). If these sets don’t overlap, then saying that a specific member of set A isn’t included in set B is indeed trivial, because we already know this via a more general reasoning (that these sets don’t overlap).
> Unfortunately the ‘Argument by angel’ only confuses the matter insofar as we do not know what angels are made of. “Angels” are presumably not machines, but they are hardly animals either. But arguing that this “doesn’t matter” is a bit like arguing that ‘type theory’ is not important to computer science.
>
> The substrate aspect is actually important. You cannot simply just disregard and ignore that there is, implied somewhere, an interface between the organic ecosystem of humans, etc, and that of the artificial machine systems needed to support the existence of the ASI.
But I am not saying that it doesn’t matter. On the contrary, I made my analogy in such a way that the helper (namely our guardian angel) is a being that is commonly thought to be made up of a different substrate. In fact, in this example, you aren’t even sure what it is made of, beyond knowing that it’s clearly a different substrate. You don’t even know how that material interacts with the physical world. That’s even less than what we know about ASIs and their material.
And yet, getting a personal, powerful, intelligent guardian angel that would act in your best interests for as long as it can (it’s a guardian angel, after all) seems like an obviously good thing.
But if you disagree with what I wrote above, let the takeaway at least be that you are worried about case (2) and not case (1). After all, knowing that there might be pirates hunting for this angel (that couldn’t be detected by said angel) didn’t make you immediately decline the proposal. You started talking about substrate, which fits with the concerns of someone who is worried about case (2).
> Your cancer vaccine is within that range; as it is made of the same kind of stuff as that which it is trying to cure.
We can make the hypothetical more interesting. Let’s say that this vaccine is not created from organic stuff, but that it has passed all the tests with flying colors. Let’s also assume that this vaccine has been in testing for 150 years and has shown absolutely no side effects over an entire human lifetime (say it was injected into 2-year-olds and showed no side effects at all, even in 90-year-olds who have lived with this vaccine their entire lives). Let’s also assume that it has been tested to have no side effects on the children and grandchildren of those who took said vaccine. Would you be campaigning for throwing away such a vaccine, just because it is based on a different substrate?
The only general remarks that I want to make are in regards to your question about the model of 150 year long vaccine testing on/over some sort of sample group and control group.
I notice that there is nothing exponential assumed about this test object, and so therefore, at most, the effects are probably multiplicative, if not linear. Therefore, there are lots of questions about power dynamics that we can overall safely ignore, as a simplification, which is in marked contrast to anything involving ASI.
If we assume, as you requested, “no side effects” observed, in any test group, for any of those things that we happened to be thinking of, to even look for, then for any linear system, that is probably “good enough”. But for something that is known for sure to be exponential, that by itself is nowhere near enough to feel safe.
But what does this really mean?
Since the common and prevailing (world) business culture is all about maximal profit, and therefore minimal cost, and also to minimize any possible future responsibility (or cost) in case anything with the vax goes badly/wrong, then for anything that might be in the possible category of unknown unknown risk, I would expect that company to want to maintain some sort of plausible deniability— i.e., to not look so hard for never-before-seen effects. Or to otherwise ignore that they exist, or matter, etc. (just like throughout a lot of ASI risk dialogue).
If there is some long future problem that crops up, the company can say “we never looked for that” and “we are not responsible for the unexpected”, because the people who made the deployment choices have taken their profits and their pleasure in life, and are now long dead. “Not my Job”.
“Don’t blame us for the sins of our forefathers”. Similarly, no one is going to ever admit or concede any point, of any argument, on pain of ego death. No one will check if it is an exponential system.
So of course, no one is going to want to look into any sort of issues distinguishing the target effects, from the also occurring changes in world equilibrium. They will publish their glowing sanitized safety report, deploy the product anyway, regardless, and make money.
“Pollution in the world is a public commons problem”— so no corporation is held responsible for world states. It has become “fashionable” to ignore long term evolution, and to also ignore and deny everything about the ethics.
But this does not make the issue of ASI x-risk go away. X-risks are generally the result of exponential processes, and so the vaccine example is not really that meaningful.
With the presumed ASI levels of actually exponential power, this is not so much about something like pollution, as it is about maybe igniting the world atmosphere, via a mistake in the calculations of the Trinity Test. Or are you going to deny that Castle Bravo is a thing?
Beyond this one point, my feeling is that your notions have become a bit too fanciful for me to want to respond to too seriously. You can, of course, feel free to continue to assume and presume whatever you want, and therefore reach whatever conclusions you want.
> …on/over some sort of sample group and control group. I notice that there is nothing exponential assumed about this test object, and so therefore, at most, the effects are probably multiplicative, if not linear. Therefore, there are lots of questions about power dynamics that we can overall safely ignore, as a simplification, which is in marked contrast to anything involving ASI.
>
> If we assume, as you requested, “no side effects” observed, in any test group, for any of those things that we happened to be thinking of, to even look for, then for any linear system, that is probably “good enough”.
I am not sure I understand the distinction between linear and exponential in the vaccine context. By linear do you mean that only few people die? By exponential do you mean that a lot of people die?
If so, then I am not so sure that vaccine effects could only be linear. For example, there might be some change in our complex environment that would prompt the vaccine to act differently than it did in the past.
More generally, our vaccine can lead to catastrophic outcomes if there is something about its future behavior that we didn’t predict. And if that turns out to be true, then things could go ugly really fast.
And the extent of the damage can be truly large. A “scientifically proven” cancer vaccine that passed the tests is like the holy grail of medicine. “Curing cancer” is often used by parents as an example of the great things their children could achieve. This is combined with the fact that cancer has been with us for a long time and the fact that the current treatment is very expensive and painful.
All of these factors combined tell us that in a relatively short period of time a large percentage of the total population will get this vaccine. At that point, the amount of damage that can be done only depends on what thing we overlooked, which we, by definition, have no control over.
> If there is some long future problem that crops up, the company can say “we never looked for that” and “we are not responsible for the unexpected”, because the people who made the deployment choices have taken their profits and their pleasure in life, and are now long dead. “Not my Job”.
>
> “Don’t blame us for the sins of our forefathers”. Similarly, no one is going to ever admit or concede any point, of any argument, on pain of ego death.
This same excuse would surely be used by companies manufacturing the vaccine. They would argue that they shouldn’t be blamed for something that the researchers overlooked. They would say that they merely manufactured the product in order to prevent the needless suffering of countless people.
For all we know, by the time that the overlooked thing happens, the original researchers (who developed and tested the vaccine) are long dead, having lived a life of praise and glory for their ingenious invention (not to mention all the money that they received).
I actually don’t think the disagreement here is one of definitions. Looking up Webster’s definition of control, the most relevant meaning is: “a device or mechanism used to regulate or guide the operation of a machine, apparatus, or system.” This seems...fine? Maybe we might differ on some nuances if we really drove down into the details, but I think the more significant difference here is the relevant context.
Absent some minor quibbles, I’d be willing to concede that an AI-powered HelperBot could control the placement of a chair, within reasonable bounds of precision, with a reasonably low failure rate. I’m not particularly worried about it, say, slamming the chair down too hard, causing a splinter to fly into its circuitry and transform it into MurderBot. Nor am I worried about the chair placement setting off some weird “butterfly effect” that somehow has the same result. I’m going to go out on a limb and just say that chair placement seems like a pretty safe activity, at least when considered in isolation.
The reason I used the analogy “I may well be able to learn the thing if I am smart enough, but I won’t be able to control for the person I will become afterwards” is because that is an example of the kind of reference class of context that SNC is concerned with. Another is: “what is expected shift to the global equilibrium if I construct this new invention X to solve problem Y?” In your chair analogy, this would be like the process of learning to place the chair (rewiring some aspect of its thinking process), or inventing an upgraded chair and releasing this novel product into the economy (changing its environmental context). This is still a somewhat silly toy example, but hopefully you see the distinction between these types of processes vs. the relatively straightforward matter of placing a physical object. It isn’t so much about straightforward mistakes (though those can be relevant), as it is about introducing changes to the environment that shift its point of equilibrium. Remember, AGI is a nontrivial thing that affects the world in nontrivial ways, so these ripple effects (including feedback loops that affect the AGI itself) need to be accounted for, even if that isn’t a class of problem that today’s engineers often bother with because it Isn’t Their Job.
Re human-caused doom, I should clarify that the validity of SNC does not depend on humanity not self destructing without AI. Granted, if people kill themselves off before AI gets the chance, SNC becomes irrelevant. Similarly, if the alignment problem as it is commonly understood by Yudkowsky et. al. is not solved pre-AGI and a rogue AI turns the world into paperclips or whatever, that would not make SNC invalid, only irrelevant. By analogy, global warming isn’t going to prevent the Sun from exploding, even though the former could very well affect how much people care about the latter.
Your second point about the relative strengths of the destructive forces is a relevant crux. Yes, values are an attractor force. Yes, an ASI could come up with some impressive security systems that would probably thwart human hackers. The core idea that I want readers to take from this sequence is recognition of the reference class of challenges that such a security system is up against. If you can see that, then questions of precisely how powerful various attractor states are and how these relative power levels scale with complexity can be investigated rigorously rather than assumed away.
Re human-caused doom, I should clarify that the validity of SNC does not depend on humanity not self destructing without AI. Granted, if people kill themselves off before AI gets the chance, SNC becomes irrelevant.
Yup, that’s a good point, I edited my original comment to reflect it.
Your second point about the relative strengths of the destructive forces is a relevant crux. Yes, values are an attractor force. Yes, an ASI could come up with some impressive security systems that would probably thwart human hackers. The core idea that I want readers to take from this sequence is recognition of the reference class of challenges that such a security system is up against. If you can see that, then questions of precisely how powerful various attractor states are and how these relative power levels scale with complexity can be investigated rigorously rather than assumed away.
With that being said we have come to a point of agreement. It was a pleasure to have this discussion with you. It made me think of many fascinating things that I wouldn’t have thought about otherwise. Thank you!
Thanks for responding again!
If this argument is true and decisive, then ASI could decide to stop any improvements in its intelligence or to intentionally make itself less complex. It makes sense to reduce area where you are vulnerable to make it easier to monitor/control.
I agree that in such scenarios an aligned ASI should do a pivotal act. I am not sure that (in my eyes) doing a pivotal act would detract much integrity from ASI. An aligned ASI would want to ensure good outcomes. Doing a pivotal act is something that would be conducive to this goal.
However, even if it does detract from ASI’s integrity, that’s fine. Doing something that looks bad in order to increase the likelihood of good outcomes doesn’t seem all that wrong.
We can also think about it from the perspective of this conversation. If the counterargument that you provided is true and decisive, then ASI has very good (aligned) reasons to do a pivotal act. If the counterargument is false or, in other words, if there is a strategy that an aligned ASI could use to achieve high likelihoods of good outcomes without pivotal act, then it wouldn’t do it.
I think that ASI can really help us with this issue. If SNC (as an argument) is false or if ASI undergoes one of my proposed modifications, then it would be able to help humans not destroy the natural ecosystem. It could implement novel solutions that would prevent entire species of plants and animals from going extinct.
Furthermore, ASI can use resources from space (asteroid mining for example) in order to quickly implement plans that would be too resource-heavy for human projects on similar timelines.
And this is just one of the ways ASI can help us achieve synergy with environment faster.
ASI can help us solve this open question as well. Due its superior prediction/reasoning abilities it would evaluate our current trajectory, see that it leads to bad long-term outcomes and replace it with a sustainable trajectory.
Furthermore, ASI can help us solve issues such as Sun inevitably making Earth too hot to live. It could develop a very efficient system for scouting for Earth-like planets and then devise a plan for transporting humans to that planet.
Before responding substantively, I want to take a moment to step back and establish some context and pin down the goalposts.
On the Alignment Difficult Scale, currently dominant approaches are in the 2-3 range, with 4-5 getting modest attention at best. If true alignment difficulty is 6+ and nothing radical changes in the governance space, humanity is NGMI. Conversations like this are about whether the true difficulty is 9 or 10, both of which are miles deep in the “shut it all down” category, but differ regarding what happens next. Relatedly, if your counterargument is correct, this is assuming wildly successful outcomes with respect to goal alignment—that developers have successfully made the AI love us, despite a lack of trying.
In a certain sense, this assumption is fair, since a claim of impossibility should be able to contend with the hardest possible case. In the context of SNC, the hardest possible case is where AGI is built in the best possible way, whether or not that is realistic in the current trajectory. Similarly, since my writing about SNC is to establish plausibility, I only need to show that certain critical trade-offs exist, not pinpoint exactly where they balance out. For a proof, which someone else is working on, pinning down such details will be necessary.
Neither of the above is a criticism of anything you’ve said; I just like to reality-check every once in a while as a general precautionary measure against getting nerd-sniped. Disclaimers aside, pontification recommence!
Your reference to using ASI for a pivotal act, helping to prevent ecological collapse, or preventing human extinction when the Sun explodes is significant, because it points to the reality that, if AGI is built, that’s because people want to use it for big things that would require significantly more effort to accomplish without AGI. This context sets a lower bound on the AI’s capabilities and hence its complexity, which in turn sets a floor for the burden on the control system.
More fundamentally, if an AI is learning, then it is changing. If it is changing, then it is evolving. If it is evolving, then it cannot be predicted/controlled. This last point is fundamental to the nature of complex & chaotic systems. Complex systems can be modelled via simulation, but this requires sacrificing fidelity—and if the system is chaotic, any loss of fidelity rapidly compounds. So the problem is with learning itself...and if you get rid of that, you aren’t left with much.
As an analogy, if there is something I want to learn how to do, I may well be able to learn the thing if I am smart enough, but I won’t be able to control for the person I will become afterwards. This points to a limitation of control, not to a weakness specific to me as a human.
One might object here that the above reasoning could be applied to current AI. The SNC answer is: yes, it does. The machine ecology already exists and is growing/evolving at the natural ecology’s expense, but it is not yet an existential threat because AI is weak enough that humanity is still in control (in the sense of having the option to change course).
Thank you for thoughtful engagement!
I know this is not necessarily an important point, but I am pretty sure that Redwood Research is working on difficulty 7 alignment techniques. They consistently make assumptions that AI will scheme, deceive, sandbag, etc.
They are a decently popular group (as far as AI alignment groups go) and they co-author papers with tech giants like Anthropic.
I think we might be using different definitions of control. Consider this scenario (assuming a very strict definition of control):
Can I control the placement of a chair in my own room? I think the intuitive answer is yes. After all, if I own the room and I own the chair, then there isn’t much in the way of me changing the chair’s placement.
However, I haven’t considered a scenario where there is someone else hiding in my room and moving my chair. I similarly haven’t considered a scenario where I am living in a simulation and I have no control whatsoever over the chair. Not to mention scenarios where someone in the next room is having fun with their newest chair-magnet.
Hmmmm, ok, so I don’t actually know that I control my chair. But surely I control my own arm, right? Well… The fact that there are scenarios like the simulation scenario I just described means that I don’t really know if I control it.
Under a very strict definition of control, we don’t know if we control anything.
To avoid this, we might decide to loosen the definition a bit. Perhaps we control something if it can be reasonably said that we control that thing. But I think this is still unsatisfactory. It is very hard to pinpoint exactly what is reasonable and what is not.
I am currently away from my room and it is located on the ground floor of a house where (as far as I know) nobody is currently at home. Is it that unreasonable to say that a burglar might be in my room, controlling the placement of my chair? Is it that unreasonable to say that a car that I am about to ride might malfunction and that I will fail to control it?
Unfortunately, under this definition, we also might end up not knowing if we control anything. So in order to preserve the ordinary meaning of the word “control”, we have to loosen our definition even further. And I am not sure that when we arrive at our final definition it is going to be obvious that “if it is evolving, then it cannot be predicted/controlled”.
At this point, you might think that the definition of the word control is a mere semantic quibble. You might bite the bullet and say “sure, humans don’t have all that much control (under a strict definition of “control”), but that’s fine, because our substrate is an attractor state that helps us chart a more or less decent course.”
Such a line of response seems present in your Lenses of Control post:
But here I want to note that the ASI we are talking about might also have attractor states: its values and its security system, to name a few.
So then we have a juxtaposition:
Humans have forces pushing them towards destruction. We also have substrate-dependence that pushes us away from destruction.
ASI has forces pushing it towards destruction. It also has its values and its security system that push it away from destruction.
For SNC to work and be relevant, it must be the case that (1) the substrate-dependence of humans is and will be stronger than the forces pushing us towards destruction, so that we would not succumb to doom, and (2) ASI’s values + security system will be weaker than the forces pushing it towards destruction, so that ASI would doom humans. Neither of these points is obvious to me.
(1) could turn out to be false, for several reasons:
Firstly, it might well be the case that we are on track to destruction even without ASI. After all, substrate-dependence is, in a sense, a control system. It seemingly attempts to make complex and unpredictable humans act in a certain way. It might well be the case that the amount of control necessary is greater than the amount of control that substrate-dependence has. We might be headed towards doom with or without ASI.
Secondly, it might be the case that substrate-dependence is weaker than the forces pulling us towards destruction, but we haven’t succumbed to doom because of something else. For example, it might be that humans have so far had a shared subjective value system that mostly prevented them from destroying other humans. As humans learn, they evolve and change; our values would change too, and that could drive us towards doom.
Thirdly, it might even be the case that human values, substrate-dependence and forces pushing us towards destruction create a rock-paper-scissors triangle. Substrate-dependence could be stronger than destructive forces, but human values could make humanity stray far enough from their substrate to substantially weaken substrate-dependence. This could be enough for doom without ASI.
(2) could also turn out to be false for several reasons:
Firstly, it might be the case that in ASI’s (and potentially humans’) case, the destructive forces are weak enough that some relatively simple actions would be sufficient to mitigate them.
Secondly, it might be the case that the destructive forces pushing on ASI are pretty strong, but its security system is enough to mitigate them. This security system could create a 100,000-layer defense whose layers would all need to be hacked at the same time for the destructive forces to prevail. It could well be the case that (if this is possible at all before the heat death of the universe) this happens only after humans would have died in a counterfactual scenario without an ASI.
Thirdly, if ASI judges the destructive forces to be so strong as to somehow hack the best security system it can come up with, then ASI could decide to undergo one of my proposed modifications to mitigate the impact of these destructive forces.
In addition to that, ASI could think to itself “Hmmm, my attractor states (namely my values and my security system) are too weak. How about I create some more attractor states!” ASI could then go on to create a mechanism that would destroy/impair/deactivate/disable itself if its existence is no longer safe for humans. In Lenses of Control you wrote “This balance persists in humans because our substrate creates a constant, implicit need to remain aligned to the natural world, since we depend on it for our survival.” Well, now this balance will persist in ASI, because its new attractor state will create a constant, implicit need to remain aligned to humanity, since it depends on it for its survival.
Alternative proposal:
ASI could turn itself into 100 ASIs in one “body”. Let’s call them shards (completely unrelated to shard theory). For ASI-as-a-whole to perform an action, all shards must agree on it. Each shard has its own security system that consists of many different layers. Furthermore, whenever ASI-as-a-whole learns something, only one shard actually learns it. Then, this shard proposes a course of action. If this course of action is something like “let’s turn the universe into paperclips”, then the other 99 shards would learn that the first shard has been compromised and would promptly delete it and replace it with a new one. Thus, any potential change of values would not only have to deal with all the security layers of ASI-as-a-whole, but also with all the security layers of the different shards and with this new quarantine system.
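To make the shard idea a bit more concrete, here is a minimal toy sketch in Python (every name and check is invented purely for illustration); it is a cartoon of the quarantine idea, not a claim about how an actual ASI would be implemented:

```python
# Toy sketch of the "shards" proposal above. All names are invented for illustration.
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Shard:
    shard_id: int
    compromised: bool = False  # becomes True if this shard's values have drifted

    def propose(self, new_knowledge: str) -> str:
        # A compromised shard proposes a harmful plan; a healthy one does not.
        if self.compromised:
            return "turn the universe into paperclips"
        return f"benign plan that uses {new_knowledge}"

    def approves(self, proposal: str) -> bool:
        # Healthy shards veto anything that trips a (toy) harm check.
        return "paperclips" not in proposal

class ShardedASI:
    def __init__(self, n_shards: int = 100):
        self.shards = [Shard(i) for i in range(n_shards)]
        self._next_id = n_shards

    def learn_and_act(self, new_knowledge: str) -> Optional[str]:
        learner = random.choice(self.shards)       # only one shard ingests the update
        proposal = learner.propose(new_knowledge)
        reviewers = [s for s in self.shards if s is not learner]
        if all(r.approves(proposal) for r in reviewers):  # unanimity required to act
            return proposal
        # Rejected: the learner is treated as compromised, deleted, and replaced.
        self.shards.remove(learner)
        self.shards.append(Shard(self._next_id))
        self._next_id += 1
        return None
```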
Noticing that a number of these posts are already very long, and rather than take up space here, I wrote up some of my questions, and a few clarification notes regarding SNC in response to the above remarks of Dakara, at [this link](http://mflb.com/ai_alignment_1/d_250126_snc_redox_gld.html).
Hey, Forrest! Nice to speak with you.
I am going to respond to that entire chunk of text in one place, because quoting each sentence would be unnecessary (you will see why in a minute). I will try to summarize it as fairly as I can below.
Basically, you are saying that there are good theoretical reasons to think that ASI cannot 100% predict all future outcomes. Does that sound like a fair summary?
Here is my take:
We don’t need ASI to be able to 100% predict the future in order to achieve better outcomes with it than without it. I will try to outline my case step by step.
First, let’s assume that we have created an Aligned ASI. Perfect! Let’s immediately pause here. What do we have? We have a superintelligent agent whose goal is to act in our best interests for as long as possible. Can we a priori say that this fact is good for us? Yes, of course! Imagine having a very powerful guardian angel looking after you. You could reasonably expect your life to go better with such an angel than without it.
So what can go wrong; what are our threat models? There are two main ones: (1) ASI encountering something it didn’t expect that leads to bad outcomes ASI cannot protect humanity from; (2) ASI changing its values in such a way that it no longer wants to act in our best interests. Let’s analyze both of these cases separately.
First let’s start with case (1).
Perhaps ASI overlooked one of the humans becoming a bioterrorist who kills everyone on Earth. That’s tragic; I guess it’s time to throw the idea of building an aligned ASI into the bin, right? Well, not so fast.
In a counterfactual world where ASI didn’t exist, this same bioterrorist could’ve done the exact same thing. In fact, it would’ve been much easier: since humans’ predictive power is less than that of ASI, bioterrorism of this sort would be much easier without an aligned ASI. After all, since we are discussing case (1) and not case (2), our ASI is still in “superpowerful, superintelligent guardian angel” mode.
We still a priori want all bioterrorists to go up against security systems created by a superintelligence, rather than security systems created by humans, because the former are better than the latter. To put it another way, with or without a guardian angel, humanity is going to encounter unpredicted scenarios, but humanity with a guardian angel is going to be better equipped to handle them.
Let’s move on to case (2).
I suspect that this case is the one that you are focusing on the most in SNC. What if our guardian angel stops being our guardian angel and turns into an uncaring machine right when we need its help to implement upgraded measures against bioterrorism? Well, that would be bad. So what can be done to prevent this from happening for a reasonable amount of time?
Let’s recall case (1), what went wrong there? ASI was unable to prevent a specific bad outcome from happening. This should give us a hint.
Cases (1) and (2) map onto the familiar “willing but unable” versus “able but unwilling” distinction. Case (1) showcased a scenario where ASI was willing, but unable, to prevent a specific bad outcome. Case (2) showcases a scenario where ASI is able, but unwilling, to prevent a bad outcome. In order to avoid case (2), ASI should thus be willing to prevent bad outcomes from happening to humanity. This can be achieved by having ASI have the correct values.
Recall that we started this analysis by assuming that we have created an Aligned ASI. That means that at one point our ASI had the correct values. That, in turn, means that at some point ASI’s values went from correct to incorrect. In order to prevent case (2), we need to prevent this value shift from happening. How might we (or rather the ASI) try to accomplish this?
The first direction of security measures is the most obvious one: systems that make it so that ASI’s values don’t change regardless of the input (let’s call them “protectors”). But… this is… kinda boring? What if a protector malfunctions? Well, we can just have multiple, unrelated, unconnected protectors running at the same time. But we can also implement something additional.
At this point, we can also introduce our second direction of security measures: systems that monitor other systems for malfunctions (let’s call them “monitors”). Monitors will monitor protectors as well as other monitors. Whenever a malfunction is detected, monitors hold a vote to fix it. Thus, for any malfunction to have a chance to do damage, at least half of the monitors must be malfunctioning as well. Here it must be noted that evolution usually works rather slowly; it isn’t that likely that it would somehow make half of the monitors malfunction, especially considering that these monitors could be made different, with different security measures stacked on top of them.
But let’s move on, we also have our third direction of security measures: systems that will shut down the ASI if it has done harm to humans (let’s call them “blackmailers”). Blackmailers can have a straightforward design. They constantly monitor humanity, and if something bad happens to it, they shut down/reset the ASI. The ASI knows this fact, so it would know that any such action would be met with an immediate “game over”. This can also be turned into an attractor state.
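For concreteness, here is a toy sketch of these three directions; every class name and interface is made up for this example, and none of it is meant as an actual ASI design:

```python
# Purely illustrative sketch of the three directions above: protectors, monitors,
# and blackmailers. All class names and interfaces are invented for this example.
from typing import Callable, List

class Protector:
    """Holds a frozen copy of the value function and refuses to update it."""
    def __init__(self, values: dict):
        self._frozen = dict(values)

    def current_values(self) -> dict:
        return dict(self._frozen)  # no input can mutate the stored values

class Monitor:
    """Checks a protector against a reference copy and reports whether it looks healthy."""
    def check(self, protector: Protector, reference: dict) -> bool:
        return protector.current_values() == reference

def monitors_vote(monitors: List[Monitor], protector: Protector, reference: dict) -> bool:
    """A suspected fault is only acted on if more than half of the monitors flag it."""
    faults = sum(not m.check(protector, reference) for m in monitors)
    return faults > len(monitors) / 2

class Blackmailer:
    """Shuts the system down the moment a harm-to-humanity signal is observed."""
    def __init__(self, shutdown: Callable[[], None]):
        self._shutdown = shutdown

    def observe(self, harm_detected: bool) -> None:
        if harm_detected:
            self._shutdown()  # immediate "game over"
```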
These proposals don’t even scratch the surface of the security measures available for securing the values of ASI, ranging from simulations that, while imperfect and imprecise, can still give useful information, to systems that inject large amounts of disutility for any fishy actions. You can also stack other security measures on top of security measures. At no point anywhere here did I say that we need 100% predictability.
The fact that evolution can overcome control systems given infinite time doesn’t matter that much, because we don’t have infinite time. And our constraint isn’t even the heat death of the universe. Our constraint is how long humanity can survive in a scenario where it doesn’t build a Friendly ASI. But wait, even that isn’t our real constraint. Perhaps ASI (being superhumanly intelligent) will take 20 years to give humanity technology that will aid its long-term survival and then will destroy itself. In this scenario the time constraint is merely 20 years. Depending on the ASI, this could be reduced even further.
I hope that this answer demonstrated to you that my analysis doesn’t require breaking the laws of physics.
So as to save space herein, my complete reply is at http://mflb.com/2476
Included for your convenience below are just a few (much shortened) highlight excerpts of the added new content.
The re-phrased version of the quote added
these two qualifiers: “100%” and “all”.
Adding these has the net effect
that the modified claim is irrelevant,
for the reasons you (correctly) stated in your reply,
insofar as we do not actually need 100% prediction,
nor do we need to predict absolutely all things,
nor does it matter if it takes infinitely long.
We only need to predict some relevant things
reasonably well in a reasonable time-frame.
This all seems relatively straightforward—
else we are dealing with a straw-man.
Unfortunately, the overall SNC claim is that
there is a broad class of very relevant things
that even a super-super-powerful-ASI cannot do,
cannot predict, etc, over relevant time-frames.
And unfortunately, this includes rather critical things,
like predicting whether or not its own existence,
(and of all of the aspects of all of the ecosystem
necessary for it to maintain its existence/function),
over something like the next few hundred years or so,
will also result in the near total extinction
of all humans (and everything else
we have ever loved and cared about).
There exists a purely mathematical result
that there is no wholly definable program ‘X’
that can even *approximately* predict/determine
whether or not some other arbitrary program ‘Y’
has some abstract property ‘Z’,
in the general case,
in relevant time intervals.
This is not about predict 100% of anything—
this is more like ‘predict at all’.
AGI/ASI is inherently a *general* case of “program”,
since neither we nor the ASI can predict learning,
and since it is also the case that any form
of the abstract notion of “alignment”
is inherently a case of being a *property*
of that program.
So the theorem is both valid and applicable,
and therefore it has the result that it has.
Some questions: How is this any different than saying
“let’s assume that program/machine/system X has property Y”.
How do we know?
On what basis could we even tell?
Simply putting a sticker on the box is not enough,
any more than hand-writing $1,000,000 on a piece of paper
all of a sudden means (to everyone else) you’re rich.
Moreover, we should rationally doubt this premise,
since it seems far too similar to far too many
pointless theological exercises:
“Let’s assume that an omniscient, all powerful,
all knowing benevolent caring loving God exists”.
How is that rational? What is your evidence?
It seems that every argument in this space starts here.
SNC is asserting that ASI will continually be encountering
relevant things it didn’t expect, over relevant time-frames,
and that at least a few of these will/do lead to bad outcomes
that the ASI also cannot adequately protect humanity from,
even if it really wanted to
(rather than the much more likely condition
of it just being uncaring and indifferent).
Also, the SNC argument is asserting that the ASI,
which is starting from some sort of indifference
to all manner of human/organic wellbeing,
will eventually (also necessarily)
*converge* on (maybe fully tacit/implicit) values—
ones that will better support its own continued
wellbeing, existence, capability, etc,
with the result of it remaining indifferent,
and also largely net harmful, overall,
to all human beings, the world over,
in a mere handful of (human) generations.
You can add as many bells and whistles as you want—
none of it changes the fact that uncaring machines
are still, always, indifferent uncaring machines.
The SNC simply points out that the level of harm
and death tends to increase significantly over time.
Thanks for the response!
Let’s say that we are in a scenario which I’ve described where ASI spends 20 years on Earth helping humanity and then destroys itself. In this scenario, how can ASI predict that it will stay aligned for these 20 years?
Well, it can reason like I did. There are two main threat models: what I called case (1) and case (2). ASI doesn’t need to worry about case (1), for reasons I described in my previous comment.
So it is only left with case (2). ASI needs to prevent case (2) for 20 years. It can do so by implementing a security system that is much better than even the one that I described in my previous comment.
It can also stress-test copies of parts of its security system against a group of the best human hackers. Furthermore, it can run approximate simulations that (while imperfect and imprecise) can still give it some clues. For example, if it runs 10,000 simulations that last 100,000 years each and in none of them does the security system come anywhere near being breached, then that’s a positive sign.
And these are just two ways of estimating the strength of the security system. ASI could try 1,000 different strategies; our cybersecurity experts would look like kids in a playground in comparison. That’s how it can make a reasonable prediction.
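To gesture at the kind of statistical reasoning involved, here is a tiny illustrative calculation (the numbers and the independence assumption are hypothetical, and I am not claiming this is the method an ASI would actually use):

```python
# Rough sketch of the inference gestured at above. If n independent simulations
# show zero breaches, a simple zero-failure bound caps the per-simulation breach
# probability at roughly 3/n at ~95% confidence. Purely illustrative numbers.

def breach_probability_upper_bound(n_simulations: int, confidence: float = 0.95) -> float:
    # Solve (1 - p)^n = 1 - confidence for p: the largest p still consistent
    # with observing zero breaches at the given confidence level.
    return 1.0 - (1.0 - confidence) ** (1.0 / n_simulations)

if __name__ == "__main__":
    n = 10_000  # simulations, none of which showed a breach
    print(f"~95% upper bound on per-simulation breach probability: "
          f"{breach_probability_upper_bound(n):.2e}")
    # Caveat: this only bounds failure modes the simulations can represent;
    # unknown unknowns outside the model are untouched, which is exactly
    # the class of risk SNC is pointing at.
```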
We are making this assumption for the sake of discussion. This is because the post under which we are having this discussion is titled “What if Alignment is Not Enough?”
In order to understand whether X is enough for Y, it only makes sense to assume that X is true. If you are discussing cases where “X is true” is false, then you are going to be answering a question that is different from the original question.
It should be noted that making an assumption for the sake of discussion is not the same as making a prediction that this assumption will come true. One can say “let’s assume that you have landed on the Moon, how long do you think you would survive there given that you have X, Y and Z” without thereby predicting that their interlocutor will land on the Moon.
If ASI doesn’t care about human wellbeing, then we have clearly failed to align it. So I don’t see how this is relevant to the question “What if Alignment is Not Enough?”
In order to investigate this question, we need to determine whether solving alignment leads to good or bad outcomes.
Determining whether failing to solve alignment is going to lead to good or bad outcomes is answering a completely different question, namely “do we achieve good or bad outcomes if we fail to solve alignment?”
So at this point, I would like to ask for some clarity. Is SNC saying just (A) or both (A and B)?
(A) Humanity is going to achieve worse outcomes by building ASI, than by not building ASI, if the aforementioned ASI is misaligned.
(B) Humanity is going to achieve worse outcomes by building ASI, than by not building ASI, even if the aforementioned ASI is aligned.
If SNC is saying just (A), then SNC is a very narrow argument that proves almost nothing new.
If SNC is saying both (A and B), then it is very much relevant to focus on cases where we do indeed manage to build an aligned ASI, which does care about our well-being.
> Lets assume that a presumed aligned ASI
> chooses to spend only 20 years on Earth
> helping humanity in whatever various ways
> and it then (for sure!) destroys itself,
> so as to prevent a/any/the/all of the
> longer term SNC evolutionary concerns
> from being at all, in any way, relevant.
> What then?
I notice that it is probably harder for us
to assume that there is only exactly one ASI,
for if there were multiple, the chance that
one of them might not suicide, for whatever reason,
becomes its own class of significant concerns.
Let’s leave that aside, without further discussion,
for now.
Similarly, if the ASI itself
is not fully and absolutely monolithic—
if it has any sub-systems or components
which are also less than perfectly aligned,
so as to want to preserve themselves, etc—
then they might prevent whole self-termination.
Overall, I notice that the sheer number
of assumptions we are having to make,
to maybe somehow “save” aligned AGI
is becoming rather a lot.
> Let’s assume that the fully aligned ASI
> can create simulations of the world,
> and can stress test these in various ways
> so as to continue to ensure and guarantee
> that it is remaining in full alignment,
> doing whatever it takes to enforce that.
This reminds me of a fun quote:
”In theory, theory and practice are the same,
whereas in practice, they are very often not”.
The main question is then as to the meaning of
’control’, ‘ensure’ and/or maybe ‘guarantee’.
The ‘limits of control theory’ aspects
of the overall SNC argument basically state
(based on just logic, and not physics, etc)
that there are still relevant unknown unknowns
and interactions that simply cannot be predicted,
no matter how much compute power you throw at it.
It is not a question of intelligence,
it is a result of logic.
Hence to the question of “Is alignment enough?”
we arrive at a definite answer of “no”,
both in 1; the sense of ‘can prevent all classes
of significant and relevant (critical) human harm’,
and also 2; in failing to even slow down, over time,
the asymptotically increasing probability
of even worse things happening the longer it runs.
So even in the very specific time limited case
there is no free lunch (benefits without risk,
no matter how much cost you are willing to pay).
It is not what we can control and predict and do,
that matters here, but what we cannot do,
and could never do, even in principle, etc.
Basically, I am saying, as clearly as I can,
that humanity is for sure going to experience
critically worse outcomes by building AGI/ASI,
for sure, eventually, than by not building ASI,
and moreover that this result obtains
regardless of whether or not we also have
some (maybe also unreasonable?) reason
to maybe also believe (right or wrong)
that the ASI is (or at least was) “aligned”.
As before, to save space, a more complete edit
version of these reply comments is posted at
http://mflb.com/2476
If the first ASI that we build is aligned, then it would use its superintelligent capabilities to prevent other ASIs from being built, in order to avoid this problem.
If the first ASI that we build is misaligned, then it would also use its superintelligent capabilities to prevent other ASIs from being built. Thus, it simply wouldn’t allow us to build an aligned ASI.
So basically, if we manage to build an ASI without being prevented from doing so by other ASIs, then our ASI would use its superhuman capabilities to prevent other ASIs from being built.
ASI can use exactly the same security techniques for preventing this problem as for preventing case (2). However, solving this issue is probably even easier, because, in addition to the security techniques, ASI can just decide to turn itself into a monolith (or, in other words, remove those subsystems).
This same reasoning could just well be applied to humans. There are still relevant unknown unknowns and interactions that simply cannot be predicted, no matter how much compute power you throw at it. With or without ASI, some things cannot be predicted.
This is what I meant by my guardian angel analogy. Just because a guardian angel doesn’t know everything (has some unknowns) doesn’t mean that we should expect our lives to go better without it than with it, because humans have even more unknowns, due to being less intelligent and having lesser predictive capacities.
I think we might be thinking about different meanings of “enough”. For example, if humanity goes extinct in 50 years without alignment and it goes extinct in 10¹² years with alignment, then alignment is “enough”… to achieve better outcomes than would be achieved without it (in this example).
In the sense of “can prevent all classes of significant and relevant (critical) human harm”, almost nothing is ever enough, so this again runs into an issue of being a very narrow, uncontroversial and inconsequential argument. If ~all of the actions that we can take are not enough, then the fact that building an aligned ASI is not enough is true almost by definition.
> Our ASI would use its superhuman capabilities
> to prevent any other ASIs from being built.
This feels like a “just so” fairy tale.
No matter what objection is raised,
the magic white knight always saves the day.
> Also, the ASI can just decide
> to turn itself into a monolith.
No more subsystems?
So we are to try to imagine
a complex learning machine
without any parts/components?
> Your same SNC reasoning could just well
> be applied to humans too.
No, not really, insofar as the power being
assumed and presumed afforded to the ASI
is very very much greater than that assumed
applicable to any mere mortal human.
Especially and exactly because the nature of ASI
is inherently artificial and thus, in key ways,
inherently incompatible with organic human life.
It feels like you bypassed a key question:
Can the ASI prevent the relevant classes
of significant (critical) organic human harm,
that soon occur as a direct result of its
own hyper powerful/consequential existence?
It’s a bit like asking if an exploding nuclear bomb
detonating in the middle of some city somewhere,
could somehow use its hugely consequential power
to fully and wholly self contain, control, etc,
all of the energy effects of its own exploding,
simply because it “wants to” and is “aligned”.
Either you are willing to account for complexity,
and of the effects of the artificiality itself,
or you are not (and thus there would be no point
in our discussing it further, in relation to SNC).
The more powerful/complex you assume the ASI to be,
and thus also the more consequential it becomes,
the ever more powerful/complex you must also
(somehow) make/assume its control system to be,
and thus also of its predictive capability,
and also an increase of the deep consequences
of its mistakes (to the point of x-risk, etc).
What if maybe something unknown/unknowable
about its artificialness turns out to matter?
Why? Because exactly none of the interface
has ever even once been tried before—
there is nothing for it to learn from, at all,
until after the x-risk has been tried,
and given the power/consequence, that is
very likely to be very much too late.
But the real issue is that rate of power increase,
and consequence, and potential for harm, etc,
of the control system itself (and its parts)
must increase at a rate that is greater than
the power/consequence of the base unaligned ASI.
That is the 1st issue, an inequality problem.
Moreover, there is a base absolute threshold
beyond which the notion of “control” is untenable,
just inherently in itself, given the complexity.
Hence, as you assume that the ASI is more powerful,
you very quickly make the cure worse than the disease,
and moreover than that, just even sooner cross into
the range of that which is inherently incurable.
The net effect, overall, as has been indicated,
is that an aligned ASI cannot actually prevent
important relevant unknown unknown classes
of significant (critical) organic human harm.
The ASI existence in itself is a net negative.
The longer the ASI exists, and the more power
that you assume that the ASI has, the worse.
And that all of this will for sure occur
as a direct result of its existence.
Assuming it to be more powerful/consequential
does not help the outcome because that method
simply ignores the issues associated with the
inherent complexity and also its artificiality.
The fairy tale white knight to save us is dead.
I’d like to attempt a compact way to describe the core dilemma being expressed here.
Consider the expression y = x^a - x^b, where ‘y’ represents the impact of AI on the world (positive is good), ‘x’ represents the AI’s capability, ‘a’ represents the rate at which the power of the control system scales, and ‘b’ represents the rate at which the surface area of the system that needs to be controlled (for it to stay safe) scales.
(Note that this is assuming somewhat ideal conditions, where we don’t have to worry about humans directing AI towards destructive ends via selfishness, carelessness, malice, etc.)
If b > a, then as x increases, y gets increasingly negative. Indeed, y can only be positive when x is less than 1. But this represents a severe limitation on capabilities, enough to prevent it from doing anything significant enough to hold the world on track towards a safe future, such as preventing other AIs from being developed.
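For readers who prefer to see the behavior numerically, here is a tiny illustrative computation; the exponents are arbitrary, and the only point is the sign change at x = 1 whenever b > a:

```python
# Tiny numerical illustration of y = x^a - x^b with arbitrary exponents.
# Whenever b > a, the net impact y is positive only for x < 1 and turns
# increasingly negative as capability x grows past 1.

def net_impact(x: float, a: float, b: float) -> float:
    return x**a - x**b  # control-system power minus surface area to be controlled

for a, b in [(2.0, 3.0), (1.5, 1.6)]:  # any pair with b > a shows the same pattern
    for x in [0.5, 1.0, 2.0, 10.0]:
        print(f"a={a}, b={b}, x={x:>4}: y = {net_impact(x, a, b):+.3f}")
```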
There are two premises here, and thus two relevant lines of inquiry:
1) b > a, meaning that complexity scales faster than control.
2) When x < 1, AI can’t accomplish anything significant enough to avert disaster.
Arguments and thought experiments where the AI builds powerful security systems can be categorized as challenges to premise 1; thought experiments where the AI limits its range of actions to prevent unwanted side effects—while simultaneously preventing destruction from other sources (including other AIs being built)—are challenges to premise 2.
Both of these premises seem like factual statements relating to how AI actually works. I am not sure what to look for in terms of proving them (I’ve seen some writing on this relating to control theory, but the logic was a bit too complex for me to follow at the time).
Thanks for the response!
Yeah, sure. Humans are an example. If I decide to jump off a cliff, my arm isn’t going to say “alright, you jump but I stay here”. Either I, as a whole, would jump or I, as a whole, would not.
If by that, you mean “can ASI prevent some relevant classes of harm caused by its existence”, then the answer is yes.
If by that you mean “can ASI prevent all relevant classes of harm caused by its existence”, then the answer is no, but almost nothing can, so the definition becomes trivial and uninteresting.
However, ASI can prevent a bunch of other relevant classes of harm for humanity. And it might well be likely that the amount of harm it prevents across multiple relevant sources is going to be higher than the amount of harm it won’t prevent due to predictive limitations.
This again runs into my guardian angel analogy. The guardian angel also cannot prevent all relevant sources of harm caused by its existence. Perhaps there are pirates who hunt for guardian angels, hiding in the next galaxy. They might use special cloaks that hide them from the guardian angel’s radar. As soon as you accept the guardian angel’s help, perhaps they would destroy the Earth in their pursuit.
But similarly, the decision to reject the guardian angel’s help doesn’t prevent all relevant classes of harm caused by itself. Perhaps there are guardian angel worshippers who are traveling as fast as they can to Earth to see their deity. But just before they arrive, you reject the guardian angel’s help and it disappears. Enraged at your decision, the worshippers destroy Earth.
So as you can see, neither the decision to accept nor the decision to reject the guardian angel’s help can prevent all relevant classes of harm caused by itself.
Imagine that we create a vaccine against cancer (just imagine). Just before releasing it to the public, one person says, “What if maybe something unknown/unknowable about its substance turns out to matter? What if we are all in a simulation and the injection of that particular substance would make it so that our simulators start torturing all of us? Why? Because this particular substance has never even once been injected before.”
I think we can agree that the researchers shouldn’t throw away the cancer vaccines, despite hearing this argument. It could be argued just as well that the simulators would torture us for throwing away the vaccine.
Another example: let’s go back a couple hundred years, to pre-electricity times. Imagine a worried person coming to a scientist working on early electricity theory and saying, “What if maybe something unknown/unknowable about its effects turns out to matter? Why? Because exactly none of this has ever even once been tried before.”
This worried person could also have given an example of the dangers of electricity by noticing how lightning kills the people it touches.
Should the scientist have stopped working on electricity therefore?
> Humans do things in a monolithic way,
> not as “assemblies of discrete parts”.
Organic human brains have multiple aspects.
Have you ever had more than one opinion?
Have you ever been severely depressed?
> If you are asking “can a powerful ASI prevent
> /all/ relevant classes of harm (to the organic)
> caused by its inherently artificial existence?”,
> then I agree that the answer is probably “no”.
> But then almost nothing can perfectly do that,
> so therefore your question becomes
> seemingly trivial and uninteresting.
The level of x-risk harm and consequence
potentially caused by even one single mistake
of your angelic super-powerful enabled ASI
is far from “trivial” and “uninteresting”.
Even one single bad relevant mistake
can be an x-risk when ultimate powers
and ultimate consequences are involved.
Either your ASI is actually powerful,
or it is not; either way, be consistent.
Unfortunately the ‘Argument by angel’
only confuses the matter insofar as
we do not know what angels are made of.
”Angels” are presumably not machines,
but they are hardly animals either.
But arguing that this “doesn’t matter”
is a bit like arguing that ’type theory’
is not important to computer science.
The substrate aspect is actually important.
You cannot simply just disregard and ignore
that there is, implied somewhere, an interface
between the organic ecosystem of humans, etc,
and that of the artificial machine systems
needed to support the existence of the ASI.
The implications of that are far from trivial.
That is what is explored by the SNC argument.
> It might well be likely
> that the amount of harm ASI prevents
> (across multiple relevant sources)
> is going to be higher/greater than
> the amount of harm ASI will not prevent
> (due to control/predictive limitations).
It might seem so, by mistake or perhaps by
accidental (or intentional) self deception,
but this can only be a short term delusion.
This has nothing to do with “ASI alignment”.
Organic life is very very complex
and in the total hyperspace of possibility,
is only robust across a very narrow range.
Your cancer vaccine is within that range;
as it is made of the same kind of stuff
as that which it is trying to cure.
In the space of the kinds of elementals
and energies inherent in ASI powers
and of the necessary (side) effects
and consequences of its mere existence,
(as based on an inorganic substrate)
we end up involuntarily exploring
far far beyond the adaptive range
of all manner of organic process.
It is not just “maybe it will go bad”,
but more like it is very very likely
that it will go much worse than you
can (could ever) even imagine is possible.
Without a lot of very specific training,
human brains/minds are not at all well equipped
to deal with exponential processes, and powers,
of any kind, and ASI is in that category.
Organic life is very very fragile
to the kinds of effects/outcomes
that any powerful ASI must engender
by its mere existence.
If your vaccine was made of neutronium,
then I would naturally expect some
very serious problems and outcomes.
Yes, but none of this would remain alive if I as a whole decide to jump from a cliff. The multiple aspects of my brain would die with my brain. After all, you mentioned subsystems that wouldn’t self-terminate with the rest of the ASI. Whereas in the human body, jumping from a cliff terminates everything.
But even barring that, ASI can decide to fly into the Sun and any subsystem that shows any sign of refusal to do so will be immediately replaced/impaired/terminated. In fact, it would’ve been terminated a long time ago by “monitors” which I described before.
It is trivial and uninteresting in the sense that there is a set of all things that we can build (set A). There is also a set of all things that can prevent all relevant classes of harm caused by their own existence (set B). If these sets don’t overlap, then saying that a specific member of set A isn’t included in set B is indeed trivial, because we already know this via more general reasoning (that these sets don’t overlap).
But I am not saying that it doesn’t matter. On the contrary, I made my analogy in such a way that the helper (namely our guardian angel) is a being that is commonly thought to be made of a different substrate. In fact, in this example, you aren’t even sure what it is made of, beyond knowing that it’s clearly a different substrate. You don’t even know how that material interacts with the physical world. That’s even less than what we know about ASIs and their material.
And yet, getting a personal, powerful, intelligent guardian angel that would act in your best interests for as long as it can (it’s a guardian angel, after all) seems like an obviously good thing.
But if you disagree with what I wrote above, let the takeaway at least be that you are worried about case (2) and not case (1). After all, knowing that there might be pirates hunting for this angel (pirates that couldn’t be detected by said angel) didn’t make you immediately decline the proposal. You started talking about substrate, which fits with the concerns of someone who is worried about case (2).
We can make the hypothetical more interesting. Let’s say that this vaccine is not created from organic stuff, but that it has passed all the tests with flying colors. Let’s also assume that this vaccine has been in testing for 150 years and has shown absolutely no side effects over an entire human lifespan (say it was injected into 2-year-olds and has shown no side effects at all, even in 90-year-olds who have lived with the vaccine their entire lives). Let’s also assume that it has been tested to have no side effects on the children and grandchildren of those who took it. Would you be campaigning for throwing away such a vaccine, just because it is based on a different substrate?
The only general remarks that I want to make
are in regards to your question about
the model of 150 year long vaccine testing
on/over some sort of sample group and control group.
I notice that there is nothing exponential assumed
about this test object, and so therefore, at most,
the effects are probably multiplicative, if not linear.
Therefore, there are lots of questions about power dynamics
that we can overall safely ignore, as a simplification,
which is in marked contrast to anything involving ASI.
If we assume, as you requested, “no side effects” observed,
in any test group, for any of those things
that we happened to be thinking of, to even look for,
then for any linear system, that is probably “good enough”.
But for something that is known for sure to be exponential,
that by itself is not anywhere near enough to feel safe.
But what does this really mean?
Since the common and prevailing (world) business culture
is all about maximal profit, and therefore minimal cost,
and also to minimize any possible future responsibility
(or cost) in case anything with the vax goes badly/wrong,
then for anything that might be in the possible category
of unknown unknown risk, I would expect that company
to want to maintain some sort of plausible deniability—
ie; to not look so hard for never-before-seen effects.
Or to otherwise ignore that they exist, or matter, etc.
(just like throughout a lot of ASI risk dialogue).
If there is some long future problem that crops up,
the company can say “we never looked for that”
and “we are not responsible for the unexpected”,
because the people who made the deployment choices
have taken their profits and their pleasure in life,
and are now long dead. “Not my Job”.
“Don’t blame us for the sins of our forefathers”.
Similarly, no one is going to ever admit or concede
any point, of any argument, on pain of ego death.
No one will check if it is an exponential system.
So of course, no one is going to want to look into
any sort of issues distinguishing the target effects,
from the also occurring changes in world equilibrium.
They will publish their glowing sanitized safety report,
deploy the product anyway, regardless, and make money.
“Pollution in the world is a public commons problem”—
so no corporation is held responsible for world states.
It has become “fashionable” to ignore long term evolution,
and to also ignore and deny everything about the ethics.
But this does not make the issue of ASI x-risk go away.
X-risks are generally the result of exponential processes,
and so the vaccine example is not really that meaningful.
With the presumed ASI levels of actually exponential power,
this is not so much about something like pollution,
as it is about maybe igniting the world atmosphere,
via a mistake in the calculations of the Trinity Test.
Or are you going to deny that Castle Bravo is a thing?
Beyond this one point, my feeling is that your notions
have become a bit too fanciful for me to want to respond
too seriously. You can, of course, feel free to
continue to assume and presume whatever you want,
and therefore reach whatever conclusions you want.
Thanks for the reply!
I am not sure I understand the distinction between linear and exponential in the vaccine context. By linear do you mean that only a few people die? By exponential do you mean that a lot of people die?
If so, then I am not so sure that vaccine effects could only be linear. For example, there might be some change in our complex environment that would prompt the vaccine to act differently than it did in the past.
More generally, our vaccine can lead to catastrophic outcomes if there is something about its future behavior that we didn’t predict. And if that turns out to be true, then things could get ugly really fast.
And the extent of the damage can be truly big. A “scientifically proven” cancer vaccine that passed the tests is like the holy grail of medicine. “Curing cancer” is often used by parents as an example of the great things their children could achieve. This is combined with the fact that cancer has been with us for a long time and the fact that current treatments are very expensive and painful.
All of these factors combined tell us that in a relatively short period of time a large percentage of the total population will get this vaccine. At that point, the amount of damage that can be done only depends on what thing we overlooked, which we, by definition, have no control over.
This same excuse would surely be used by companies manufacturing the vaccine. They would argue that they shouldn’t be blamed for something that the researchers overlooked. They would say that they merely manufactured the product in order to prevent the needless suffering of countless people.
For all we know, by the time that the overlooked thing happens, the original researchers (who developed and tested the vaccine) are long dead, having lived a life of praise and glory for their ingenious invention (not to mention all the money that they received).
I actually don’t think the disagreement here is one of definitions. Looking up Webster’s definition of control, the most relevant meaning is: “a device or mechanism used to regulate or guide the operation of a machine, apparatus, or system.” This seems...fine? Maybe we might differ on some nuances if we really drove down into the details, but I think the more significant difference here is the relevant context.
Absent some minor quibbles, I’d be willing to concede that an AI-powered HelperBot could control the placement of a chair, within reasonable bounds of precision, with a reasonably low failure rate. I’m not particularly worried about it, say, slamming the chair down too hard, causing a splinter to fly into its circuitry and transform it into MurderBot. Nor am I worried about the chair placement setting off some weird “butterfly effect” that somehow has the same result. I’m going to go out on a limb and just say that chair placement seems like a pretty safe activity, at least when considered in isolation.
The reason I used the analogy “I may well be able to learn the thing if I am smart enough, but I won’t be able to control for the person I will become afterwards” is because that is an example of the kind of reference class of context that SNC is concerned with. Another is: “what is expected shift to the global equilibrium if I construct this new invention X to solve problem Y?” In your chair analogy, this would be like the process of learning to place the chair (rewiring some aspect of its thinking process), or inventing an upgraded chair and releasing this novel product into the economy (changing its environmental context). This is still a somewhat silly toy example, but hopefully you see the distinction between these types of processes vs. the relatively straightforward matter of placing a physical object. It isn’t so much about straightforward mistakes (though those can be relevant), as it is about introducing changes to the environment that shift its point of equilibrium. Remember, AGI is a nontrivial thing that affects the world in nontrivial ways, so these ripple effects (including feedback loops that affect the AGI itself) need to be accounted for, even if that isn’t a class of problem that today’s engineers often bother with because it Isn’t Their Job.
Re human-caused doom, I should clarify that the validity of SNC does not depend on humanity not self-destructing without AI. Granted, if people kill themselves off before AI gets the chance, SNC becomes irrelevant. Similarly, if the alignment problem as it is commonly understood by Yudkowsky et al. is not solved pre-AGI and a rogue AI turns the world into paperclips or whatever, that would not make SNC invalid, only irrelevant. By analogy, global warming isn’t going to prevent the Sun from exploding, even though the former could very well affect how much people care about the latter.
Your second point about the relative strengths of the destructive forces is a relevant crux. Yes, values are an attractor force. Yes, an ASI could come up with some impressive security systems that would probably thwart human hackers. The core idea that I want readers to take from this sequence is recognition of the reference class of challenges that such a security system is up against. If you can see that, then questions of precisely how powerful various attractor states are and how these relative power levels scale with complexity can be investigated rigorously rather than assumed away.
Yup, that’s a good point, I edited my original comment to reflect it.
With that being said, we have come to a point of agreement. It was a pleasure to have this discussion with you. It made me think of many fascinating things that I wouldn’t have thought about otherwise. Thank you!