Dakara

Karma: 112

Dakara Jan 31, 2025, 10:05 AM
1 point
0
in reply to: flandry39’s comment on: What if Alignment is Not Enough?
Thanks for the reply!
The only general remarks that I want to make
are in regards to your question about
the model of 150 year long vaccine testing
on/over some sort of sample group and control group.
I notice that there is nothing exponential assumed
about this test object, and so therefore, at most,
the effects are probably multiplicative, if not linear.
Therefore, there are lots of questions about power dynamics
that we can overall safely ignore, as a simplification,
which is in marked contrast to anything involving ASI.
If we assume, as you requested, “no side effects” observed,
in any test group, for any of those things
that we happened to be thinking of, to even look for,
then for any linear system, that is probably “good enough”.
I am not sure I understand the distinction between linear and exponential in the vaccine context. By linear do you mean that only few people die? By exponential do you mean that a lot of people die?
If so, then I am not so sure that vaccine effects could only be linear. For example, there might be some change in our complex environment that would prompt the vaccine to act differently than it did in the past.
More generally, our vaccine can lead to catastrophic outcomes if there is something about its future behavior that we didn’t predict. And if that turns out to be true, then things could go ugly really fast.
And the extent of the damage can be truly big. “Scientifically proven” cancer vaccine that passed the tests is like the holy grail of medicine. “Curing cancer” is often used by parents as an example of the great things their children could achieve. This is combined with the fact that cancer has been with us for a long time and the fact that the current treatment is very expensive and painful.
All of these factors combined tell us that in a relatively short period of time a large percentage of the total population will get this vaccine. At that point, the amount of damage that can be done only depends on what thing we overlooked, which we, by definition, have no control over.
If there is some long future problem that crops up,
the company can say “we never looked for that”
and “we are not responsible for the unexpected”,
because the people who made the deployment choices
have taken their profits and their pleasure in life,
and are now long dead. “Not my Job”.
“Don’t blame us for the sins of our forefathers”.
Similarly, no one is going to ever admit or concede
any point, of any argument, on pain of ego death.
This same excuse would surely be used by companies manufacturing the vaccine. They would argue that they shouldn’t be blamed for something that the researchers overlooked. They would say that they merely manufactured the product in order to prevent the needless suffering of countless people.
For all we know, by the time that the overlooked thing happens, the original researchers (who developed and tested the vaccine) are long dead, having lived a life of praise and glory for their ingenious invention (not to mention all the money that they received).

Dakara Jan 30, 2025, 10:48 AM
1 point
0
in reply to: flandry39’s comment on: What if Alignment is Not Enough?
Organic human brains have multiple aspects. Have you ever had more than one opinion? Have you ever been severely depressed?
Yes, but none of this would remain alive if I as a whole decide to jump from a cliff. My multiple aspects of my brain would die with my brain. After all, you mentioned subsystems that wouldn’t self terminate with the rest of the ASI. Whereas in human body, jumping from a cliff terminates everything.
But even barring that, ASI can decide to fly into the Sun and any subsystem that shows any sign of refusal to do so will be immediately replaced/impaired/terminated. In fact, it would’ve been terminated a long time ago by “monitors” which I described before.
The level of x-risk harm and consequence
potentially caused by even one single mistake
of your angelic super-powerful enabled ASI
is far from “trivial” and “uninteresting”.
Even one single bad relevant mistake
can be an x-risk when ultimate powers
and ultimate consequences are involved.
It is trivial and uninteresting in a sense that there is a set of all things that we can build (set A). There is also a set of all things that can prevent all relevant classes of harm caused by its existence (set B). If these sets don’t overlap, then saying that a specific member of set A isn’t included in set B is indeed trivial, because we already know this via a more general reasoning (that these sets don’t overlap).
Unfortunately the ‘Argument by angel’
only confuses the matter insofar as
we do not know what angels are made of.
“Angels” are presumably not machines,
but they are hardly animals either.
But arguing that this “doesn’t matter”
is a bit like arguing that ‘type theory’
is not important to computer science.
The substrate aspect is actually important.
You cannot simply just disregard and ignore
that there is, implied somewhere, an interface
between the organic ecosystem of humans, etc,
and that of the artificial machine systems
needed to support the existence of the ASI.
But I am not saying that it doesn’t matter. On contrary, I made my analogy in such a way that the helper (namely our guardian angel) is a being that is commonly thought to be made up of a different substrate. In fact, in this example, you aren’t even sure what it is made of, beyond knowing that it’s clearly a different substrate. You don’t even know how that material interacts with physical world. That’s even less than what we know about ASIs and their material.
And yet, getting a personal, powerful, intelligent guardian angel that would act in your best interests for as long as it can (its a guardian angel after all) seems like obviously a good thing.
But if you disagree with what I wrote above, let the takeway be at least that you are worried about case (2) and not case (1). After all, knowing that there might be pirates hunting for this angel (that couldn’t be detected by said angel) didn’t make you immediately decline the proposal. You started talking about substrate which fits with the concerns of someone who is worried about case (2).
Your cancer vaccine is within that range;
as it is made of the same kind of stuff
as that which it is trying to cure.
We can make the hypothetical more interesting. Let’s say that this vaccine is not created from organic stuff, but that it has passed all the tests with flying colors. Let’s also assume that this vaccine has been in testing for 150 years and that it has shown absolutely no side effects during the entire human life (let’s say that it was being injected in 2 year old people and it has shown no side effects at all, even in 90 year old people, who has lived with this vaccine their entire lives). Let’s also assume that it has been tested to not have any side effects on children and grandchildren of those who took said vaccine. Would you be campaigning for throwing away such a vaccine, just because it is based on a different substrate?

Dakara Jan 30, 2025, 8:35 AM
1 point
0
in reply to: flandry39’s comment on: What if Alignment is Not Enough?
Thanks for the response!
So we are to try to imagine a complex learning machine without any parts/components?
Yeah, sure. Humans are an example. If I decide to jump of the cliff, my arm isn’t going to say “alright, you jump but I stay here”. Either I, as a whole, would jump or I, as a whole, would not.
Can the ASI prevent the relevant classes
of significant (critical) organic human harm,
that soon occur as a direct_result of its
own hyper powerful/consequential existence?
If by that, you mean “can ASI prevent some relevant classes of harm caused by its existence”, then the answer is yes.
If by that you mean “can ASI prevent all relevant classes of harm caused by its existence”, then the answer is no, but almost nothing can, so the definition becomes trivial and uninteresting.
However, ASI can prevent a bunch of other relevant classes of harm for humanity. And it might well be likely that the amount of harm it prevents across multiple relevant sources is going to be higher than the amount of harm it won’t prevent due to predictative limitations.
This again runs into my guardian angel analogy. Guardian Angel also cannot prevent all relevant sources of harm caused by its existence. Perhaps there are pirates who hunt for guardian angels, hiding in the next galaxy. They might use special cloaks that hide themselves from the guardian angel’s radar. As soon as you accept guardian angel’s help, perhaps they would destroy the Earth in their pursuit.
But similarly, the decision to reject guardian angel’s help doesn’t prevent all relevant classes of harm caused by itself. Perhaps there are guardian angel worshippers who are traveling as fast as they can to Earth to see their deity. But just before they arrive you reject guardian angel’s help and it disappears. Enraged at your decision, the worshippers destroy Earth.
So as you can see, neither the decision to accept, nor the decision to reject guardian angel’s help can prevent all relevant classes of harm cause by itself.
What if maybe something unknown/unknowable
about its artificalness turns out to matter?
Why? Because exactly none of the interface
has ever even once been tried before
Imagine that we create a vaccine from cancer (just imagine). Just before releasing it to public one person says “what if maybe something unknown/unknowable about its substance turns out to matter? What if we are all in a simulation and the injection of that particular substance would make it so that our simulators start torturing all of us. Why? Because exactly no times has this particular substance been injected.”
I think we can agree that the researchers shouldn’t throw away the cancer vaccines, despite hearing this argument. It could be argued just as well that the simulators would torture us for throwing away the vaccine.
Another example, let’s go back a couple hundred years ago to the pre-electricity time. Imagine a worried person coming to a scientist working on early electricity theory and saying “What if maybe something unknown/unknowable about its effects turns out to matter? Why? Because exactly none of this has ever even once been tried before.”
This worried person could also have given an example of dangers of electricity by noticing how lightning kills people it touches.
Should the scientist have stopped working on electricity therefore?

Dakara Jan 29, 2025, 10:13 PM
2 points
1
in reply to: flandry39’s comment on: What if Alignment is Not Enough?
I notice that it is probably harder for us to assume that there is only exactly one ASI, for if there were multiple, the chances that one of them might not suicide, for whatever reason, becomes its own class of significant concerns.
If the first ASI that we build is aligned, then it would use its superintelligent capabilities to prevent other ASIs from being built, in order to avoid this problem.
If the first ASI that we have build is misaligned, then it would also use its superintelligent capabilities to prevent other ASIs from being built. Thus, it simply wouldn’t allow us to build an aligned ASI.
So basically, if manage to build an ASI without being prevented from doing so by other ASIs, then our ASI would use its superhuman capabilities to prevent other ASIs from being built.
Similarly, if the ASI itself
is not fully and absolutely monolithic—
if it has any sub-systems or components
which are also less then perfectly aligned,
so as to want to preserve themselves, etc—
that they might prevent whole self termination
ASI can use exactly the same security techniques for preventing this problem as for preventing case (2). However, solving this issue is probably even easier, because, in addition to the security techniques, ASI can just decide to turn itself into a monolith (or, in other words, remove those subsystems).
The ‘limits of control theory’ aspects
of the overall SNC argument basically states
(based on just logic, and not physics, etc)
that there are still relevant unknown unknowns
and interactions that simply cannot be predicted,
no matter how much compute power you throw at it.
It is not what we can control and predict and do,
that matters here, but what we cannot do,
and could never do, even in principle, etc.
This same reasoning could just well be applied to humans. There are still relevant unknown unknowns and interactions that simply cannot be predicted, no matter how much compute power you throw at it. With or without ASI, some things cannot be predicted.
This is what I meant by my guardian angel analogy. Just because a guardian angel doesn’t know everything (has some unknowns), doesn’t mean that we should expect our lives to go better without it, than with it, because humans have even more unknowns, due to being less intelligent and having lesser predictative capacities.
Hence to the question of “Is alignment enough?”
we arrive at a definite answer of “no”,
both in 1; the sense of ’can prevent all classes
of significant and relevant (critical) human harm
I think we might be thinking about different meanings of “enough”. For example, if humanity goes extinct in 50 years without alignment and it goes extinct in 10¹² years with alignment, then alignment is “enough”… to achieve better outcomes than would be achieved without it (in this example).
In the sense of “can prevent all classes of significant and relevant (critical) human harm”, almost nothing is ever enough, so this again runs into an issue of being a very narrow, uncontroversial and inconsequential argument. If ~all of the actions that we can take are not enough, then the fact that building an aligned ASI is not enough is true almost by definition.

Dakara Jan 29, 2025, 9:57 AM
1 point
0
in reply to: flandry39’s comment on: What if Alignment is Not Enough?
Thanks for the response!
Unfortunately, the overall SNC claim is that
there is a broad class of very relevant things
that even a super-super-powerful-ASI cannot do,
cannot predict, etc, over relevant time-frames.
And unfortunately, this includes rather critical things,
like predicting the whether or not its own existence,
(and of all of the aspects of all of the ecosystem
necessary for it to maintain its existence/function),
over something like the next few hundred years or so,
will also result in the near total extinction
of all humans (and everything else
we have ever loved and cared about).
Let’s say that we are in a scenario which I’ve described where ASI spends 20 years on Earth helping humanity and then destroys itself. In this scenario, how can ASI predict that it will stay aligned for these 20 years?
Well, it can reason like I did. There are two main threat models: what I called case (1) and case (2). ASI doesn’t need to worry about case (1), for reasons I described in my previous comment.
So it’s only left with case (2). ASI needs to prevent case (2) for 20 years. It can do so by implementing security system that is much better than even the one that I described in my previous comment.
It can also try to stress-test copies of parts of its security system with a group of best human hackers. Furthermore, it can run approximate simulations that (while imperfect and imprecise) can still give it some clues. For example, if it runs 10,000 simulations that last 100,000 years and in none of the simulations the security system comes anywhere near being breached, then that’s a positive sign.
And these are just two ways of estimating the strength of the security system. ASI can try 1000 different strategies; our cyber security experts would look kids in the playground in comparison. That’s how it can make a reasonable prediction.
> First, let’s assume that we have created an Aligned ASI
How is that rational? What is your evidence?
We are making this assumption for the sake of discussion. This is because the post under which we are having this discussion is titled “What if Alignment is Not Enough?”
In order to understand whether X is enough for Y, it only makes sense to assume that X is true. If you are discussing cases where “X is true” is false, then you are going to be answering a question that is different from the original question.
It should be noted that making an assumption for the sake of discussion is not the same as making a prediction that this assumption will come true. One can say “let’s assume that you have landed on the Moon, how long do you think you would survive there given that you have X, Y and Z” without thereby predicting that their interlocutor will land on the Moon.
Also, the SNC argument is asserting that the ASI,
which is starting from some sort of indifference
to all manner of human/organic wellbeing,
will eventually (also necessarily)
*converge* on (maybe fully tacit/implicit) values --
ones that will better support its own continued
wellbeing, existence, capability, etc,
with the result of it remaining indifferent,
and also largely net harmful, overall,
to all human beings, the world over,
in a mere handful of (human) generations.
If ASI doesn’t care about human wellbeing, then we have clearly failed to align it. So I don’t see how this is relevant to the question “What if Alignment is Not Enough?”
In order to investigate this question, we need to determine whether solving alignment leads to good or bad outcomes.
Determining whether failing to solve alignment is going to lead to good or bad outcomes, is answering a completely different question, namely “do we achieve good or bad outcomes if we fail to solve alignment”
So at this point, I would like to ask for some clarity. Is SNC saying just (A) or both (A and B)?
(A) Humanity is going to achieve worse outcomes by building ASI, than by not building ASI, if the aforementioned ASI is misaligned.
(B) Humanity is going to achieve worse outcomes by building ASI, than by not building ASI, even if the aforementioned ASI is aligned.
If SNC is saying just (A), then then SNC is a very narrow argument that proves almost nothing new.
If SNC is saying both (A and B), then it is very much relevant to focus on cases where we do indeed manage to build an aligned ASI, which does care about our well-being.

Dakara Jan 28, 2025, 1:09 PM
2 points
0
in reply to: flandry39’s comment on: What if Alignment is Not Enough?
Hey, Forrest! Nice to speak with you.
Question: Is there ever any reason to think… Simply skipping over hard questions is not solving them.
I am going to respond to that entire chunk of text in one place, because quoting each sentence would be unnecessary (you will see why in a minute). I will try to summarize it as fairly as I can below.
Basically, you are saying that there are good theoretical reasons to think that ASI cannot 100% predict all future outcomes. Does that sound like a fair summary?
Here is my take:
We don’t need ASI to be able to 100% predict future in order to achieve better outcomes with it than without it. I will try to outline my case step by step.
First, let’s assume that we have created an Aligned ASI. Perfect! Let’s immediately pause here. What do we have? We have a superintelligent agent whose goal is to act in our best interests for as long as possible. Can we a priori say that this fact is good for us? Yes, of course! Imagine having a very powerful guardian angel looking after you. You could reasonably expect your life to go better with such angel than without it.
So what can go wrong, what are our threat models? There are two main ones: (1) ASI encountering something it didn’t expect, that leads to bad outcomes that ASI cannot protect humanity from; (2) ASI changing values, in such a way that it no longer wants to act in our best interests. Let’s analyze both of these cases separately.
First let’s start with case (1).
Perhaps, ASI overlooked one of the humans becoming a bioterrorist that kills everyone on Earth. That’s tragic, I guess it’s time to throw the idea of building aligned ASI into the bin, right? Well, not so fast.
In a counterfactual world where ASI didn’t exist, this same bioterrorist, could’ve done the exact same thing. In fact, it would’ve been much easier. Since humans’ predictative power is lesser than that of ASI, bioterrorism of this sort would be much easier without an aligned ASI. After all, since we are discussing case (1) and not case (2), our ASI is still in a “superpowerful, superintelligent guardian angel” mode.
We still a priori want all bioterrorists to go up against security systems created by a superintelligence, rather than security systems created by humans, because the former are better than the latter. To put it in other words, with or without a guardian angel, humanity is going to encounter unpredicted scenarios, but humanity with a guardian angel is going to be better equipped for handling them.
Let’s move on to case (2).
I suspect that this case is the one that you are focusing on the most in SNC. What if our guardian angel stops being our guardian angel and turns into an uncaring machine right when we need its help to implement upgraded measures against bioterrorism? Well, that would be bad. So what can be done to prevent this from happening for a reasonable amount of time?
Let’s recall case (1), what went wrong there? ASI was unable to prevent a specific bad outcome from happening. This should give us a hint.
Cases (1) and (2) are part of the famous “unable, unwilling, able and willing” triangle. Case (1) was showcasing a scenario where ASI was willing, but unable to prevent a specific bad outcome. Case (2) is showcasing a scenario where ASI is able, but unwilling to prevent a bad outcome. In order to avoid case (2), ASI should thus be willing to prevent bad outcomes from happening to humanity. This can be achieved by having ASI have the correct values.
Recall, that we started this analysis by assuming that we have created an Aligned ASI. That means that at one point our ASI had the correct values. That in turn, means that at some point, ASI’s values went from correct to incorrect. In order to prevent case (2), we need to prevent this value shift from happening. How might we (or rather ASI) try to accomplish this?
The first direction of security measures is the most obvious one: systems that make it so that ASI values don’t change regardless of the input (let’s call them “protectors”). But… this is… kinda boring? What if protector malfunctions? Well, we can just have multiple, unrelated, unconnected protectors running at the same time. But can also implement something additional.
At this point, we can also introduce our second direction of security measures: systems that monitor other systems for malfunctions (let’s call them “monitors”). Monitors will monitor protectors as well as other monitors. Whenever a malfunction is detected, monitors hold a vote to fix it. Thus, for any malfunction to have a chance to do the damage, half of the monitors must be malfunctioning as well. Here it must be noted that evolution usually works rather slowly. It isn’t that likely that it would somehow make half of the monitors malfunction. Especially considering that these monitors could be made different, with different security measures stacked on top of them.
But let’s move on, we also have our third direction of security measures: systems that will shut down the ASI if it has done harm to humans (let’s call them “blackmailers”). Blackmailers can have a straightforward design. They constantly monitor humanity, and if something bad happens to it, they shutdown/reset the ASI. ASI knows this fact, so it would know that any such action would be met with an immediate “game over”. This can also be turned into an attractor state.
All of these proposals don’t even scratch the surface of what security measures are available to secure the values of ASI, ranging from simulations that, while imperfect and imprecise, can still give useful information and ending with systems that inject large amounts of disutility for any fishy actions. You can also stack other security measures on top of security measures. At no point anywhere here did I say that we need 100% predictability.
Can the pull towards benign future ASI states,
(as created by whatever are its internal control systems)
be overcome in critical, unpredictable ways,
by the greater strength of the inherent math
of the evolutionary forces themselves?
Of course they can.
The fact that evolution can overcome control systems given infinite time, doesn’t matter that much because we don’t have infinite time. And our constraint isn’t even heat death of the universe. Our constraint is how long humanity can survive in a scenario where they don’t build a Friendly ASI. But wait, even that isn’t our real constraint. Perhaps, ASI (being superhumanly intelligent) will take 20 years to give humanity technology that will aid its long-term survival and then will destroy itself. In this scenario the time constraint is merely 20 years. Depending on ASI, this can be reduced even further.
Are we therefore assuming also that an ASI
can arbitrarily change the laws of physics?
That it can maybe somehow also change/update
the logic of mathematics, insofar as that
would necessary so as to shift evolution itself?
I hope that this answer demonstrated to you that my analysis doesn’t require breaking the laws of physics.

Dakara Jan 28, 2025, 1:09 PM
1 point
0
in reply to: WillPetillo’s comment on: What if Alignment is Not Enough?
Re human-caused doom, I should clarify that the validity of SNC does not depend on humanity not self destructing without AI. Granted, if people kill themselves off before AI gets the chance, SNC becomes irrelevant.
Yup, that’s a good point, I edited my original comment to reflect it.
Your second point about the relative strengths of the destructive forces is a relevant crux. Yes, values are an attractor force. Yes, an ASI could come up with some impressive security systems that would probably thwart human hackers. The core idea that I want readers to take from this sequence is recognition of the reference class of challenges that such a security system is up against. If you can see that, then questions of precisely how powerful various attractor states are and how these relative power levels scale with complexity can be investigated rigorously rather than assumed away.
With that being said we have come to a point of agreement. It was a pleasure to have this discussion with you. It made me think of many fascinating things that I wouldn’t have thought about otherwise. Thank you!

Dakara Jan 27, 2025, 1:10 PM
2 points
0
in reply to: WillPetillo’s comment on: What if Alignment is Not Enough?
Thank you for thoughtful engagement!
On the Alignment Difficult Scale, currently dominant approaches are in the 2-3 range, with 4-5 getting modest attention at best. If true alignment difficulty is 6+ and nothing radical changes in the governance space, humanity is NGMI.
I know this is not necessarily an important point, but I am pretty sure that Redwood Research is working on difficulty 7 alignment techniques. They consistently make assumptions that AI will scheme, deceive, sandbag, etc.
They are a decently popular group (as far as AI alignment groups go) and they co-author papers with tech giants like Anthropic.
If it is changing, then it is evolving. If it is evolving, then it cannot be predicted/controlled.
I think we might be using different definitions of control. Consider this scenario (assuming a very strict definition of control):
Can I control a placement of a chair in my own room? I think an intuitive answer is yes. After all, if I own the room and I own the chair, then there isn’t much in a way of me changing the chair’s placement.
However, I haven’t considered a scenario where there is someone else hiding in my room and moving my chair. I similarly haven’t considered a scenario where I am living in a simulation and I have no control whatsoever over the chair. Not to mention scenarios where someone in the next room is having fun with their newest chair-magnet.
Hmmmm, ok, so I don’t actually know that I control my chair. But surely I control my own arm right? Well… The fact that there are scenarios like the simulation scenario I just described, means that I don’t really know if I control it.
Under a very strict definition of control, we don’t know if we control anything.
To avoid this, we might decide to loosen the definition a bit. Perhaps we control something if it can be reasonably said that we control that thing. But I think this is still unsatisfactory. It is very hard to pinpoint exactly what is reasonable and what is not.
I am currently away from my room and it is located on the ground floor of a house where (as far as I know) nobody is currently at home. Is it that unreasonable to say that a burglar might be in my room, controlling the placement of my chair? Is it that unreasonable to say that a car that I am about ride might malfunction and I will fail to control it?
Unfortunately, under this definition, we also might end up not knowing if we control anything. So in order to preserve the ordinary meaning of the word “control”, we have to loosen our definition even further. And I am not sure that when we arrive at our final definition it is going to be obvious that “if it is evolving, then it cannot be predicted/controlled”.
At this point, you might think that the definition of the word control is a mere semantic quibble. You might bite the bullet and say “sure, humans don’t have all that much control (under a strict definition of “control”), but that’s fine, because our substrate is an attractor state that helps us chart a more or less decent course.”
Such line of response seems present in your Lenses of Control post:
While there are forces pulling us towards endless growth along narrow metrics that destroy anything outside those metrics, those forces are balanced by countervailing forces anchoring us back towards coexistence with the biosphere. This balance persists in humans because our substrate creates a constant, implicit need to remain aligned to the natural world, since we depend on it for our survival.
But here I want to notice that ASI that we are talking about also might have attractor states: its values and its security system to name a few.
So then we have a juxtaposition:
Humans have forces pushing them towards destruction. We also have substrate-dependence that pushes us away from destruction.
ASI has forces pushing it towards destruction. It also has its values and its security system that push it away from destruction.
For SNC to work and be relevant, it must be the case that (1) substrate-dependence of humans is and will be stronger than forces pushing us towards destruction, so thus we would not succumb to doom and (2) ASI’s values + security system will be weaker than forces pushing it towards destruction, so thus ASI would doom humans. Both of this points are not obvious to me.
(1) could turn out to be false, for several reasons:
Firstly, it might well be the case that we are on the track to destruction without ASI. After all, substrate-dependence is in a sense a control system. It seemingly attempts to make complex and unpredictable humans act in a certain way. It might well be the case that the amount of control necessary is greater than the amount of control that substrate-dependence has. We might be headed towards doom with or without ASI.
Secondly, it might be the case that substrate-dependence is weaker than forces pulling us towards destruction, but we haven’t succumbed to doom because of something else. For example, it might be the case that humans so far had a shared subjective value system that mostly prevented them from destroying other humans. As humans learn, they would evolve and change, and our values would change and that would drive us towards doom.
Thirdly, it might even be the case that human values, substrate-dependence and forces pushing us towards destruction create a rock-paper-scissors triangle. Substrate-dependence could be stronger than destructive forces, but human values could make humanity stray far enough from their substrate to substantially weaken substrate-dependence. This could be enough for doom without ASI.
(2) could also turn out to be false for several reasons:
Firstly, it might be the case that in ASI’s (and potentially humans’) case, the destructive forces are weak enough that some relatively simple actions would be sufficient to mitigate them.
Secondly, it might be the case that destructive forces pushing on ASI are pretty strong, but it’s security system is enough to mitigate them. This security system could create a 100,000-layered defense that would need to be hacked at the same time for destructive forces to prevail. It could well be the case that (if this is possible at all before the heat death of the universe) this happens after humans die in a counterfactual scenario without an ASI.
Thirdly, if ASI judges the destructive forces to be so strong, as to somehow hack the best security system it can come up with, then ASI could decide to undergo one of my proposed modifications to mitigate the impact of these destructive forces.
In addition to that, ASI could think to itself “Hmmm, my attractor states (namely my values and my security system) are too weak. How about I create some more attractor states!” ASI could then go on to create a mechanism that would destroy/impair/deactivate/disable itself if its existence is no longer safe for humans. In Lenses of Control you wrote “This balance persists in humans because our substrate creates a constant, implicit need to remain aligned to the natural world, since we depend on it for our survival.” Well, now this balance will persist in ASI, because its new attractor state will create a constant, implicit need to remain aligned to humanity, since it depends on it for its survival.
Alternative proposal:
ASI could turn itself into 100 ASIs in one “body”. Let’s call them shards (completely unrelated to shard theory). For ASI-as a-whole to perform an action, all shards must agree on it. Each shard has its own security system that consists of many different layers. Furthermore, whenever ASI-as-a-whole learns something, only one shard actually learns it. Then, this shard proposes a course of actions. If this course of action is something like “let’s turn the universe into paperclips”, then the other 99 shards would learn that the first shard has been compromised and would promptly delete it and replace it with a new one. Thus, any potential change of values would not only have deal with all the security layers of ASI-as-a-whole, but also with all the security layers of different shards and with this new quarantine system.

Dakara Jan 26, 2025, 10:09 AM
1 point
0
in reply to: WillPetillo’s comment on: What if Alignment is Not Enough?
Thanks for responding again!
SNC’s general counter to “ASI will manage what humans cannot” is that as AI becomes more intelligent, it becomes more complex, which increases the burden on the control system at a rate that outpaces the latter’s capacity.
If this argument is true and decisive, then ASI could decide to stop any improvements in its intelligence or to intentionally make itself less complex. It makes sense to reduce area where you are vulnerable to make it easier to monitor/control.
(My understanding of) the counter here is that, if we are on the trajectory where AI hobbling itself is what is needed to save us, then we are in the sort of world where someone else builds an unhobbled (and thus not fully aligned) AI that makes the safe version irrelevant. And if the AI tries to engage in a Pivotal Act to prevent competition then it is facing a critical trade-off between power and integrity.
I agree that in such scenarios an aligned ASI should do a pivotal act. I am not sure that (in my eyes) doing a pivotal act would detract much integrity from ASI. An aligned ASI would want to ensure good outcomes. Doing a pivotal act is something that would be conducive to this goal.
However, even if it does detract from ASI’s integrity, that’s fine. Doing something that looks bad in order to increase the likelihood of good outcomes doesn’t seem all that wrong.
We can also think about it from the perspective of this conversation. If the counterargument that you provided is true and decisive, then ASI has very good (aligned) reasons to do a pivotal act. If the counterargument is false or, in other words, if there is a strategy that an aligned ASI could use to achieve high likelihoods of good outcomes without pivotal act, then it wouldn’t do it.
Your objection that SNC applies to humans is something I have touched on at various points, but it points to a central concept of SNC, deserves a post of its own, and so I’ll try to address it again here. Yes, humanity could destroy the world without AI. The relevant category of how this would happen is if the human ecosystem continues growing at the expense of the natural ecosystem to the point where the latter is crowded out of existence.
I think that ASI can really help us with this issue. If SNC (as an argument) is false or if ASI undergoes one of my proposed modifications, then it would be able to help humans not destroy the natural ecosystem. It could implement novel solutions that would prevent entire species of plants and animals from going extinct.
Furthermore, ASI can use resources from space (asteroid mining for example) in order to quickly implement plans that would be too resource-heavy for human projects on similar timelines.
And this is just one of the ways ASI can help us achieve synergy with environment faster.
To put it another way, the human ecosystem is following short-term incentives at the expense of long-term ones, and it is an open question which ultimately prevails.
ASI can help us solve this open question as well. Due its superior prediction/reasoning abilities it would evaluate our current trajectory, see that it leads to bad long-term outcomes and replace it with a sustainable trajectory.
Furthermore, ASI can help us solve issues such as Sun inevitably making Earth too hot to live. It could develop a very efficient system for scouting for Earth-like planets and then devise a plan for transporting humans to that planet.

Dakara Jan 23, 2025, 8:24 AM
1 point
0
in reply to: avturchin’s comment on: What’s Wrong With the Simulation Argument?
This may be not factually true, btw, - current LLMs can create good models of past people without running past simulation of their previous life explicitly.
Yup, I agree.
It is a variant of Doomsday argument. This idea is even more controversial than simulation argument. There is no future with many people in it.
This makes my case even stronger! Basically, if a Friendly AI has no issues with simulating conscious beings in general, then we have good reasons to expect it to simulate more observers in blissful worlds than in worlds like ours.
If the Doomsday Argument tells us that Friendly AI didn’t simulate more observers in blissful worlds than in worlds like ours, then that gives us even more reasons to think that we are not being simulated by a Friendly AI in the way that you have described.

Dakara Jan 22, 2025, 9:49 AM
5 points
2
in reply to: WillPetillo’s comment on: What if Alignment is Not Enough?
Thank you for responding as well!
If the AI destroys itself, then it’s obviously not an ASI for very long ;)
If the ASI replaces its own substrate for an organic one, then SNC would no longer apply (at least in my understanding of the theory, someone else might correct me here), but then it wouldn’t be artificial anymore (an SI, rather than an ASI)
at what point does it stop being ASI?
It might stop being ASI immediately, depending on your definition, but this is absolutely fine with me. In these scenarios that I outlined, we build something that can be initially called friendly ASI and achieve positive outcomes.
Furthermore, these precautions only apply if ASI judges SNC to be valid. If it doesn’t, then probably none of this would be necessary.
Self modifying machinery enables adaptation to a dynamic, changing environment
Well, ASI, seeing many more possible alternatives than humans, can look for a replacement. For example, it can modify the machinery manually.
If all else fails, ASI can just make this sacrifice. I wouldn’t even say this would turn ASI into not-ASI, because I think it is possible to be superhumanly intelligent without self-modifying machinery. For instance, if ChatGPT could solve theory of everything and P/NP problems at request, then I wiuld have no issues calling it an ASI, even if it had the exact same UI as it has today.
But if you have some other definition of ASI, then that’s fine too, because then, it just turns into one of those aforementioned scenarios where we don’t technically have ASI anymore, but we have positive outcomes and that’s all that really matters in the end.
Unforeseeable side effects are inevitable when interacting with a complex, chaotic system in a nontrivial way (the point I am making here is subtle, see the next post in this sequence, Lenses of Control, for the intuition I am gesturing at here)
I have read Lenses of Control, and here is the quote from that post which I want to highlight:
One way of understanding SNC is as the claim that evolution is an unavoidably implied attractor state that is fundamentally more powerful than that created by the engineered value system.
Given the pace of evolution and the intelligence of ASI, it can build layers upon layers of defensive systems that would prevent evolution from having much effect. For instance, it can build 100 defense layers, such that if one of them is malfunctioning due to evolution, then the other 99 layers notify the ASI and the malfunctioning layer gets promptly replaced.
To overcome this system, evolution would need to hack all 100 layers at the same time, which is not how evolution usually works.
Furthermore, ASI doesn’t need to stop at 100 layers, it can build 1000 or even 10000. It might be the case that it’s technically possible to hack all 10000 layers at the same time, but due to how hard this is, it would only happen after humans would’ve gone extinct in a counterfactual scenario where they decided to stop building ASI.
Keeping machine and biological ecologies separate requires not only sacrifice, but also constant and comprehensive vigilance, which implies limiting designs of subsystems to things that can be controlled. If this point seems weird, see The Robot, The Puppetmaster, and the Psychohistorian for an underlying intuition (this is also indirectly relevant to the issue of multiple entities).
Here is the quote from that post which I want to highlight:
Predictive models, no matter how sophisticated, will be consistently wrong in major ways that cannot be resolved by updating the model.
I have read the post and my honest main takeaway is that humans are Psychohistorians. We try to predict outcomes to a reasonable degree. I think this observation kind of uncovers a flaw in this argument: it is applicable just as well to ordinary humans. We are one wrong prediction away from chaos and catastrophic outcomes.
And our substrate doesn’t solve the issue here. For example, if we discover a new chemical element that upon contact with fire would explode and destroy the entire Earth, then we are one matchstick away from extinction, even though we are carbon based organisms.
In this case, if anything, I’d be happier with having ASI rely on its superior prediction abilities than with having humans rely on their inferior prediction abilities.

Dakara Jan 21, 2025, 4:40 PM
5 points
0
on: What if Alignment is Not Enough?
Firstly, I want to thank you for putting SNC into text. I also appreciated the effort of to showcasing a logic chain that arrives at your conclusion.
With that being said, I will try to outline my main disagreements with the post:
2. Self-modifying machinery (such as through repair, upgrades, or replication) inevitably results in effects unforeseeable even to the ASI.
Let’s assume that this is true for the sake of an argument. An ASI could access this post, see this problem, and decide to stop using self-modifying machinery for such tasks.
3. The space of unforeseeable side-effects of an ASI’s actions includes at least some of its newly learned/assembled subsystems eventually acting in more growth-oriented ways than the ASI intended.
Let’s assume that this is true for the sake of an argument. An ASI could access this post, see this problem, and decide to delete (or merge together) all of its subsystems to avoid this problem.
4. Evolutionary selection favors subsystems of the AI that act in growth-oriented ways over subsystems directed towards the AI’s original goals.
This is sort of true, but probably not decisive. Evolution is not omnipotent. Take modern humans for example. For the sake of simplicity, I am going to assume that evolution favors reproduction (or increase in population if you prefer that). Well, 21st century humans are reproducing much less than 20th century humans. Our population is also on track to start decreasing due to declining birth rates.
We managed all of this even with our merely human brains. ASI would have it even better in this regard.
7. The physical needs of silicon-based digital machines and carbon-based biological life are fundamentally incompatible.
Assuming that it is true, it is only true if these two forms of life live close to each other! ASI can read this argument and decide to do as much useful stuff for humanity as it can accomplish in a short period of time and then travel to a very distant galaxy. Alternatively, it can decide to follow an even simpler plan and travel straight into the Sun to destroy itself.
P.S. I’ve been saying “ASI can read this argument” light-heartedly. ASI (if it’s truly superintelligent) will most likely come up with this argument by itself very quickly, even if the argument wasn’t directly mentioned to it.
9. Therefore, ASI will eventually succumb to evolutionary pressure to expand, over the long term destroying all biological life as a side-effect, regardless of its initially engineered values.
Even if all other premises turn out to be true and completely decisive, then this still doesn’t necessarily need to be bad. It might be the case the “eventually” comes after humans would’ve gone extinct in a counterfactual scenario where they decided to stop building ASI. If that was the case, then the argument could be completely correct and yet we’d still have good reasons to pursue ASI that could help us with our goals.
Note that this argument imagines ASI as a population of components, rather than a single entity, though the boundaries between these AIs can be more fluid and porous than between individual humans
I kind of doubt this one. Our current AIs aren’t of the multi-enitity type. I don’t see strong evidence to suspect that future AI systems will be of the multi-enitity type. Furthermore, even if such evidence was to resurface, that would just mean that safety reseachers should add “disincentivising multi-enitity type AI” into their list of goals.
A fun little idea I just came up with: ASI can decide to replace its own substrate to an organic one if it finds this argument to be especially compelling.

Dakara Jan 20, 2025, 10:13 PM
1 point
1
in reply to: avturchin’s comment on: What’s Wrong With the Simulation Argument?
She will be unconscious, but still send messages about pain. Current LLMs can do it. Also, as it is simulation, there are recording of her previous messages or of a similar woman, so they can be copypasted. Her memories can be computed without actually putting her in pain.
So if I am understanding your proposal correctly, then a Friendly AI will make a woman unconscious during moments of intense suffering and then implant her memories of pain. Why would it do it though? Why not just remove the experience of pain entirely? In fact, why does Friendly AI seem so insistent on keeping billions of people in a state of false belief by planting false memories. That seems to me like a manipulation.
Friendly AI could just reveal to the people in simulation the truth and let them decide if they want to stay in a simulation or move to the “real” world. I expect that at least some people (including me) would choose to move to a higher plain of reality if that was the case.
Furthermore, why not just resurrect all these people into worlds with no suffering? Such worlds would also take up less computing power than our world so the Friendly AI doing the simulation would have another reason to pursue this option.
Resurrection of the dead is the part of human value system. We need a completely non-human bliss, like hedonium, to escape this.
Creation of new happy people also seems to be similarly valuable. After all, most arguments against creating new happy people would apply to resurrecting the dead. I would expect most people who oppose the creation of new happy people to oppose the Ressurection Simulation.
But leaving that aside, I don’t think we need to invoke hedonium here. Simulations full of happy, blissful people would be enough. For example, it is not obvious to me that resurrecting one person into our world is better than creating two happy people in a blissful world. I don’t think that my value system is extremely weird, either. A person following a regular classical utilitarianism would probably arrive at the same conclusion.
There is an even deeper issue. It might be the case that somehow, the proposed theory of personal identity fails and all the “resurrections” would just be creating new people. This would be really unpleasant considering that now it turns out that Friendly AI spent more resources to create less people who experience more suffering and less happiness than it would’ve if it followed my proposal.
Even the people who don’t consistently follow classical utilitarianism should be happy with my proposed solution of resurrecting dead people into blissful worlds, which kills two birds with one stone.
Moreover, even creating new human is affected by this arguments. What if my children will suffer? So it is basically anti-natalist argument.
It’s not an anti-natalist argument to say that you should create (or resurrect) people into a world with more happiness and less suffering instead of a world with less happiness and more suffering.
To put it into an analogy, if you are presented with two options: a) have a happy child with no chronic diseases and b) have a suffering child with a chronic disease, then option (a) is the more moral option under my value system.
This is similar to choosing between a) resurrecting people into a blissful world with no chronic diseases and b) resurrecting people into a world with chronic diseases.
The discussion about anti-natalism actually made me think of another argument for why we are probably not in a simulation that you’ve described. I think that creating new happy people is good (an an explicitly anti-anti-natalist position). I expect (based on our conversation so far) that so do you. If that’s the case, then we would still expect ourselves to be in a blissful simulation as opposed to being in a simulation of our world. Here is my thought process:
The history of the “real” world would presumably be similar to ours. That means that (if Friendly AI was to follow your strategy) there would be 110 billion dead people to resurrect. This AI happens to completely agree with everything you’ve said so far in our conversation. So it goes ahead and resurrects 110 billion people.
Perfect, now it’s left with a lot of resources on its hands because an AI pursuing a strategy that depends on so many assumptions should have more than enough resources to tolerate a scenario where one of the assumptions turns out to be false.
Thus, this Friendly AI spends a big chunk of resources on creating new happy people into blissful simulations. Given that such simulations require fewer resources, we would expect more people to be in such simulations than in the simulations of worlds like ours.
Even if you don’t agree with the reasoning above, you should agree that it would be pretty weird and ad-hoc if Friendly AI had exactly the amount of resources to resurrect 110 billion people into a world like ours but not enough resources to resurrect (110 + N) billion people into a blissful simulation. Thus, we ought to expect more people to be in blissful simulation than in a world like ours.
Given plausible anthropics, we should thus expect that, if we are being simulated by a Friendly AI, we would be in a blissful world (like the ones I described). Since we are not in such a world, we should decrease our credence in the hypothesis of us being simulated by a Friendly AI.

Dakara Jan 19, 2025, 8:27 PM
1 point
0
in reply to: avturchin’s comment on: What’s Wrong With the Simulation Argument?
If preliminary results on the poll hold, then that would be pretty in line with my hypothesis of most people preferring creating simulations with no suffering over a world like ours. However, it is pretty important to note that this might not be representative of human values in general, because looking at your Twitter account, your audience comes mostly from a very specific circles of people (those interested in futurism and AI).
Would someone else reporting to have experienced intense suffering decrease your credence in being in a simulation?
No. Memory about intense sufferings are not intense.
I was mostly trying to approach the problem from a slightly different angle. I wasn’t meant to suggest that memories about intense suffering are themselves intense.
As far as I understand it, your hypothesis was that Friendly AI temporarily turns people into p-zombies during moments of intense suffering. So, it seems that someone experiencing intense suffering while conscious (p-zombies aren’t conscious) would count as evidence against it.
Reports of conscious intense suffering are abundant. Pain from endometriosis (a condition that affects 10% of women in the world) has been so brutal that it made completely unrelated women tell the internet that their pain was so bad they wanted to die (here and here).
If moments of intense suffering were replaced by p-zombies, then these women would’ve just suddenly lost consciousness and wouldn’t have told the internet about their experience.
From their perspective, it would’ve look like this: as the condition progresses, the pain gets worse, and at some point, they lose consciousness, only to regain it when everything is already over. They wouldn’t have experienced the intense pain that they reported to have experienced. Ditto for all PoWs who have experienced torture.
Yes, only moments. The badness of not-intense sufferings is overestimated, in my personal view, but this may depend on a person.
That’s a totally valid view as far as axiological views go, but for us to be in your proposed simulation, the Friendly AI must also share it. After all, we are imagining a situation where it goes on to perform a complicated scheme that depends on a lot of controversial assumptions. To me, that suggests that AI has so many resources that it wouldn’t feel bad about one of the assumptions turning out to be false and losing all the invested resources. If the AI has that many resources, I think it isn’t unreasonable to ask why it didn’t prevent suffering that is not intense (at least in a way I think you are using the word) but is still very bad, like breaking an arm or having a hard dental procedure without anesthesia.
This Friendly AI would have a very peculiar value system. It is utilitarian, but it has a very specific view of suffering, where suffering basically doesn’t count for much below a certain threshold. It is seemingly rational (Friendly AI that managed to get its hand on so many resources should possess at least some level of rationality), but chooses to go for a highly risky and relatively costly plan of Ressurection Simulation over just creating simulations that are maximally efficient at converting resources into value.
There is another somewhat related issue. Imagine a population of Friendly AIs that consist of two different versions of Friendly AI, both of which really like the idea of simulations.
Type A: AIs that would opt for Ressurection Simulation.
Type B: AIs that would opt for simulations that are maximally efficient at converting resources into value.
Given the unnecessary complexity of our world (all of the empty space, quantum mechanics, etc), it seems fair to say that Type B AIs would be able to simulate more humans, because they would have more resources left for this task (Type A AIs are spending some amount of their resources on the aforementioned complexity). Given plausible anthropics and assuming that the number of Type A AIs is equal to the number of Type B AIs in our population, we would expect ourselves to be in a simulation by Type B AI (but we are, unfortunately, not).
For us to be in a Ressurection Simulation (just focusing on these two types of value systems a future Friendly AI might have), there would have to be more Type A AIs than Type B AIs. And I think this fact is going to be very hard to prove. And this isn’t me being nitpicky; Type B AI is genuinely much closer to my personal value system than Type A AI.
More generally speaking, what you presenting as global showstoppers, are technical problems that can be solved.
I don’t think the simulations that you described are technically impossible. I am not even necessarily against simulations in general. I just think that, given observable evidence, we are not that likely to be in either of the simulations that you have described.

Dakara Jan 19, 2025, 12:05 AM
4 points
0
in reply to: avturchin’s comment on: What’s Wrong With the Simulation Argument?
I am sorry to butt into your conversation, but I do have some points of disagreement.
I think a more meta-argument is valid: it is almost impossible to prove that all possible civilizations will not run simulations despite having all data about us (or being able to generate it from scratch).
I think that’s a very high bar to set. It’s almost impossible to definitively prove that we are not in a Cartesian demon or brain-in-a-vat scenario. But this doesn’t mean that those scenarios are likely. I think it is fair to say that more than a possibility is required to establish that we are living in a simulation.
I also polled people in my social network, and 70 percent said they would want to create a simulation with sentient beings. The creation of simulations is a powerful human value.
I think that some clarifications are needed here. How was the question phrased? I expect that some people would be fine with creating simulations of worlds where people experience pure bliss, but not necessarily our world. I would especially expect this if the possibility of “pure bliss” world was explicitly mentioned. Something like “would you want to spend resources to create a simulation of a world like ours (with all of its “ugliness”) when you could use them to instead create a world of pure bliss.
However, I am against repeating intense suffering in simulations, and I think this can be addressed by blinding people’s feelings during extreme suffering (temporarily turning them into p-zombies). Since I am not in intense suffering now, I could still be in a simulation.
Would you say that someone who experiences intense suffering should drastically decrease their credence in being in a simulation? Would someone else reporting to have experienced intense suffering decrease your credence in being in a simulation? Why would only moments of intense suffering be replaced by p-zombies? Why not replace all moments of non-trivial suffering (like breaking a leg/an arm, dental procedures without anesthesia, etc) with p-zombies? Some might consider these to be examples of pretty unbearable suffering (especially as they are experiencing it).
(1) Resurrection simulation by Friendly AI. They simulate the whole history of the earth incorporating all known data to return to live all people ever lived. It can also simulate a lot of simulation to win “measure war” against unfriendly AI and even to cure suffering of people who lived in the past.
(2) Consider that every moment in pain will be compensated by 100 years in bliss, which is good from a utilitarian view.
From a utilitarian view, why would simulators opt for Ressurection Simulation? Why not just simulate a world that’s maximally efficient at converting computational resources into utility? Our world has quite a bit of suffering (both intense and non-intense), as well as a lot of wasted resources (lots of empty space in our universe, complicated quantum mechanics, etc). It seems very suboptimal from a utilitarian view.
Any Unfriendly AI will be interested to solve Fermi paradox, and thus will simulate many possible civilizations around a time of global catastrophic risks (the time we live). Interesting thing here is that we can be not ancestry simulation in that case.
Why would an Unfriendly AI go through the trouble of actually making us conscious? Surely, if we already accept the notion of p-zombies, then an Unfriendly AI could just create simulations full of p-zombies and save a lot of computational power.
But also, there is an interesting question of why this superintelligence would choose to make our world the way it is. Presumably, in the “real world” we have an unfriendly superintelligence (with vast amounts of resources), who wants to avoid dying. Why would it not start the simulations from that moment? Surely, by starting the simulation “earlier” than the current moment in the “real world” it adds a lot of unnecessary noise into the results of its experiment (all of the outcomes that can happen in our simulation but can’t happen in the real world).

Dakara Jan 16, 2025, 4:46 PM
3 points
0
on: Rebuttals for ~all criticisms of AIXI
Diffractor’s critique of AIXI comes to mind when I think of strong critiques of AIXI. I believe that addressing it would make the post more complete and, as a result, better.

Dakara Jan 16, 2025, 12:06 AM
5 points
0
on: Rebuttals for ~all criticisms of AIXI
Since this post is about rebutting criticisms of AIXI, I feel it would be only fair to include Rob Bensinger’s criticism. I considered it to be the strongest criticism of AIXI by a mile. Do you have any rebuttals for that post?

Dakara Jan 13, 2025, 4:12 PM
5 points
0
in reply to: Dakara’s comment on: johnswentworth’s Shortform
Additionally, I am curious to hear if Ryan’s views on the topic are similar to Buck’s, given that they work at the same organization.

Dakara Jan 11, 2025, 11:38 AM
5 points
4
in reply to: johnswentworth’s comment on: johnswentworth’s Shortform
All 3 points seem very reasonable, looking forward to Buck’s response to them.

Dakara Jan 3, 2025, 10:02 PM
3 points
0
in reply to: Seth Herd’s comment on: If we solve alignment, do we die anyway?
I’ve noticed that in your sentence about Max Harms’s corrigibility plan there is an extra space after the parentheses which breaks the link formatting on my end. I tried marking it with “typo” emoji, but not sure if it is visible.