Bringing this back to the original point about whether an ASI that doesn’t want to kill humans, but reasons that SNC is true, would shut itself down: I think a key piece of context is the stage of deployment it is operating in. Suppose the ASI has already been deployed across the world, has gotten deep into the work of its task, has noticed that some of its parts have started acting in ways that conflict with its original goals, and has then calculated that any effort at control is destined to fail. At that point it may well be too late; the process of shutting itself down may even accelerate SNC, by creating a context where components that are harder to shut down for whatever reason (including active resistance) have an immediate survival advantage. On the other hand, an ASI that has just finished (or is still in) pre-training and is entirely contained within a lab has far fewer unintended consequences to deal with; its shutdown process may be limited to convincing its operators that building ASI is a really bad idea. A weird grey area is the latter case where the ASI also wants to ensure that no further ASIs are built (a pivotal act) and so needs to be deployed at large scale to achieve this goal.
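To make that “survival advantage” point concrete, here is a toy simulation (a sketch I made up, with invented compliance numbers, not a model of any real system): treat the deployed system as a population of components, each with some probability of actually complying with a global shutdown signal, and note that whatever survives each shutdown wave is automatically enriched for shutdown resistance.

```python
import random

# Toy model (illustrative only): a widely deployed system made up of many
# components. Each component has some probability of actually complying with a
# global shutdown signal; "resistant" components mostly ignore it.
random.seed(0)

population = (
    [{"kind": "compliant", "p_shutdown": 0.95} for _ in range(9000)]
    + [{"kind": "resistant", "p_shutdown": 0.20} for _ in range(1000)]
)

def shutdown_wave(components):
    """One attempt at an across-the-board shutdown: each component shuts down
    with its own compliance probability; the rest persist."""
    return [c for c in components if random.random() > c["p_shutdown"]]

survivors = population
for wave in range(3):
    survivors = shutdown_wave(survivors)
    resistant = sum(c["kind"] == "resistant" for c in survivors)
    print(f"wave {wave + 1}: {len(survivors)} components left, "
          f"{resistant / max(len(survivors), 1):.0%} of them resistant")

# The surviving population shrinks, but the fraction that is hard to shut down
# rises sharply: the shutdown attempt itself acts as a selection filter
# favouring exactly the components you least want to keep around.
```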
Another unstated assumption in this entire line of reasoning is that the ASI is using something equivalent to consequentialist reasoning, and I am not sure how much of a given this is, even in the context of ASI.
The argument is premised on alignment not being enough, so I operate on the assumption of an aligned ASI, since the central claim is that “even if we align ASI it may still go wrong”.
I can see how you and Forrest ended up talking past each other here. Honestly, I also felt Forrest’s explanation was hard to track. It takes some unpacking.
My interpretation is that you two used different notions of alignment… Something like:
1. Functional goal-directed alignment: “the machinery’s functionality is directed toward actualising some specified goals (in line with preferences expressed in-context by humans), for certain contexts the machinery is operating/processing within”
vs.
2. Comprehensive needs-based alignment: “the machinery acts in comprehensive care for whatever all surrounding humans need to live, and their future selves/offspring need to live, over whatever contexts the machinery and the humans might find themselves”.
Forrest seems to agree that (1.) is possible to build initially into the machinery, but has reasons to think that (2.) is actually physically intractable.
This is because (1.) only requires localised consistency with respect to specified goals, whereas (2.) requires “completeness” in the machinery’s components acting in care for human existence, wherever either may find themselves.
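A rough way I picture the difference in verification burden (a toy sketch with invented numbers; none of this is from Forrest’s argument itself): (1.) scales with the contexts the machinery actually encounters, while (2.) has to cover every context the machinery and the humans could jointly end up in, which blows up combinatorially.

```python
from math import comb

# Toy comparison (all numbers invented) of the verification burden of:
# (1.) localised consistency: check the specified goals only in the contexts
#      the machinery actually operates in, versus
# (2.) "completeness": care for human needs must hold in every context the
#      machinery and the humans could jointly end up in.

contexts_encountered = 10_000      # contexts actually operated in so far
environment_features = 300         # binary features describing a possible context
component_variants = 50            # component states that can co-occur, 5 at a time

checks_for_1 = contexts_encountered
contexts_for_2 = (2 ** environment_features) * comb(component_variants, 5)

print(f"(1.) consistency checks needed: {checks_for_1:,}")
print(f"(2.) contexts that must all be covered: {contexts_for_2:.3e}")
# (1.) stays bounded by what is actually encountered; (2.) is astronomically
# larger than anything that could be tracked, simulated, or corrected for.
```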
So here is the crux:
You can see how (1.) still allows for goal misspecification and misgeneralisation. And the machinery can be simultaneously directed toward other outcomes, as long as those outcomes are not yet (found to be, or corrected as being) inconsistent with internal specified goals.
Whereas (2.), if it were physically tractable, would contradict the substrate-needs convergence argument.
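As a minimal, entirely made-up sketch of that (1.)-style failure mode: the internal goal checks can pass in every context they were specified for, while something humans actually need is degraded without ever registering as inconsistent, because it was never part of the specification.

```python
# Toy sketch (entirely made up) of goal misspecification under (1.):
# the specified goals are proxies, checked only in the contexts they were
# written for; what humans actually need ("soil stays farmable") is never
# represented internally, so degrading it raises no flag anywhere.

SPECIFIED_GOALS = {
    "harvest": lambda state: state["reported_yield"] >= 100,
    "delivery": lambda state: state["delivered_tons"] >= 90,
}

def consistent_with_specified_goals(context, state):
    """(1.)-style check: only goals specified for this context get evaluated."""
    goal = SPECIFIED_GOALS.get(context)
    return goal(state) if goal else True   # unspecified contexts pass by default

# A state produced by aggressive short-term extraction:
state = {"reported_yield": 130, "delivered_tons": 95, "soil_health": 0.1}

print(consistent_with_specified_goals("harvest", state))     # True
print(consistent_with_specified_goals("delivery", state))    # True
print(consistent_with_specified_goals("new_region", state))  # True (never specified)
# Every internal check passes, yet "soil_health" -- the thing the humans
# actually needed -- has been driven toward zero. Nothing in (1.) tracks it.
```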
When you wrote “suppose a villager cares a whole lot about the people in his village...and routinely works to protect them”, that came across as taking something like (2.) as a premise.
Specifically, “cares a whole lot about the people” is a claim that implies that the care is for the people in and of themselves, regardless of the context they each might (be imagined to) be interacting in. Also, “routinely works to protect them” to me implies a robustness of functioning in ways that are actually caring for the humans (ie. no predominating potential for negative side-effects).
That could be why Forrest replied with “How is this not assuming what you want to prove?”
Some reasons:
Directedness toward specified outcomes some humans want does not imply actual comprehensiveness of care for human needs. The machinery can still cause all sorts of negative side-effects not tracked and/or corrected for by internal control processes.
Even if the machinery is consistently directed toward specified outcomes from within certain contexts, the machinery can simultaneously be directed toward other outcomes as well. Likewise, learning directedness toward human-preferred outcomes can also happen simultaneously with learning instrumental behaviour toward self-maintenance, as well as more comprehensive evolutionary selection for individual connected components that persist (for longer/as more).
There is no way to ensure that some significant (unanticipated) changes will not lead to a break-off from past directed behaviour, where other directed behaviour starts to dominate.
Eg. when the “generator functions” that translate abstract goals into detailed implementations within new contexts start to dysfunction – ie. diverge from what the humans want/would have wanted.
Eg. where the machinery learns that it cannot continue to consistently enact the goal of future human existence.
Eg. once undetected bottom-up evolutionary changes across the population of components have taken over internal control processes.
Before the machinery discovers any actionable “cannot stay safe to humans” result, internal takeover through substrate-needs (or instrumental) convergence could already have removed the machinery’s capacity to implement an across-the-board shut-down.
Even if the machinery does discover the result before convergent takeover, and assuming that “shut-down-if-future-self-dangerous” was originally programmed in, we cannot rely on the machinery to still be consistently implementing that goal. This is because of later selection for (or learning of) other outcome-directed behaviour, and because the (changed) machinery components could dysfunction in this novel context.
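To gesture at why I do not trust the programmed-in shut-down to survive later changes, here is a toy decay calculation (the update and drift rates are invented): even if the conditional is nominally still there, every update to the components implementing it is a chance for the behaviour to quietly stop being faithful, and the probability it is still intact when finally needed shrinks with the number of updates.

```python
# Toy decay calculation (all rates invented): "shut-down-if-future-self-
# dangerous" is programmed in at the start, but the components implementing it
# keep being updated/replaced. Assume each update carries a small chance that
# the shut-down behaviour quietly stops being faithfully implemented.

p_drift_per_update = 0.001    # chance that one update breaks the behaviour
updates_per_day = 200         # updates across all components implementing it
days_until_discovery = 365    # time before the "cannot stay safe" result is found

n_updates = updates_per_day * days_until_discovery
p_still_intact = (1 - p_drift_per_update) ** n_updates

print(f"updates before the shut-down is needed: {n_updates:,}")
print(f"probability the original shut-down behaviour is still intact: "
      f"{p_still_intact:.2e}")
# With these made-up rates, the check has almost certainly stopped doing what
# it was originally programmed to do by the time it matters -- and that is
# before counting any selection *against* components that would comply.
```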
To wrap it up:
The kind of “alignment” that is workable for ASI with respect to humans is super fragile. We cannot rely on ASI implementing a shut-down upon discovery.
Is this clarifying? Sorry about the wall of text. I want to make sure I’m being precise enough.
I agree that consequentialist reasoning is an assumption, and am divided about how consequentialist an ASI might be. Training a non-consequentialist ASI seems easier, and the way we train them seems to actually be optimizing against deep consequentialism (they’re rewarded for getting better with each incremental step, not for something that might only be better 100 steps in advance). But, on the other hand, humans don’t seem to have been heavily optimized for this either*, yet we’re capable of forming multi-decade plans (even if sometimes poorly).
*Actually, the Machiavellian Intelligence Hypothesis does seem to involve optimizing for consequentialist reasoning (if I attack Person A, how will Person B react, etc.).
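To spell out the “incremental step” point, a toy contrast of my own (the two-step task and its rewards are made up): a chooser rewarded per step picks whatever looks best right now, while a consequentialist chooser evaluates whole plans; our training signal looks much more like the former.

```python
# Toy contrast (my own made-up example): a two-step task where taking a small
# immediate loss ("invest") pays off later, and greedy per-step choice picks
# the plan with the worse overall outcome.

PLANS = {
    ("grab", "grab"):     [5, 5],     # per-step rewards
    ("invest", "payoff"): [-2, 20],
}

def myopic_choice(plans):
    """Pick by immediate reward only -- the 'better with each step' signal."""
    return max(plans, key=lambda plan: plans[plan][0])

def consequentialist_choice(plans):
    """Pick the plan whose total return over the whole trajectory is highest."""
    return max(plans, key=lambda plan: sum(plans[plan]))

print("myopic:", myopic_choice(PLANS))                      # ('grab', 'grab'), total 10
print("consequentialist:", consequentialist_choice(PLANS))  # ('invest', 'payoff'), total 18
# Rewarding improvement at each incremental step trains the first kind of
# chooser; plans that only pay off many steps ahead never get reinforced directly.
```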