The argument is premised on alignment not being enough, so I operate on the assumption of an aligned ASI, since the central claim is that “even if we align ASI, it may still go wrong”.
I can see how you and Forrest ended up talking past each other here. Honestly, I also felt Forrest’s explanation was hard to track. It takes some unpacking.
My interpretation is that you two used different notions of alignment… Something like:
1. Functional goal-directed alignment: “the machinery’s functionality is directed toward actualising some specified goals (in line with preferences expressed in-context by humans), for certain contexts the machinery is operating/processing within”; vs.
2. Comprehensive needs-based alignment: “the machinery acts in comprehensive care for whatever all surrounding humans need to live, and their future selves/offspring need to live, over whatever contexts the machinery and the humans might find themselves in”.
Forrest seems to agree that (1.) can initially be built into the machinery, but has reasons to think that (2.) is actually physically intractable.
This is because (1.) only requires localised consistency with respect to specified goals, whereas (2.) requires “completeness” in the machinery’s components acting in care for human existence, wherever either may find themselves.
So here is the crux:
You can see how (1.) still allows for goal misspecification and misgeneralisation. And the machinery can simultaneously be directed toward other outcomes, as long as those outcomes are not yet found to be (or corrected as being) inconsistent with the internally specified goals.
Whereas (2.), if it were physically tractable, would contradict the substrate-needs convergence argument.
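To make the crux slightly more concrete, here is a minimal toy sketch in Python (entirely my own illustration, with made-up numbers and functions; it is not Forrest’s model). It only shows the narrow point that a control process which measures consistency with its specified goals says nothing about outcomes it does not measure:

```python
# Minimal toy sketch (my own illustration, with made-up numbers; nothing here is
# Forrest's formalism): an internal control process that only ever measures the
# *specified* goal keeps that goal on track, while an outcome it does not measure
# drifts, and, if correlated with the proxy, systematically accumulates.

import random

random.seed(0)

def propose_action():
    """Candidate action with two effects: one measured, one never measured."""
    proxy_gain = random.uniform(0.0, 1.0)                       # the specified goal
    side_effect = random.uniform(-1.0, 1.0) + 0.5 * proxy_gain  # invisible to control
    return proxy_gain, side_effect

proxy_total = 0.0
side_total = 0.0
for _ in range(1000):
    candidates = [propose_action() for _ in range(5)]
    # The control loop selects purely on the specified goal; it has no term
    # for the side effect, so nothing ever tracks or corrects it.
    chosen = max(candidates, key=lambda a: a[0])
    proxy_total += chosen[0]
    side_total += chosen[1]

print(f"specified goal achieved:    {proxy_total:.1f}")  # looks fine internally
print(f"untracked outcome piled up: {side_total:.1f}")   # never checked against human needs
```

Running it, the “specified goal” number looks great while the untracked number keeps growing. “Completeness” in the sense of (2.) would require the selection step to cover every outcome that matters for human needs, in every context the machinery and the humans end up in.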
When you wrote “suppose a villager cares a whole lot about the people in his village...and routinely works to protect them”, that came across as taking something like (2.) as a premise.
Specifically, “cares a whole lot about the people” is a claim that implies that the care is for the people in and of themselves, regardless of the context they each might (be imagined to) be interacting in. Also, “routinely works to protect them” to me implies a robustness of functioning in ways that are actually caring for the humans (ie. no predominating potential for negative side-effects).
That could be why Forrest replied with “How is this not assuming what you want to prove?”
Some reasons why (1.) is not enough:
Directedness toward specified outcomes some humans want does not imply actual comprehensiveness of care for human needs. The machinery can still cause all sorts of negative side-effects not tracked and/or corrected for by internal control processes.
Even if the machinery is consistently directed toward specified outcomes from within certain contexts, the machinery can simultaneously be directed toward other outcomes as well. Likewise, learning directedness toward human-preferred outcomes can happen simultaneously with learning instrumental behaviour toward self-maintenance, as well as with broader evolutionary selection for individual connected components that persist (for longer, or in greater numbers).
There is no way to assure that significant (unanticipated) changes will not lead to a break from past directed behaviour, where other directed behaviour starts to dominate.
Eg. when the “generator functions” that translate abstract goals into detailed implementations within new contexts start to dysfunction – ie. diverge from what the humans want/would have wanted.
Eg. where the machinery learns that it cannot continue to consistently enact the goal of future human existence.
Eg. once undetected bottom-up evolutionary changes across the population of components have taken over internal control processes.
Before the machinery discovers any actionable “cannot stay safe to humans” result, internal takeover through substrate-needs (or instrumental) convergence could already have removed the machinery’s capacity to implement an across-the-board shut-down.
Even if the machinery does discover the result before convergent takeover, and assuming that “shut-down-if-future-self-dangerous” was originally programmed in, we cannot rely on the machinery to still be consistently implementing that goal. This is because of later selection for, or learning of, other outcome-directed behaviour, and because the (changed) machinery components could dysfunction in this novel context.
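To make the selection point in the last few bullets more concrete, here is a crude toy model (again entirely my own construction, with arbitrary numbers; it is not the substrate-needs convergence argument itself, just an illustration of how selection for persistence can erode a programmed-in shut-down goal):

```python
# Crude toy model (my own construction, arbitrary numbers): components that
# reliably honor a shut-down/correction signal tend, over enough variation and
# selection for persistence, to be displaced by variants that do not.

import random

random.seed(1)

# Each component is reduced to a single flag: does it still honor shut-down?
population = [True] * 1000   # start fully "aligned" on the shut-down goal
MUTATION_RATE = 0.001        # small chance a copy loses the honoring trait
ROUNDS = 200

for generation in range(ROUNDS):
    if generation % 10 == 0:
        # A correction/shut-down signal fires; honoring components deactivate,
        # so only non-honoring variants keep operating and replicating this round
        # (fall back to everyone if no such variants exist yet).
        survivors = [c for c in population if not c] or population
    else:
        survivors = population

    # Replication back up to fixed size, with occasional trait-losing variation.
    population = [
        (c if random.random() > MUTATION_RATE else False)
        for c in random.choices(survivors, k=1000)
    ]

honoring = sum(population)
print(f"components still honoring shut-down after {ROUNDS} rounds: {honoring}/1000")
```

The specific numbers do not matter. The point is that a goal like “shut down if future self is dangerous” is only ever expressed through components that are themselves subject to variation and to selection for whatever persists, so its continued implementation is not something we can treat as fixed.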
To wrap it up:
The kind of “alignment” that is workable for ASI with respect to humans is super fragile. We cannot rely on ASI implementing a shut-down upon discovery.
Is this clarifying? Sorry about the wall of text. I want to make sure I’m being precise enough.