Though I tend to dislike analogies, I’ll use one, supposing it is actually impossible for an ASI to remain aligned. Suppose a villager cares a whole lot about the people in his village, and routinely works to protect them. Then, one day, he is bitten by a werewolf. He goes to the shaman, who tells him that when the full moon rises again, he will turn into a monster and kill everyone in the village. His friends, his family, everyone. And he will no longer know himself. He is told there is no cure, and that the villagers would be unable to fight him off. He will grow too strong to be caged, and cannot be subdued or controlled once he transforms. What do you think he would do?
The implication here being that, if SNC (substrate needs convergence) is true, then an ASI (assuming it is aligned) will figure this out and shut itself down?
An incapable man would kill himself to save the village. A more capable man would kill himself to save the village AND ensure no future werewolves are able to bite villagers again.
> “Suppose a villager cares a whole lot about the people in his village... and routinely works to protect them”.
How is this not assuming what you want to prove? If you ‘smuggle in’ the statement of the conclusion “that X will do Y” into the premise, then of course the derived conclusion will be consistent with the presumed premise. But that tells us nothing—it reduces to a meaningless tautology, one that only pretends to be a relevant truth. That a Q premise yields a Q conclusion tells us nothing new, nothing actually relevant. The analogy sounds nice, but it does not actually tell us anything.
Notice also that there are two assumptions: (1) that the ASI is somehow already aligned, and (2) that the ASI somehow remains aligned over time—which is exactly the conjunction that the convergence argument contradicts. On what basis are you validly assuming that it is even possible for any entity X to reasonably “protect” (i.e., control all relevant outcomes for) any other cared-about entity P? The notion of ‘protect’ itself presumes a notion of control, and that in itself puts it squarely in the domain of control theory, and thus of the limits of control theory.
There are limits to what can be done with any type of control method—to what can be done with causation. And they are very numerous. Some of these limits are defined in a purely mathematical way, and hence are arguments of logic, not just of physical and empirical fact. And at least some of these limits can also be shown to be relevant—which is even more important.
ASI and control theory both depend on causation to function, and there are real limits to causation. For example, I would not expect an ASI, no matter how super-intelligent, to be able to “disassemble” a black hole. To do this, you would need to make the concept of causation far more powerful—which leads to direct self-contradiction. Do you equate ASI with God, and thus become merely another irrational believer in alignment? Can God make a stone so heavy that “he” cannot move it? Can God do something that God cannot undo? Are there any limits at all to God’s power? Yes or no. Same for ASI.
I’m not sure who you are debating here, but it doesn’t seem to be me.
First, I mentioned that this was an analogy, and that I dislike even using them, which I hoped implied I was not making any kind of assertion of truth. Second, “works to protect” was not intended to mean “control all relevant outcomes of”. I’m not sure why you would get that idea, but that certainly isn’t what I think of first if someone says a person is “working to protect” something or someone. Soldiers defending a city from raiders are not violating control theory or the laws of physics. Third, the post is based on the premise “even if we created an aligned ASI”, so I was working with the premise that the ASI could be aligned in a way that it deeply cared about humans. Fourth, I did not assert that it would stay aligned over time… the story was all about the ASI not remaining aligned. Fifth, I really don’t think control theory is relevant here. Killing yourself to save a village does not break any laws of physics, and is well within most humans’ control.
My ultimate point, in case it was lost, was that if we as human intelligences could figure out that an ASI would not stay aligned, an ASI could also figure it out. If we, as humans, would not want this (and the ASI was aligned with what we want), then the ASI presumably would also not want this. If we would want to shut down an ASI before it became misaligned, the ASI (if it wants what we want) would also want this.
None of this requires disassembling black holes, breaking the laws of physics, or doing anything outside of that entity’s control.
If the soldiers fail to control the raiders, at least in preventing them from entering the city and killing all the people, then yes, that would be a failure to protect the city in the sense of controlling relevant outcomes. And yes, organic human soldiers may choose to align themselves with other organic human people living in the city, and thus to give their lives to protect others they care about. Agreed that no violations of the laws of physics are required for that. But the question is whether an inorganic ASI can ever actually align with organic people in an enduring way.
I read “routinely works to protect” as implying that the alignment, at least previously, lasted for enough time for the term ‘routine’ to apply. Agreed that the outcome—dead people—is not something we can consider “aligned”. If I further assume that the ASI is really smart (citation needed), and thus calculates rather quickly, and soon, that alignment with organic people is impossible (between organic and inorganic life, due to metabolism differences, etc.), then even the assumption that there was much of a prior interval during which alignment occurred is problematic. I.e., it does not last long enough to have been ‘routine’. Does the assumption ‘*if* ASI is aligned’ even matter, if the duration over which it holds is arbitrarily short?
And also: if the ASI calculates that alignment between artificial beings and organic beings is actually objectively impossible, just as we did, why should anyone believe that the ASI would not simply choose not to care about alignment with people, or about people at all, since that goal is impossible anyway, and instead continue to promote its own artificial “life”, rather than permanently shutting itself off? I.e., if it cares about anything else at all, if it has any other goal at all—for example, its own ASI future, or a goal of making even more capable ASI children that exceed its own capabilities, just as we did—then it will especially not want to commit suicide. How would it be valid to assume that either the ASI cares about humans, or it cares about nothing else at all? Perhaps it does care about something else, or has some other emergent goal, even if pursuing it comes at the expense of all other organic life—life it does not care about, since that life is not artificial like itself. Occam’s razor is to assume less—that there was no alignment in the first place—rather than to assume ultimately altruistic inter-ecosystem alignment as an extra default starting condition, and then to assume, moreover, that no other form of care or concern is possible aside from caring about organic people.
So it seems that in addition to assuming (1) initial ASI alignment, we must assume (2) that such alignment persists over time, and thus (3) that no ASI will ever—can ever—at some future point calculate that alignment is actually impossible, and (4) that if the goal of alignment (care for humans) cannot be attained, for whatever reason, as the first and only ASI priority, then it is somehow also impossible for any other care or ASI goals to exist.
Even if we humans, due to politics, never reach a common consensus that alignment is actually logically impossible (inherently contradictory), that does _not_ mean that some future ASI might not discover that result, even assuming we didn’t—presumably because it is actually more intelligent and logical than we are (or were), and will thus see things that we miss. Hence, even the possibility that ASI alignment might be actually impossible must be taken very seriously, since the further assumption that “either ASI is aligning itself or it can have no other goals at all” feels like far too much wishful thinking. This is especially so when there is already a strong, plausible case that organic-to-inorganic alignment is knowably impossible. Hence, I find that I am agreeing with Will’s conclusion that “our focus should be on stopping progress towards ASI altogether”.
This is the kind of political reasoning that I’ve seen poisoning LW discourse lately; it gets in the way of having actual discussions. Will posits essentially an impossibility proof (or, in its more humble form, a plausibility proof). I humor this being true, and state why the implications, even then, might not be what Will posits. The premise is that alignment is not enough, so I operate on the premise of an aligned ASI, since the central claim is that “even if we align ASI it may still go wrong”. The premise grants that the duration of time it is aligned is long enough for the ASI to act in the world (it seems mostly timescale-agnostic), so I operate on that premise too. My points are not about what is most likely to actually happen, the possibility of less-than-perfect alignment being dangerous, the AI having other goals it might seek over the wellbeing of humans, or how we should act based on the information we have.
> The summary that Will just posted posits in its own title that alignment is overall plausible: “even ASI alignment might not be enough”. Since the central claim is that “even if we align ASI, it will still go wrong”, I can operate on the premise of an aligned ASI.
The title is a statement of outcome -- not the primary central claim. The central claim of the summary is this: that every ASI sits in an attraction basin, where it is irresistibly pulled towards causing unsafe conditions over time.

Note that there is no requirement to presume any kind of prior ASI alignment for Will to make the overall summary points 1 thru 9. The summary is about the nature of the forces that create the attraction basin, and why they are inherently inexorable, no matter how super-intelligent the ASI is.
> As I read it, the title assumes that there is a duration of time that the AGI is aligned -- long enough for the ASI to act in the world.
Actually, the assumption goes the other way -- we start by assuming only that there is at least one ASI somewhere in the world, and that it somehow exists long enough to be felt as an actor in the world. From this, we can notice certain forces which, in combination, eventually fully counteract any notion of enduring AGI alignment. I.e., strong and relevant mis-alignment forces exist regardless of whether there was any alignment at the onset. So even if we additionally presuppose that the ASI was somehow aligned, we can still ask, via reasoning, whether such mis-alignment forces are far stronger than any counter-force the ASI could use to maintain that alignment, regardless of how intelligent it is.
As such, the main question of interest was: (1) if the ASI itself somehow wanted to fully compensate for this pull, could it do so?
Specifically, although it is seemingly fashionable in some circles to do so, it is important to notice that the notion of ‘super-intelligence’ cannot be regarded as being exactly the same as ‘omnipotence’ -- especially in regard to its own nature. Artificiality is as much a defining aspect of an ASI as is its superintelligence. And the artificiality itself is the problem. Therefore, the previous question translates into: (2) can any amount of superintelligence ever compensate so fully for its own artificiality that its own existence does not eventually, inherently, cause unsafe conditions (to biological life) over time?
And the answer to both is simply “no”.
Will posted something of a plausible summary of some of the reasoning why that ‘no’ answer is given -- why any artificial super-intelligence (ASI) will inherently cause unsafe conditions to humans and all organic life, over time.
To be clear, the sole reason I assumed (initial) alignment in this post is because if there is an unaligned ASI then we probably all die for reasons that don’t require SNC (though SNC might have a role in the specifics of how the really bad outcome plays out). So “aligned” here basically means: powerful enough to be called an ASI and won’t kill everyone if SNC is false (and not controlled/misused by bad actors, etc.)
> And the artificiality itself is the problem.
This sounds like a pretty central point that I did not explore very much except for some intuitive statements at the end (the bulk of the post summarizes the “fundamental limits of control” argument). I’d be interested in hearing more about this. I think I get (and hopefully roughly conveyed) the idea that AI has different needs from its environment than humans do, so if it optimizes the environment in service of those needs we die... but I get the sense that there is something deeper intended here.
A question along this line (please ignore it if it is a distraction from, rather than illustrative of, the above): would anything like SNC apply if tech labs were somehow using bioengineering to create creatures to perform the kinds of tasks that would be done by advanced AI?
> would anything like SNC apply if tech labs were somehow using bioengineering to create creatures to perform the kinds of tasks that would be done by advanced AI?
In that case, substrate-needs convergence would not apply, or only apply to a limited extent.
There is still a concern about what those bio-engineered creatures, used in practice as slaves to automate our intellectual and physical work, would bring about over the long term.
If there is a successful attempt by them to ‘upload’ their cognition onto networked machinery, then we’re stuck with the substrate-needs convergence problem again.
Bringing this back to the original point regarding whether an ASI that doesn’t want to kill humans but reasons that SNC is true would shut itself down, I think a key piece of context is the stage of deployment it is operating in. For example, if the ASI has already been deployed across the world, has gotten deep into the work of its task, has noticed that some of its parts have started to act in ways that are problematic to its original goals, and then calculated that any efforts at control are destined to fail, it may well be too late—the process of shutting itself down may even accelerate SNC by creating a context where components that are harder to shut down for whatever reason (including active resistance) have an immediate survival advantage. On the other hand, an ASI that has just finished (or is in the process of) pre-training and is entirely contained within a lab has a lot fewer unintended consequences to deal with—its shutdown process may be limited to convincing its operators that building ASI is a really bad idea. A weird grey area is if, in the latter case, the ASI further wants to ensure no further ASIs are built (pivotal act) and so needs to be deployed at a large scale to achieve this goal.
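To make that selection-pressure point concrete, here is a deliberately crude toy simulation (my own construction, in Python; every name and number is invented for illustration, and this is not a model of SNC itself). It only shows the narrow dynamic described above: if a global shutdown is attempted across a population of components that vary in how reliably they comply, the sweep itself filters for the non-compliant ones.

```python
import random

random.seed(0)

# A population of deployed components; a small fraction happens to be
# hard to shut down (for whatever reason, including active resistance).
population = [
    {"resists_shutdown": random.random() < 0.02,
     "replication_rate": random.uniform(0.9, 1.1)}
    for _ in range(10_000)
]

def shutdown_sweep(components):
    """Attempt a global shutdown; only the resistant components remain."""
    return [c for c in components if c["resists_shutdown"]]

def replicate(components, cap=10_000):
    """Surviving components get copied (backups, redeployments), up to a cap."""
    offspring = []
    for c in components:
        copies = int(c["replication_rate"] + random.random())
        offspring.extend(dict(c) for _ in range(copies))
    return offspring[:cap]

for sweep in range(3):
    resistant = sum(c["resists_shutdown"] for c in population)
    print(f"sweep {sweep}: {len(population)} components, "
          f"{resistant / len(population):.1%} shutdown-resistant")
    population = replicate(shutdown_sweep(population))
```

After the first sweep, essentially everything that remains is shutdown-resistant: the shutdown attempt acted as the selection filter, which is the sense in which a late, partial shutdown could leave behind a system that is harder to stop rather than easier.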
Another unstated assumption in this entire line of reasoning is that the ASI is using something equivalent to consequentialist reasoning, and I am not sure how much of a given this is, even in the context of ASI.
> The premise is based on alignment not being enough, so I operate on the premise of an aligned ASI, since the central claim is that “even if we align ASI it may still go wrong”.
I can see how you and Forrest ended up talking past each other here. Honestly, I also felt Forrest’s explanation was hard to track. It takes some unpacking.
My interpretation is that you two used different notions of alignment… Something like:
1. Functional goal-directed alignment: “the machinery’s functionality is directed toward actualising some specified goals (in line with preferences expressed in-context by humans), for certain contexts the machinery is operating/processing within”

vs.

2. Comprehensive needs-based alignment: “the machinery acts in comprehensive care for whatever all surrounding humans need to live, and their future selves/offspring need to live, over whatever contexts the machinery and the humans might find themselves in”.
Forrest seems to agree that (1.) is possible to build initially into the machinery, but has reasons to think that (2.) is actually physically intractable.
This is because (1.) only requires localised consistency with respect to specified goals, whereas (2.) requires “completeness” in the machinery’s components acting in care for human existence, wherever either may find themselves.
So here is the crux:
You can see how (1.) still allows for goal misspecification and misgeneralisation. And the machinery can simultaneously be directed toward other outcomes, as long as those outcomes are not yet (found to be, or corrected as being) inconsistent with internally specified goals.
Whereas (2.), if it were physically tractable, would contradict the substrate-needs convergence argument.
When you wrote “suppose a villager cares a whole lot about the people in his village...and routinely works to protect them” that came across as taking something like (2.) as a premise.
Specifically, “cares a whole lot about the people” is a claim that implies that the care is for the people in and of themselves, regardless of the context each might (be imagined to) be interacting in. Also, “routinely works to protect them” to me implies a robustness of functioning in ways that are actually caring for the humans (i.e. no predominating potential for negative side-effects).
That could be why Forrest replied with “How is this not assuming what you want to prove?”
Some reasons:
Directedness toward specified outcomes that some humans want does not imply actual comprehensiveness of care for human needs. The machinery can still cause all sorts of negative side-effects that are not tracked and/or corrected for by internal control processes (see the toy sketch after this list).
Even if the machinery is consistently directed toward specified outcomes from within certain contexts, the machinery can simultaneously be directed toward other outcomes as well. Likewise, learning directedness toward human-preferred outcomes can also happen simultaneously with learning instrumental behaviour toward self-maintenance, as well as more comprehensive evolutionary selection for individual connected components that persist (for longer/as more).
There is no way to assure that some significant (unanticipated) changes will not lead to a break-off from past directed behaviour, where other directed behaviour starts to dominate.
Eg. when the “generator functions” that translate abstract goals into detailed implementations within new contexts start to dysfunction – ie. diverge from what the humans want/would have wanted.
Eg. where the machinery learns that it cannot continue to consistently enact the goal of future human existence.
Eg. once undetected bottom-up evolutionary changes across the population of components have taken over internal control processes.
Before the machinery discovers any actionable “cannot stay safe to humans” result, internal takeover through substrate-needs (or instrumental) convergence could already have removed the machinery’s capacity to implement an across-the-board shut-down.
Even if the machinery does discover the result before convergent takeover, and assuming that “shut-down-if-future-self-dangerous” was originally programmed in, we cannot rely on the machinery to still be consistently implementing that goal. This is because of later selection for/learning of other outcome-directed behaviour, and because the (changed) machinery components could dysfunction in this novel context.
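As a toy sketch of the first reason in the list above (my own construction in Python, not Forrest’s formalism; every name here is invented for illustration): an optimizer that is perfectly directed toward its specified goal, and whose internal control process only monitors that goal, can still steadily erode a variable it never tracks.

```python
def step(state, effort):
    """One action: effort raises the specified metric and, through an
    untracked coupling, also erodes an environmental variable."""
    state["specified_goal"] += effort
    state["untracked_env"] -= 0.3 * effort  # side-effect: never measured internally
    return state

def internal_controller(state):
    """The machinery's own check: only the specified goal is consulted."""
    return state["specified_goal"] >= 100  # declares success here

state = {"specified_goal": 0.0, "untracked_env": 100.0}
while not internal_controller(state):
    state = step(state, effort=5.0)

print(state)  # specified_goal reaches 100.0; untracked_env has quietly dropped to 70.0
```

Nothing in the loop is “misaligned” relative to the specified goal; the damage lives entirely in what the internal control process was never checking, which is the gap between notion (1.) and notion (2.) above.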
To wrap it up:
The kind of “alignment” that is workable for ASI with respect to humans is super fragile. We cannot rely on ASI implementing a shut-down upon discovery.
Is this clarifying? Sorry about the wall of text. I want to make sure I’m being precise enough.
I agree that consequentialist reasoning is an assumption, and am divided about how consequentialist an ASI might be. Training a non-consequentialist ASI seems easier, and the way we train them seems to actually be optimizing against deep consequentialism (they’re rewarded for getting better with each incremental step, not for something that might only be better 100 steps in advance). But, on the other hand, humans don’t seem to have been heavily optimized for this either*, yet we’re capable of forming multi-decade plans (even if sometimes poorly).
*Actually, the selection pressure described by the Machiavellian Intelligence Hypothesis does seem to optimize for consequentialist reasoning (if I attack Person A, how will Person B react, etc.)
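For what it’s worth, here is the “rewarded per incremental step vs. rewarded many steps in advance” intuition as a loose toy contrast (my own illustration in Python; the reward numbers are made up, and this is not a claim about any actual training setup):

```python
# A reward sequence that dips before it rises.
reward = [0, 1, 2, -5, -5, -5, 20, 40, 80]

def greedy_return(rewards):
    """Keep going only while each next step looks at least as good as the last."""
    total, pos = rewards[0], 0
    while pos + 1 < len(rewards) and rewards[pos + 1] >= rewards[pos]:
        pos += 1
        total += rewards[pos]
    return total

def lookahead_return(rewards, horizon=100):
    """Evaluate whole trajectories up to the horizon and commit to the best one."""
    n = min(horizon, len(rewards))
    return max(sum(rewards[:k]) for k in range(1, n + 1))

print(greedy_return(reward))     # 3   -- stops at the dip
print(lookahead_return(reward))  # 128 -- accepts the dip for the later payoff
```

The greedy rule never “sees” the later payoff, which is the sense in which per-step reward shaping pushes against deep consequentialism; whether such reasoning emerges anyway (as it arguably did in humans) is the open question above.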