Roko comments on A Nonconstructive Existence Proof of Aligned Superintelligence

Roko 12 Sep 2024 12:06 UTC
2 points
0

it might be that the set of properties one wants is contradictory.

So how is that a problem with AI alignment? If you want something that is impossible, it should come as no surprise that an AI cannot achieve it for you.
- mishka 12 Sep 2024 13:33 UTC
  5 points
  0
  Parent
  (I am not talking about my viewpoint, but about a logical possibility.)
  
  If it so happens that the property of the world is such that
  
  there are no processes where superintelligence is present and the chances of “bad” things with “badness” exceeding some large threshold are small
  
  but at the same time world lines where the chances of “bad” things with “badness” exceeding some large threshold are small do exist, then one has to avoid having superintelligence in order to have a chance at keeping probabilities of some particularly bad things low.
  
  That is what people essentially mean when they say “ASI alignment is impossible”. The situation where something “good enough” (low chances of certain particularly bad things happening) is only possible in the absence of superintelligence, but is impossible when superintelligence is present.
  
  So, they are talking about a property of the world where certain unacceptable deterioration is necessarily linked to the introduction of superintelligence.
  
  I am not talking about my viewpoint, but about a logical possibility. But I don’t think your proof addresses that. In particular, because a directed acyclic graph is not a good model. We need to talk about a process, not a static state, so the model must be recurrent (if it’s a directed acyclic graph, it must be applied in a fashion which makes the overall thing recurrent, for example in an autoregressive mode).
  
  And we are talking about superintelligence, which is usually assumed to be capable of a good deal of self-modification and recursive self-improvement, so the model should incorporate that. The statement of “impossibility of sufficiently benign forms of superintelligence” might potentially take the form of a statement of “impossibility of superintelligence which would refrain from certain kinds of self-modification, with those kinds of self-modification having particularly unacceptable consequences”.
  
  And it’s not enough to draw a graph which refrains from self-modification, because one can argue that a model which agrees to constrain itself in such a radical fashion as to never self-modify in an exploratory fashion is fundamentally not superintelligent (even humans often self-modify when given an opportunity and seeing a potential upside).
  - Roko 12 Sep 2024 14:35 UTC
    2 points
    0
    Parent
    
    a model which agrees to constrain itself in such a radical fashion as to never self-modify in an exploratory fashion is fundamentally not superintelligent
    
    OK, what is your definition of “superintelligent”?
    - mishka 12 Sep 2024 15:16 UTC
      3 points
      0
      Parent
      Being able to beat humans in all endeavours by miles.
      
      That includes the ability to explore novel paths.
      - Roko 12 Sep 2024 15:43 UTC
        3 points
        0
        Parent
        
        Being able to beat humans
        
        What do you mean by humans? How large a group of humans? Infinite?
        mishka 12 Sep 2024 15:44 UTC
        3 points
        0
        Parent
        10 billion
        Roko 12 Sep 2024 16:08 UTC
        2 points
        0
        Parent
        But then it is possible for an AI to be able to beat up to 10 billion humans in all endeavours by miles, but also not modify itself.
        
        In fact, I can prove that such an AI exists.
        
        So you have two different and contradictory definitions of “superintelligence” that you are using.
        mishka 12 Sep 2024 16:27 UTC
        3 points
        0
        Parent
        A realistic one, which can competently program and can competently do AI research?
        
        Surely, since humans do pretty impressive AI research, a superintelligent AI will do better AI research.
        
        What exactly might (even potentially) prevent it from creating drastically improved variants of itself?
        Roko 12 Sep 2024 16:37 UTC
        4 points
        2
        Parent
        A superintelligence based on the first definition you gave (Being able to beat humans in all endeavours by miles) would be able to beat humans at AI research, but it would also be able to beat humans at not doing AI research.
        So, by your own definition, in order to be a superintelligence, it must be able to spend the whole lifetime of the universe not doing AI research.
        mishka 12 Sep 2024 16:41 UTC
        2 points
        0
        Parent
        You mean, a version which decides to sacrifice exploration and self-improvement, despite it being so tempting...
        
        And that after doing quite a bit of exploration and self-improvement (otherwise it would not have gotten to the position of being powerful in the first place).
        
        But then deciding to turn around drastically and become very conservative, and to impose a new “conservative on a new level world order”...
        
        Yes, that is a logical possibility...
        mishka 12 Sep 2024 16:47 UTC
        2 points
        0
        Parent
        Yes, OK.
        
        I doubt that an adequate formal proof is attainable, but the mathematical existence of a “lucky one” is not implausible...
        mishka 12 Sep 2024 16:55 UTC
        2 points
        0
        Parent
        Yes, an informal argument is that if it is way smarter and way more capable than humans, then it potentially should be better at refraining from exercising its capabilities.
        
        In this sense, the theoretical existence of a superintelligence which does not make things worse than they would be without existence of this particular superintelligence seems very plausible, yes… (And it’s a good definition of alignment, “aligned == does not make things notably worse”.)
        mishka 12 Sep 2024 17:02 UTC
        4 points
        0
        Parent
        So these two considerations
        
        if it is way smarter and way more capable than humans, then it potentially should be better at refraining from exercising its capabilities
        
        and
        
        “aligned == does not make things notably worse”
        
        taken together indeed constitute a nice “informal theorem” that the claim of “aligned superintelligence being impossible” looks wrong. (I went back and added my upvotes to this post, even though I don’t think the technique in the linked post is good.)
        Roko 12 Sep 2024 20:56 UTC
        2 points
        0
        Parent
        
        I don’t think the technique in the linked post is good.
        
        Why not?
        mishka 13 Sep 2024 6:23 UTC
        3 points
        0
        Parent
        I think I said already.
        
        We are not aiming for a state to be reached. We need to maintain some properties of processes extending indefinitely in time. That formalism does not seem to do that. It does not talk about invariant properties of processes and other such things, which one needs to care about when trying to maintain properties of processes.
        
        We don’t know fundamental physics. We don’t know the actual nature of quantum space-time, because quantum gravity is unsolved, we don’t know what is “true logic” of the physical world, and so on. There is no reason why one can rely on simple-minded formalisms, on standard Boolean logic, on discrete tables and so on, if one wants to establish something fundamental, when we don’t really know the nature of reality we are trying to approximate.
        
        There are a number of reasons a formalization could fail even if it goes as far as proving the results within a theorem prover (which is not the case here). The first and foremost of those reasons is that formalization might fail to capture the reality with sufficient degree of faithfulness. That is almost certainly the case here.
        
        But then a formal proof (an adequate version of which is likely to be impossible at our current state of knowledge) is not required. A simple informal argument above is more to the point. It’s a very simple argument, and so it makes the idea that “aligned superintelligence might be fundamentally impossible” very unlikely to be true.
        
        First of all, one step this informal argument is making is weakening the notion of “being aligned”. We are only afraid of “catastrophic misalignment”, so let’s redefine the alignment as something simple which avoids that. An AI which sufficiently takes itself out of action, does achieve that. (I actually asked for something a bit stronger, “does not make things notably worse”; that’s also not difficult, via the same mechanism of taking oneself sufficiently out of action.)
        
        And a strongly capable AI should be capable of taking itself out of action, of refraining from doing things. The capability to choose is an important capability; a strongly capable system is a system which, in particular, can make choices.
        
        So, yes, a very capable AI system can avoid being catastrophically misaligned, because it can choose to avoid action. This is that non-constructive proof of existence which has been sought. It’s an informal proof, but that’s fine.
        
        No extra complexity is required, and no extra complexity would make this argument better or more convincing.
        Roko 13 Sep 2024 17:44 UTC
        3 points
        0
        Parent
        
        We need to maintain some properties of processes extending indefinitely in time. That formalism does not seem to do that.
        
        You can run all the same arguments I used, but talk about processes rather than states.
        mishka 13 Sep 2024 19:34 UTC
        2 points
        0
        Parent
        On one hand, you still assume too much:
        
        Since our best models of physics indicate that there is only a finite amount of computation that can ever be done in our universe
        
        No, nothing like that is at all known. It’s not a consensus. There is no consensus that the universe is computable; this is very much a minority viewpoint. And it might always make sense to augment a computer with a (presumably) non-computable element (e.g. a physical random number generator, an analog circuit, a camera, a reader of human real-time input, and so on). An AI does not have to be a computable thing; it can be a hybrid. (In fact, when people model real-world computers as Turing machines instead of as Turing machines with oracles, with the external world being the oracle, it leads to all kinds of problems. E.g. Penrose’s well-known “Gödel argument” makes this mistake and falls apart as soon as one remembers the presence of the oracle.)
        
        Other than that...
        
        Yes, you have an interesting notion of alignment. It is not something that we might want and that might be possible yet unachievable by mere humans, but something much weaker than that (although not as weak as the version I put forward; my version is super-weak, and your version is intermediate in strength):
        
        I claim then that for any generically realizable desirable outcome that is realizable by a group of human advisors, there must exist some AI which will also realize it.
        
        Yes, this is obviously correct. An ASI can choose to emulate a group of humans and its behavior, and being way more capable than that group of humans, it should be able to emulate that group as precisely as needed.
        
        One does not need to say anything else to establish that.
        Roko 14 Sep 2024 8:22 UTC
        3 points
        0
        Parent
        
        No, nothing like that is at all known. It’s not a consensus
        
        I disagree. Modern physics places various bounds on compute, such as the Bekenstein bound.
        
        https://en.wikipedia.org/wiki/Bekenstein_bound
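
        For reference, the bound is usually written as follows. This is the standard textbook form rather than anything taken from the post; R is the radius of a sphere enclosing the system, E is its total energy, S is its entropy, and I is the corresponding information content in bits:

        $$ S \le \frac{2\pi k_B R E}{\hbar c}, \qquad I \le \frac{2\pi R E}{\hbar c \ln 2}. $$

        So for a region of finite size and finite energy, the number of distinguishable states is finite.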
        
        If your objection to my proof involves infinite compute then I am happy to acknowledge that I honestly do not know what happens in that case. It is plausible that since humans are finite in complexity/information/compute, a world with infinite compute would break the symmetry between computers and humans that I am using here. Most likely it means that computers are capable of fundamentally superior outcomes, so there would be “hyperaligned” AIs. But since infinite compute is a minority position I will not pursue it.
        mishka 14 Sep 2024 16:56 UTC
        3 points
        0
        Parent
        I don’t see what the entropy bound has to do with compute. The Bekenstein bound is not much in question, but its link to compute is a different story. It does seem to limit how many bits can be stored in a finite volume (so for a potentially infinite compute an unlimited spatial expansion is needed).
        
        But it does not say anything about the possibility of non-computable processes. It’s not clear if “collapse of wave function” is computable, and it is typically assumed not to be computable. So powerful non-Turing-computable oracles seem likely to be available (that’s much more than “infinite compute”).
        
        But I also think all these technicalities constitute an overkill, I don’t see them as at all relevant.
        
        This seems rather obvious regardless of the underlying model:
        
        An ASI can choose to emulate a group of humans and its behavior, and being way more capable than that group of humans, it should be able to emulate that group as precisely as needed.
        
        This seems obviously true, no matter what.
        
        I don’t see why a more detailed formalization would help to further increase certainty. Especially when there are so many questions about that formalization.
        
        If the situation were different, if the statement were not obvious, even a loose formalization might help. But when the statement seems obvious, the standards a formalization needs to satisfy to further increase our certainty in the truth of the statement become really high...
        Roko 15 Sep 2024 13:36 UTC
        5 points
        0
        Parent
        
        It’s not clear if “collapse of wave function” is computable, and it is typically assumed not to be computable
        
        The wavefunction never actually collapses if you believe in MWI. Rather, a classical reality emerges in all branches thanks to decoherence.
        
        If you think something noncomputable happens because of quantum mechanics, it probably means that your interpretation of QM is wrong and you need to read the Sequences on that.
        mishka 15 Sep 2024 15:16 UTC
        2 points
        0
        Parent
        If you believe in MWI, then this whole argument is… not “wrong”, but very incomplete...
        
        Where is the consideration of branches? What does it mean for one entity to be vastly superior to another, if there are many branches?
        
        If one believes in MWI, then the linked proof does not even start to look like a proof. It obviously considers only a single branch.
        
        And a “subjective navigation” in the branches is not assumed to be computable, even if the “objective multiverse” is computable; that is the whole point of MWI, the “collapse” becomes “subjective navigation”, but this does not make it computable. If a consideration is only of a single branch, that branch is not computable, even if it is embedded in a large computable multiverse.
        
        Not every subset of a computable set (say, of a set of natural numbers) is computable.
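
        A standard counting sketch of why, in symbols (the notation is only illustrative): each decidable set is decided by some program, and there are only countably many programs, so

        $$ \left|\{A \subseteq \mathbb{N} : A \text{ is decidable}\}\right| \le \aleph_0 < 2^{\aleph_0} = \left|\mathcal{P}(\mathbb{N})\right|. $$

        Hence almost all subsets of the (trivially computable) set of natural numbers are not computable.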
        
        An interpretation of QM can’t be “wrong”. It is a completely open research and philosophical question, there is no “right” interpretation, and the Sequences is (thankfully) not a Bible (if even a very respected thinker says something, this does not yet mean that one should accept that without questions).
        Roko 17 Sep 2024 13:40 UTC
        3 points
        0
        Parent
        
        It obviously considers only a single branch.
        
        Thanks to decoherence, you can just ignore any type of interference and treat each branch as a single classical universe.
        mishka 17 Sep 2024 14:01 UTC
        2 points
        0
        Parent
        I don’t think so. If it were classical, we would not be able to observe effects of double-slit experiments and so on.
        
        And, also, there is no notion of “our branch” until one has traveled along it. At any given point in time, there are many branches ahead. Only looking back one can speak about one’s branch. But looking forward one can’t predict the branch one will end up in. One does not know the results of future “observations”/”measurements”. This is not what a classical universe looks like.
        
        (Speaking of MWI, I recall David Deutsch’s “Fabric of Reality” very eloquently explaining effects from “neighboring branches”. The reason I am referencing this book is that this was the work particularly strongly associated with MWI back then. So I think we should be able to rely on his understanding of MWI.)
        Roko 17 Sep 2024 19:18 UTC
        5 points
        2
        Parent
        
        one can’t predict the branch one will end up in
        
        yes one can—all of them!
        mishka 17 Sep 2024 21:54 UTC
        2 points
        0
        Parent
        Yes, but then what do you want to prove?
        
        Something like, “for all branches, [...]”? That might not be that easy to prove or even to formulate. In any case, the linked proof has not even started to deal with this.
        
        Something like, “there exists a branch such that [...]”? That might be quite tractable, but probably not enough for practical purposes.
        
        “The probability that one ends up in a branch with such and such properties is no less than/no more than” [...]? Probably something like that, realistically speaking, but this still needs a lot of work, conceptual and mathematical...
        Roko 21 Sep 2024 16:29 UTC
        2 points
        0
        Parent
        Bringing QM into this is not helping. All these types of questions are completely generic QM questions, and ultimately they come down to the measure | |Ψ⟩ |².
        mishka 21 Sep 2024 21:19 UTC
        2 points
        0
        Parent
        It’s just… having a proof is supposed to boost our confidence that the conclusion is correct...
        
        If the proof relies on assumptions which are already quite far from the majority opinion about our actual reality (and are probably going to deviate further, as AIs will be better physicists and engineers than us and will leverage the strangeness of our physics much further than we do), then what’s the point of that “proof”?
        
        How does having this kind of “proof” increase our confidence in what seems informally correct for a single-branch reality? (It is rather uncertain in a presumed multiverse, but we don’t even know if we are in a multiverse. Bringing a multiverse in might, indeed, be one of the possible objections to the statement, but I don’t know if one wants to pursue this line of discourse, because it is much more complicated than what we are doing here so far.)
        
        (As an intellectual exercise, a proof like that is still of interest, even under the unrealistic assumption that we live in a computable reality. I would not argue with that; it’s still interesting.)
        Roko 17 Sep 2024 19:18 UTC
        3 points
        0
        Parent
        
        we would not be able to observe effects of double-slit experiments
        
        Yes, but thanks to decoherence this generally doesn’t affect macroscopic variables. Branches are causally independent once they have split.
        mishka 17 Sep 2024 21:56 UTC
        3 points
        0
        Parent
        No. I can only repeat my reference to Fabric of Reality as a good presentation of MWI and remind you that we do not live in a classical world, which is easy to confirm empirically.
        
        And there are plenty of known macroscopic quantum effects already, and that list will only grow. Lasers are quantum, superfluidity and superconductivity are quantum, and so on.
        Roko 21 Sep 2024 16:31 UTC
        2 points
        0
        Parent
        Decoherence means that different branches don’t interfere with each other on macroscopic scales. That’s just the way it works.
        
        Superfluids/superconductors/lasers are still microscopic effects that only matter at the scale of atoms or at ultra-low temperature or both.
        mishka 21 Sep 2024 21:23 UTC
        3 points
        0
        Parent
        No, not microscopic.
        
        Coherent light produced by lasers is not microscopic; we see its traces in the air. And we see the consequences (old-fashioned holography and the ability to cut things with focused light, even at large distances). Room temperature is fine for that.
        
        Superconductors used in the industry are not microscopic (and the temperatures are high enough to enable industrial use of them in rather common devices such as MRI scanners).
      - mishka 12 Sep 2024 15:25 UTC
        2 points
        0
        Parent
        And I personally think that superintelligence leading to good trajectories is possible. It seems unlikely that we are in a reality where there is a theorem to the contrary.
        
        It feels intuitively likely that it is possible to have a superintelligence, or an ecosystem of superintelligences, that is wise enough to be able to navigate well.
        
        But I doubt that one is likely to be able to formally prove that.
        mishka 12 Sep 2024 15:49 UTC
        2 points
        0
        Parent
        
        But I doubt that one is likely to be able to formally prove that.
        
        E.g. it is possible that we are in a reality where very cautious and reasonable, but sufficiently advanced experiments in quantum gravity lead to a disaster.
        
        Advanced systems are likely to reach those capabilities, and they might make very reasonable estimates that it’s OK to proceed, but due to bad luck of being in a particularly unfortunate reality, the “local neighborhood” might get destroyed as a result… One can’t prove that it’s not the case...
        
        Whereas, if the level of overall intelligence remains sufficiently low, we might not be able to ever achieve the technical capabilities to get into the danger zone...
        
        It is logically possible that the reality is like that.
        Roko 12 Sep 2024 16:10 UTC
        4 points
        0
        Parent
        
        It is logically possible that the reality is like that.
        
        Yes, it is. But even if that is the case, by the argument given in this post, there must exist an AI system that avoids the danger zone.
        mishka 12 Sep 2024 16:30 UTC
        2 points
        0
        Parent
        Yes, possibly.
        
        Not by the argument given in the post (considering quantum gravity, one immediately sees how inadequate and unrealistic the model in the post is).
        
        But yes, it is possible that they will be so wise that they will be cautious enough even in a very unfortunate situation.
        
        Yes, I was trying to explicitly refute your claim, but my refutation has holes.
        
        (I don’t think you have a valid proof, but this is not yet a counterexample.)
  - Roko 12 Sep 2024 14:36 UTC
    1 point
    0
    Parent
    
    there are no processes where superintelligence is present and the chances of “bad” things with “badness” exceeding some large threshold are small
    
    Do you think a team of sufficiently wise humans is capable of producing a world where the chances of “bad” things with “badness” exceeding some large threshold are small? Yes or no?
    - mishka 12 Sep 2024 15:14 UTC
      4 points
      0
      Parent
      
      (I am not talking about my viewpoint, but about a logical possibility.)
      
      In particular, humans might be able to refrain from screwing the world too badly, if they avoid certain paths.
      
      (No, personally I don’t think so. If people crack down hard enough, they probably screw up the world pretty badly due to the crackdown, and if they don’t crack down hard enough, then people will explore various paths leading to bad trajectories, via superintelligence or via other more mundane means. I personally don’t see a safe path, and I don’t know how to estimate probabilities. But it is not a logical impossibility. E.g. if someone makes all humans dumb by putting a magic irreversible stupidifier in the air and water, perhaps those things can be avoided, hence it is logically possible. Do I want “safety” at this price? No, I think it’s better to take risks...)
      - Roko 12 Sep 2024 15:42 UTC
        2 points
        0
        Parent
        
        humans might be able to refrain from screwing the world too badly
        
        But then, if a team of humans is capable of producing a world where the chances of “bad” things with “badness” exceeding some large threshold are small, by exactly the argument given in this post there must be a Lookup Table which simply contains the same boolean function.
        
        So, your claim is provably false. It is not possible for something (anything) to be generically achievable by humans but not by AI, and you’re just hitting a special case of that.
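
        A minimal way to write this step down, assuming the team’s behavior can be captured as a finite Boolean function at all (the symbols f, n, m and T are illustrative labels, not taken from the post):

        $$ f : \{0,1\}^n \to \{0,1\}^m, \qquad T[x] := f(x) \quad \text{for all } x \in \{0,1\}^n. $$

        The table T has 2^n entries of m bits each and agrees with f on every input by construction; nothing here says the table is small or practical to build.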
        mishka 12 Sep 2024 15:45 UTC
        2 points
        0
        Parent
        No, they are not “producing”. They are just being impotent enough. Things are happening on their own...
        
        And I don’t believe a Lookup Table is a good model.
        Roko 12 Sep 2024 16:06 UTC
        2 points
        0
        Parent
        
        They are just being impotent enough
        
        An AI can also be impotent. Surely this is obvious to you? Have you not thought this through properly?
        mishka 12 Sep 2024 16:24 UTC
        4 points
        0
        Parent
        It can. Then it is not “superintelligence”.
        
        Superintelligence is capable of almost unlimited self-improvement.
        
        (Even our miserable recursive self-improvement AI experiments show rather impressive results before saturating. Well, they will not keep saturating forever. Currently, this self-improvement typically happens via rather awkward and semi-competent generation of novel Python code. Soon it will be done by better means (which we probably should not discuss here).)
        Roko 12 Sep 2024 20:41 UTC
        2 points
        0
        Parent
        By your own definition of “superintelligence”, it must be better at “being impotent” than any group of fewer than 10 billion humans. So it must be super-good at being impotent and doing very little, if that is required.
        mishka 13 Sep 2024 5:54 UTC
        2 points
        0
        Parent
        Being impotent is not a property of “being good”. One is not aiming for that.
        
        It’s just a limitation. One usually does not self-impose it (with rare exceptions), although one might want to impose it on adversaries.
        
        “Being impotent” is always worse. One can’t be “better at it”.
        
        One can be better at refraining from exercising the capability (we have a different branch in this discussion for that).
        Roko 13 Sep 2024 17:44 UTC
        2 points
        0
        Parent
        
        One can be better at refraining from exercising the capability
        
        If that is what is needed, then it must (by definition) be better at it.
        mishka 13 Sep 2024 19:38 UTC
        2 points
        0
        Parent
        Not if it is disabling.
        
        If it is disabling, then one has a self-contradictory situation (if ASI fundamentally disables itself, then it stops being more capable, and stops being an ASI, and can’t keep exercising its superiority; it’s the same as if it self-destructs).
        Roko 13 Sep 2024 19:52 UTC
        2 points
        0
        Parent
        If a superintelligence is worse than a human at permanently disabling itself—given that as the only required task—then there is a task that it is subhuman at and therefore not a superintelligence.
        Roko 13 Sep 2024 19:56 UTC
        2 points
        0
        Parent
        I suppose you could make some modifications to your definition to take account of this. But in any case, I think it’s not a great definition, as it makes an implicit assumption about the structure of problems (basically, that problems have a single “scalar” difficulty).
        mishka 13 Sep 2024 20:03 UTC
        2 points
        0
        Parent
        No, it can disable itself.
        
        But it is not a solution, it is a counterproductive action. It makes things worse.
        
        (In some sense, it has an obligation not to irreversibly disable itself.)