Sorry, I seem to have missed the problems mentioned in that section on my first read.
There’s no reason to expect that AGI would naturally “stall” at the exact same level of performance and restrictions.
I’m not claiming the AGI would stall at human level, I’m claiming that on your model, the discontinuity should have some decent likelihood of ending at or before human level.
(I care about this because I think it cuts against this point: We only have one shot. There will be a sharp discontinuity in capabilities once we get to AGI, and attempts to iterate on alignment will fail. Either we get AGI right on the first try, or we die. In particular it seems like if the discontinuity ends before human level then you can iterate on alignment.)
that algorithm is still resource-constrained (by the brain’s compute) and privilege-constrained within the mind (e. g., it doesn’t have full write-access to our instincts)
Why isn’t this also true of the weak AGI? Current models cannot autonomously get more compute (humans have to give it to them) or perform gradient descent on their own weights (unless the humans specifically try to make that happen); most humans placed in the models’ position would not be able to do that either.
It sounds like your answer is that the development of AGI could lead to something below-human-level, that wouldn’t be able to get itself more compute / privileges, but we will not realize that it’s AGI, so we’ll give it more compute / privileges until it gets to “so superintelligent we can’t do anything about it”. Is that correct?
There would be no warning signs, because “weak” AGI (human-level or below) can’t be clearly distinguished from a very capable pre-AGI system, based solely on externally-visible behaviour.
… Huh. How do you know that humans are generally intelligent? Are you relying on introspection on your own cognitive process, and extrapolating that to other humans?
What if our policy is to scale up resources / privileges available to almost-human-level AI very slowly? Presumably after getting to a somewhat-below-human-level AGI, with a small amount of additional resources it would get to a mildly-superhuman-level AI, and we could distinguish it then?
Or maybe you’re relying on an assumption that the AGI immediately becomes deceptive and successfully hides the fact that it’s an AGI?
I’m not claiming the AGI would stall at human level, I’m claiming that on your model, the discontinuity should have some decent likelihood of ending at or before human level.
Hm? “Stall at the human level” and “the discontinuity ends at or before the human level” read like the same thing to me. What difference do you see between the two?
It sounds like your answer is that the development of AGI could lead to something below-human-level, that wouldn’t be able to get itself more compute / privileges, but we will not realize that it’s AGI, so we’ll give it more compute / privileges until it gets to “so superintelligent we can’t do anything about it”. Is that correct?
Basically, except instead of directly giving it privileges/compute, I meant that we’d keep training it until the SGD gives the GI component more compute and privileges over the rest of the model (e. g., a better ability to rewrite its instincts).
The strategy of slowly scaling our AI up is workable at the core, but IMO there are a lot of complications:
A “mildly-superhuman” AGI, or even just a genius-human AGI, is still an omnicide risk (see also). I wouldn’t want to experiment with that; I would want it safely at average-human-or-below level. It’s likely hard to “catch” it at that level by inspecting its external behavior, though: that can probably only be done reliably via advanced interpretability tools.
Deceptiveness (and manipulation) is a significant factor, as you’ve mentioned. Even just a mildly-superhuman AGI will likely be very good at it. Maybe not implacably good, but it’d be like working bare-handed with an extremely dangerous chemical substance, with all of humanity at stake.
The problem of “iterating” on this system. If we have just a “weak” AGI on our hands, it’s mostly a pre-AGI system, with a “weak” general-intelligence component that doesn’t control much. Any “naive” approaches, like blindly training interpretability probes on it or something, would likely ignore that weak GI component, and focus mainly on analysing or shaping heuristics/shards. To get the right kind of experience from it, we’d have to very precisely aim our experiments at the GI component — which, again, likely requires advanced interpretability tools.
Basically, I think we need to catch the AGI-ness while it’s in an “asymptomatic” stage, because the moment it becomes visible, it’s likely already incredibly dangerous (if not necessarily maximally dangerous).
… Huh. How do you know that humans are generally intelligent? Are you relying on introspection on your own cognitive process, and extrapolating that to other humans?
More or less, plus the theoretical argument from the apparent Turing-completeness of human understanding and the lack of empirical evidence to the contrary. Our “mental vocabulary” is Turing-complete, so we should very literally be able to model anything that can be modeled (up to our working-memory limits) — and, indeed, we’re yet to observe anything we can’t model.
I’m not sure why the extrapolation step would be suspect?
Hm? “Stall at the human level” and “the discontinuity ends at or before the human level” read like the same thing to me. What difference do you see between the two?
Discontinuity ending (without stalling): [graph]
Stalling: [graph]
Basically, except instead of directly giving it privileges/compute, I meant that we’d keep training it until the SGD gives the GI component more compute and privileges over the rest of the model (e. g., a better ability to rewrite its instincts).
Are you imagining systems that are built differently from today? Because I’m not seeing how SGD could give the GI component an ability to rewrite the weights or get more compute given today’s architectures and training regimes.
(Unless you mean “SGD enhances the GI component until the GI component is able to hack into the substrate it is running on to access the memory containing its own weights, which it can then edit”, though I feel like it is inaccurate to summarize this as “SGD gives it more privileges”, so probably you don’t mean that)
(Or perhaps you mean “SGD creates a set of weights that effectively treats the input English tokens as a programming language by which the network’s behavior can be controlled, and the GI component can then select tokens to output that both achieve low loss and also allow it to control its instincts on the next forward pass”, but this also seems super exotic and is probably not what you mean.)
More or less, plus the theoretical argument from the apparent Turing-completeness of human understanding and the lack of empirical evidence to the contrary.
Interesting. Personally I would talk about humans generalizing to doing science as evidence for our general intelligence. The theoretical arguments + introspection are relatively minor bits of evidence relative to that, for me. I’m surprised it isn’t the same for you.
(If you did buy that story though, then I’d think it should be possible in your view to have behavioral tests of AGI before it is so superintelligent that we’ve lost control.)
I’m not sure why the extrapolation step would be suspect?
It isn’t suspect, sorry, I didn’t mean to imply that.
Ah, makes sense.
Are you imagining systems that are built differently from today?
I do expect that some sort of ability to reprogram itself at inference time will be ~necessary for AGI, yes. But I also had in mind something like your “SGD creates a set of weights that effectively treats the input English tokens as a programming language” example. In the unlikely case that modern transformers are AGI-complete, I’d expect something on that order of exoticism to be necessary (but it’s not my baseline prediction).
Personally I would talk about humans generalizing to doing science as evidence for our general intelligence
“Doing science” is meant to be covered by “lack of empirical evidence that there’s anything in the universe that humans can’t model”. Doing science implies the ability to learn/invent new abstractions, and we’re yet to observe any limits to how far we can take it / what that trick allows us to understand.
(If you did buy that story though, then I’d think it should be possible in your view to have behavioral tests of AGI before it is so superintelligent that we’ve lost control.)
Mmm. Consider a scheme like the following:
Let T2 be the current date.
Train an AI on all of humanity’s knowledge up to a point in time T1, where T1<T2.
Assemble a list D of all scientific discoveries made in the time period (T1;T2].
See if the AI can replicate these discoveries.
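In code, the scheme might look something like this minimal sketch. Every helper here (train_model_with_cutoff, discoveries_between, can_replicate) is a hypothetical placeholder; the point is only to make the structure of the test concrete, not to claim any of these steps is easy:

```python
from datetime import date

# Minimal sketch of the scheme above. All three helpers are hypothetical
# placeholders; they exist only to make the structure of the test concrete.

def train_model_with_cutoff(cutoff: date):
    """Stand-in for training an AI only on data produced before `cutoff` (T1)."""
    raise NotImplementedError("placeholder for an actual training pipeline")

def discoveries_between(t1: date, t2: date) -> list[str]:
    """Stand-in for assembling the list D of discoveries made in (t1; t2]."""
    raise NotImplementedError("placeholder for a curated benchmark")

def can_replicate(model, discovery: str) -> bool:
    """Stand-in for judging whether the model rederives a given discovery."""
    raise NotImplementedError("placeholder for an evaluation protocol")

def replication_score(t1: date, t2: date) -> float:
    """Fraction of the discoveries in D that the cutoff-trained model replicates."""
    model = train_model_with_cutoff(t1)
    d = discoveries_between(t1, t2)
    return sum(can_replicate(model, disc) for disc in d) / len(d)
```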
At face value, if the AI can do that, it should be considered able to “do science” and therefore AGI, right?
I would dispute that. If the period (T1;T2] is short enough, then it’s likely that most of the cognitive work needed to make the leap to any discovery in D is already present in the data up to T1. Making a discovery from that starting point doesn’t necessarily require developing new abstractions/doing science — it may be doable just by interpolating between a few already-known concepts. And here, some asymmetry between humans and e. g. SOTA LLMs becomes relevant:
No human knows everything that humanity as a whole knows. Imagine if making some discovery in D by interpolation required combining two very “distant” concepts, like a physics insight and advanced biology knowledge. It’s unlikely that there’d be a human with sufficient expertise in both, so a human would likely have to do it by actual science (e. g., a biologist would re-derive the physics insight from first principles).
An LLM, however, has a bird’s-eye view of the entire human concept-space up to T1. It directly sees both the physics insight and the biology knowledge at once, so it can just interpolate between them, without doing truly novel research.
Thus, the ability to produce marginal scientific insights may indicate either the ability to “do science”, or merely that the particular insight is a simple interpolation between already-known but distant concepts.
On the other hand, now imagine that the period (T1;T2] is very large, e. g. from 1940 to 2020. We’d then be asking our AI to make very significant discoveries, ones that surely can’t be made by simple interpolation, only by actually building chains of novel abstractions. But… well, most humans can’t do that either, right? Not all generally-intelligent entities are scientific geniuses. Thus, this is a challenge that a “weak” AGI would not be able to meet, only a genius-level/superintelligent AGI — i. e., only an AGI that’s already an extinction threat.
In theory, there should be a choice of (T1;T2] that sits between the two extremes: a set of discoveries that can’t be made by interpolation, but also don’t require dangerous levels of genius.
But how exactly are we supposed to figure out what the right interval is? (I suppose it may not be an unsolvable problem, and I’m open to ideas, but skeptical on priors.)
Okay, this mostly makes sense now. (I still disagree but it no longer seems internally inconsistent.)
Fwiw, I feel like if I had your model, I’d be interested in:
Producing tests for general intelligence. It really feels like there should be something to do here, that at least gives you significant Bayesian evidence. For example, filter the training data to remove anything talking about <some scientific field, e.g. complexity theory>, then see whether the resulting AI system can invent that field from scratch if you point it at the problems that motivated the development of the field. (A rough sketch of this is below, after the list.)
Identifying “dangerous” changes to architectures, e.g. inference time reprogramming. Maybe we can simply avoid these architectures and stick with things that are more like LLMs.
Hardening the world against mildly-superintelligent AI systems, so that you can study them / iterate on them more safely. (Incidentally, I don’t buy the argument that mildly-superintelligent AI systems could clearly defeat us all. It’s not at all clear to me that once you have a mildly-superintelligent AI system you’ll have a billion mildly-superintelligent-AI-years worth of compute to run them.)
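On that first item, here is a rough sketch of the shape of the test I have in mind. The keyword list, the training call, and the grading step are all hypothetical placeholders; grading (judging whether the model actually reinvented the field’s core abstractions, rather than leaning on adjacent material that survived the filter) is the genuinely hard, unspecified part:

```python
import re

# Minimal sketch of the "remove a field from the training data, then see whether
# the model reinvents it" test from the first bullet. The keyword list, the
# training call, and the grading step are hypothetical placeholders.

FIELD_KEYWORDS = ["complexity theory", "NP-complete", "polynomial hierarchy"]

def filter_corpus(documents: list[str], keywords: list[str]) -> list[str]:
    """Drop every document that mentions the target field."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [doc for doc in documents if not pattern.search(doc)]

def train(documents: list[str]):
    """Stand-in for a pretraining run on the filtered corpus."""
    raise NotImplementedError("placeholder for an actual training pipeline")

def grade_reinvention(answers: list[str]) -> bool:
    """Stand-in for judging (probably via expert review) whether the answers
    rederive the field's core abstractions."""
    raise NotImplementedError("placeholder for the hard part: grading")

def run_test(documents: list[str], motivating_problems: list[str]) -> bool:
    model = train(filter_corpus(documents, FIELD_KEYWORDS))
    answers = [model.generate(problem) for problem in motivating_problems]
    return grade_reinvention(answers)
```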
I agree that those are useful pursuits.
Mind gesturing at your disagreements? Not necessarily to argue them, just interested in the viewpoint.
Oh, I disagree with your core thesis that the general intelligence property is binary. (Which then translates into disagreements throughout the rest of the post.) But experience has taught me that this disagreement tends to be pretty intractable to talk through, and so I now try just to understand the position I don’t agree with, so that I can notice if its predictions start coming true.
You mention universality, active adaptability and goal-directedness. I do think universality is binary, but I expect there are fairly continuous trends in some underlying latent variables (e.g. “complexity and generality of the learned heuristics”), and “becoming universal” occurs when these fairly continuous trends exceed some threshold. For similar reasons I think active adaptability and goal-directedness will likely increase continuously, rather than being binary.
You might think that since I agree universality is binary that alone is enough to drive agreement with other points, but:
I don’t expect a discontinuous jump at the point you hit the universality property (because of the continuous trends), and I think it’s plausible that current LLMs already have the capabilities to be “universal”. I’m sure this depends on how you operationalize universality; I haven’t thought about it carefully.
I don’t think that the problems significantly change character after you pass the universality threshold, and so I think you are able to iterate prior to passing it.
Interesting, thanks.
I don’t expect a discontinuous jump at the point you hit the universality property
Agreed that this point (universality leads to discontinuity) probably needs to be hashed out more. Roughly, my view is that universality allows the system to become self-sustaining. Prior to universality, it can’t autonomously adapt to novel environments (including abstract environments, e. g. new fields of science). Its heuristics have to be refined by some external ground-truth signals, like trial-and-error experimentation or model-based policy gradients. But once the system can construct and work with self-made abstract objects, it can autonomously build chains of them — and that causes a shift in the architecture and internal dynamics, because now its primary method of cognition is iterating on self-derived abstraction chains, instead of using hard-coded heuristics/modules.
I agree that there’s a threshold for “can meaningfully build and chain novel abstractions” and this can lead to a positive feedback loop that was not previously present, but there will already be lots of positive feedback loops (such as “AI research → better AI → better assistance for human researchers → AI research”) and it’s not clear why to expect the new feedback loop to be much more powerful than the existing ones.
(Aside: we’re now talking about a discontinuity in the gradient of capabilities rather than of capabilities themselves, but sufficiently large discontinuities in the gradient of capabilities have much of the same implications.)
it’s not clear why to expect the new feedback loop to be much more powerful than the existing ones
Yeah, the argument here would rely on the assumption that e. g. the extant scientific data already uniquely constrains some novel laws of physics/engineering paradigms/psychological manipulation techniques/etc., and we would eventually be able to figure them out even if science froze right this moment. In this case, the new feedback loop would be faster because superintelligent cognition would be faster than real-life experiments.
And I think there’s a decent amount of evidence for this. Consider that there are already narrow AIs that can solve protein folding more efficiently than our best manually-derived algorithms — which suggests that better algorithms are already uniquely constrained by the extant data, and we’ve just been unable to find them. The same may be true for all other domains of science — and thus, a superintelligence iterating on its own cognition would be able to outpace human science.