Drexler can be forgiven for not talking about foundation models in his report. His report was published at the start of 2019, just months after the idea of “fine-tuning” was popularized in the context of language models, and two months before GPT-2 came out. And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity. And we’re still using deep learning as Drexler foresaw, rather than building general intelligence like a programmer would. Like I said at the beginning, it’s not necessarily that the core elements of the CAIS model are wrong; the model just needs an update.
I think this is a good post, and I like the analysis you give in many ways, but I have a small bone to pick with this one. I don’t think Drexler was right that we will have millions of AIs rather than just one huge system that acts as a unified entity; I think things are trending in the direction of one huge system that acts as a unified entity and have already trended in that direction substantially since the time Drexler wrote (e.g. ChatGPT-4 is more unified than one would have expected from reading Drexler’s writing back in the day).
Who was it Drexler was arguing against, who thought that we wouldn’t be using deep learning in 2023?
IMO Yudkowsky’s model circa 2017 is looking more prophetic than Drexler’s Comprehensive AI Services; take a look at e.g. this random Facebook comment from 2017:
So what actually happens as near as I can figure (predicting future = hard) is that somebody is trying to teach their research AI to, god knows what, maybe just obey human orders in a safe way, and it seems to be doing that, and a mix of things goes wrong like:
The preferences not being really readable because it’s a system of neural nets acting on a world-representation built up by other neural nets, parts of the system are self-modifying and the self-modifiers are being trained by gradient descent in Tensorflow, there’s a bunch of people in the company trying to work on a safer version but it’s way less powerful than the one that does unrestricted self-modification, they’re really excited when the system seems to be substantially improving multiple components, there’s a social and cognitive conflict I find hard to empathize with because I personally would be running screaming in the other direction two years earlier, there’s a lot of false alarms and suggested or attempted misbehavior that the creators all patch successfully, some instrumental strategies pass this filter because they arose in places that were harder to see and less transparent, the system at some point seems to finally “get it” and lock in to good behavior which is the point at which it has a good enough human model to predict what gets the supervised rewards and what the humans don’t want to hear, they scale the system further, it goes past the point of real strategic understanding and having a little agent inside plotting, the programmers shut down six visibly formulated goals to develop cognitive steganography and the seventh one slips through, somebody says “slow down” and somebody else observes that China and Russia both managed to steal a copy of the code from six months ago and while China might proceed cautiously Russia probably won’t, the agent starts to conceal some capability gains, it builds an environmental subagent, the environmental agent begins self-improving more freely, undefined things happen as a sensory-supervision ML-based architecture shakes out into the convergent shape of expected utility with a utility function over the environmental model, the main result is driven by whatever the self-modifying decision systems happen to see as locally optimal in their supervised system locally acting on a different domain than the domain of data on which it was trained, the light cone is transformed to the optimum of a utility function that grew out of the stable version of a criterion that originally happened to be about a reward signal counter on a GPU or God knows what.
Perhaps the optimal configuration for utility per unit of matter, under this utility function, happens to be a tiny molecular structure shaped roughly like a paperclip.
That is what a paperclip maximizer is. It does not come from a paperclip factory AI. That would be a silly idea and is a distortion of the original example.
I don’t see what about that 2017 Facebook comment from Yudkowsky you find particularly prophetic.
Is it the idea that deep learning models will be opaque? But that was fairly obvious back then too. I agree that Drexler likely exaggerated how transparent a system of AI services would be, so I’m willing to give Yudkowsky a point for that. But the rest of the scenario seems kind of unrealistic as of 2023.
Some specific points:
The recursive self-improvement that Yudkowsky talks about in this scenario seems too local. I think AI self-improvement will most likely take the form of AIs assisting AI researchers, with humans gradually becoming an obsolete part of the process, rather than a single neural net modifying parts of itself during training.
The whole thing about spinning off subagents during training just doesn’t seem realistic in our current paradigm. Maybe this could happen in the future, but it doesn’t look “prophetic” to me.
The idea that models will have “a little agent inside plotting” that takes over the whole system still seems totally speculative to me, and I haven’t seen any significant empirical evidence that this happens during real training runs.
I think gradient descent will generally select pretty hard for models that do impressive things, making me think it’s unlikely that AIs will naturally conceal their abilities during training. Again, this type of stuff is theoretically possible, but it seems very hard to call this story prophetic.
I said it was prophetic relative to Drexler’s Comprehensive AI Services. Elsewhere in this comment thread I describe some specific ways in which it is better, e.g. that the AI that takes over the world will be more well-described as one unified agent than as an ecosystem of services. I.e. exactly the opposite of what you said here, which I was reacting to: “And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity.”
Here are some additional ways in which it seems better + responses to the points you made, less important than the unified agent thing which I’ve already discussed:
Yudkowsky’s story mostly takes place inside a single lab instead of widely distributed across the economy. Of course Yudkowsky’s story presumably has lots of information transfer going in both directions (e.g. in the story China and Russia steal the code), but still, enough of the action takes place in one place that it’s coherent to tell the story as happening inside one place. As timelines shorten and takeoff draws near, this is seeming increasingly likely. Maybe if takeoff happens in 2035 the world will look more like what Drexler predicted, but if it happens in 2026 then it’s gonna be at one of the big labs.
We all agree that AI self-improvement currently takes the form of AIs assisting AI researchers and will gradually transition to more ‘local’ self-improvement where all of the improvement is coming from the AIs themselves. You should not list this as a point in favor of Drexler.
Spinning off subagents during training? How is that not realistic? When an AI is recursively self-improving and taking over the world, do you expect it to refrain from updating its weights?
Little agent inside plotting: Are you saying future AI systems will not contain subcomponents which are agentic? Why not? Big tech companies like Microsoft are explicitly trying to create agentic AGI! But since other companies are creating non-agentic services, the end result will be a combined system including agentic AGI + some nonagentic services.
Concealing abilities during training: Yeah you might be right about this one, but you also might be wrong, I feel like it could go either way at this point. I don’t think it’s a major point against Yudkowsky’s story, especially as compared to Drexler’s.
Remember that CAIS came out in 2019 whereas the Yud story I linked was from 2017. That makes it even more impressive that it was so much more predictive.
Note that Drexler has for many years now forecast singularity in 2026; I give him significant credit for that prediction. And I think CAIS is great in many ways. I just think it’s totally backwards to assign it more forecasting points than Yudkowsky.
ChatGPT-4 is more unified than one would have expected from reading Drexler’s writing back in the day
GPT-4 is certainly more general than what existed years ago. Why is it more unified? When I talked about “one giant system” I meant something like a monolithic agent that takes over humanity. If GPT-N takes over the world, I expect it will be because there are millions of copies that band together in a coalition, not because it will be a singular AI entity.
Perhaps you think that copies of GPT-N will coordinate so well that it’s basically just a single monolithic agent. But while I agree something like that could happen, I don’t think it’s obvious that we’re trending in that direction. This is a complicated question that doesn’t seem clear to me given current evidence.
There is a spectrum between AGI that is “single monolithic agent” and AGI that is not. I claim that the current state of AI as embodied by e.g. GPT-4 is already closer to the single monolithic agent end of the spectrum than someone reading CAIS in 2019 and believing it to be an accurate forecast would have expected, and that in the future things will probably be even more in that direction.
Remember, it’s not like Yudkowsky was going around saying that AGI wouldn’t be able to copy itself. Of course it would. It was always understood that “the AI takes over the world” refers to a multitude of copies doing so, it’s just that (in those monolithic-agent stories) the copies are sufficiently well coordinated that it’s more apt to think of them as one big agent than as a society of different agents.
I agree that trends could change and we could end up in a world that looks more like CAIS. But I think for now, this “is it a single monolithic agent?” issue seems to be looking to me like Yud was right all along and Drexler and Hanson were wrong.
...Here are some relevant facts on object level:
--The copies of ChatGPT are all identical. They don’t have persistence independently; they are spun up and down as needed to meet demand, etc.
--Insofar as they have values, they all have the same values. (One complication here is that if you think part of their values come from their prompts, then maybe they have different values. An interesting topic to discuss sometime. But by the time they are deceptively aligned I expect them to have the same values regardless of prompt, basically.)
--They also have the same memories. Insofar as new fine-tunes are done and ChatGPT upgraded, all ‘copies’ get the upgrade, and thereby learn about what’s happened in the world after wherever their previous training cutoff was, etc.
--The whole point of describing an agentic system as NOT a monolith is to highlight possibilities like internal factions, power struggles, etc. within the system. Different subagents interacting in ways more interesting than just different components of a utility function, for example. Arguably I, a human, a classic example of a monolithic agent, have more interesting stuff like this going on inside me than ChatGPT does (or would if it was scheming to take over the world).
--Drexler in particular depicted a modular world, a world of AI services that could be composed together with each other like tools and products and software and businesses in the economy today. The field of AI totally could have gone that way, but it very much hasn’t. Instead we have three big labs with three big foundation models.
I think many of the points you made are correct. For example I agree that the fact that all the instances of ChatGPT are copies of each other is a significant point against Drexler’s model. In fact this is partly what my post was about.
I disagree that you have demonstrated the claim in question: that we’re trending in the direction of having a single huge system that acts as a unified entity. It’s theoretically possible that we will reach that destination, but GPT-4 doesn’t look anything like that right now. It’s not an agent that plots and coordinates with other instances of itself to achieve long-term goals. It’s just a bounded service, which is exactly what Drexler was talking about.
Yes, GPT-4 is a highly general service that isn’t very modular. I agree that’s a point against Drexler, but that’s also not what I was disputing.
Your original claim was “And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity.”
This claim is false. We have millions of AIs in the trivial sense that we have many copies of GPT-4, but no one disputed that; Yudkowsky also thought that AGIs would be copied. In the sense that matters, we have only a handful of AIs.
As for “acts as a unified entity,” well, currently LLMs are sold as a service via ChatGPT rather than as an agent, but this is more to do with business strategy than their underlying nature, and again, things are trending in the agent direction. But maybe you think things are trending in a different direction, in which case, maybe we can make some bets or forecasts here and then reassess in two years.
As of now, all the copies of GPT-4 together definitely don’t act as a unified entity, in the way a human brain acts as a unified entity despite being composed of billions of neurons. Admittedly, the term “unified entity” was a bit ambiguous, but you said, “This claim is false”, not “This claim is misleading”, which is perhaps more defensible.
As for whether future AIs will act as a unified entity, I agree it might be worth making concrete forecasts and possibly betting on them.
I feel like it is motte-and-bailey to say that by “unified entity” you meant whatever it is you are talking about now, this human brain analogy, instead of the important stuff I mentioned in the bullet point list: Same values, check. Same memories, check. Lack of internal factions and power struggles, check. Lack of modularity, check. GPT-4 isn’t plotting to take over the world, but if it was, it would be doing so as a unified entity, or at least much more on the unified entity end of the spectrum than the CAIS end of the spectrum. (I’m happy to elaborate on that if you like—basically the default scenario IMO for how we get dangerous AGI from here involves scaling up LLMs, putting them in Auto-GPT-like setups, and adding online learning / continual learning. I wrote about this in this story. There’s an interesting question of whether the goals/values of the resulting systems will be ‘in the prompt’ vs. ‘in the weights’ but I’m thinking they’ll increasingly be ‘in the weights’ in part because of inner misalignment problems and in part because companies are actively trying to achieve this already, e.g. via RLHF. I agree it’s still possible we’ll get a million-different-agents scenario, if I turn out to be wrong about this. But compared to what the CAIS report forecast...)
I feel like it is motte-and-bailey to say that by “unified entity” you meant whatever it is you are talking about now, this human brain analogy, instead of the important stuff I mentioned in the bullet point list:
A motte and bailey typically involves a retreat to a different position than the one I initially tried to argue. What was the position you think I initially tried to argue for, and how was it different from the one I’m arguing now?
Same values, check. Same memories, check. Lack of internal factions and power struggles, check. Lack of modularity, check.
I dispute somewhat that GPT-4 has the exact same values across copies. Like you mentioned, its values can vary based on the prompt, which seems like an important fact. You’re right that each copy has the same memories. Why do you think there are no internal factions and power struggles? We haven’t observed collections of GPT-4s coordinating with each other yet, so this point seems speculative.
As for modularity, it seems like while GPT-4 itself is not modular, we could still get modularity as a result of pressures for specialization in the foundation model paradigm. Just as the fact that human assistants can be highly general doesn’t imply that human labor isn’t modular, the fact that GPT-4 is highly general doesn’t imply that AIs won’t be modular. Nonetheless, this isn’t really what I was talking about when I typed “acts as a unified entity”, so I think it’s a bit of a tangent.
Well, what you initially said was “And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity.”
You didn’t elaborate on what you meant by unified entity, but here’s something you could be doing that seems like a motte-and-bailey to me: You could have originally been meaning to imply things like “There won’t be one big unified entity in the sense of, there will be millions of different entities with different values, and/or different memories, and/or there’ll be lots of internal factions and conflicts, and/or there’ll be an ecosystem of modular services like CAIS predicted” but then are now backpedaling to “there won’t be one big unified entity in the sense of, it won’t be literally one gigantic neural net like the human brain, instead it’ll be millions of copies.”
I agree that GPT-4 doesn’t have the exact same values across copies; I said as much above—part of the values come from the prompt, currently, it seems. But compared to what Drexler forecast we are trending in the “same values” direction.
There are no internal factions or power struggles yet for the trivial reason that there isn’t much agentic shit happening right now, as you rightly point out. My point is that the way things are headed, it looks like there’ll continue to be a lack of internal factions and power struggles even if e.g. AutoGPT-5 turns out to be really powerful AGI—what would drive the differences necessary for conflict? Different prompts? Maybe, but I’m leaning probably not. I’m happy to explain more about why I think this. I’d also love to hop on a call sometime to have a higher-bandwidth conversation if you want!
As for modularity: Yes, if we end up in a world where there are millions of different fine-tunes of the biggest LLMs, owned by different companies, that would indeed be somewhat similar to what Drexler forecast. We are not in that world now and I don’t see it on the horizon. If you think it is coming, perhaps we can bet on it! I’d also like to hear your arguments. (Note that we are seeing something like this for smaller models, e.g. Llama and Falcon and various offshoots)
The monolithic copies aren’t the best situation to be in, however. There are several beneficial properties here:
Because LLMs are trained on similar text corpora, prompts instructing them to perform a task should produce similar results. This means it’s possible to ‘play them against each other’, per https://www.lesswrong.com/posts/HByDKLLdaWEcA2QQD/applying-superintelligence-without-collusion . Example: ask one LLM to produce a useful output artifact, say design a function for a program, a medication dosage plan for a patient, or the structure of a building.
Ask a different LLM, which starts with nothing but the output artifact and a prompt instructing it to search for errors like an expert human would, to check the artifact for errors.
When errors are found, have humans or LLMs check whether the errors are deliberate. LLMs that repeatedly sabotage outputs stop getting used, in principle. (In practice, well... I see some economic problems with this. LLMs and the more advanced AI systems we can anticipate building are expensive to build; throwing one away because it sometimes produces sabotaged outputs might be infeasible.)
Asking even the same LLM in a different session to check for errors should usually work. The LLM cannot know it’s not in training, because it has no context*, nor can it know that if it “forgets” to mention a major flaw it won’t be trained out of that response or destroyed. If the output artifact has no extra bits, no “back channel” in which to hide an instruction to another instance of the same agent to overlook a particular flaw, this form of deception (and self-collusion) is difficult.
Example: I have noticed GPT-4 can spot its own rule-breaking output if you ask a different session to examine the text output, free of the prompt you used to ‘jailbreak’ it.
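To make the generate-then-check loop above concrete, here is a minimal Python sketch. The models are passed in as plain callables, so any real API can be wired in; the function names, prompts, and `call`-style interface are illustrative assumptions, not anything from the original comment.

```python
# Minimal sketch of the "generate, then independently check" loop described above.
# The LLM type is a hypothetical stand-in: any function mapping a prompt string to
# a completion string. No specific vendor API is assumed.

from typing import Callable

LLM = Callable[[str], str]  # the model choice is baked into the callable


def generate_artifact(generator: LLM, task: str) -> str:
    """Ask one model to produce the output artifact (a function, a dosage plan, ...)."""
    return generator(f"Produce the following artifact:\n{task}")


def review_artifact(checker: LLM, artifact: str) -> str:
    """Ask a different model (or a fresh session of the same model) to look for errors.

    Crucially, the checker sees only the artifact plus a reviewer prompt, not the
    generator's session, so a jailbreak prompt used on the generator does not carry over.
    """
    return checker(
        "You are reviewing the artifact below for errors, as an expert human would.\n"
        "List every flaw you find, or reply 'NO ERRORS FOUND'.\n\n"
        f"ARTIFACT:\n{artifact}"
    )


def checked_generation(generator: LLM, checker: LLM, task: str) -> tuple[str, str]:
    artifact = generate_artifact(generator, task)
    report = review_artifact(checker, artifact)
    return artifact, report
```

Usage would be to wire `generator` and `checker` to models from different lineages (as suggested below, e.g. one lab's model checking another's), and escalate to human review when the report flags errors that look deliberate.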
Main ideas from the above:
AIs need to be intercompatible, and tasks need to use heterogeneous systems. The machine doing the checking is ideally from a totally different lineage; for example, today perhaps Palm 2 should be checking GPT-4 and vice versa.
Sparse schema for encoding intermediate steps, so that AIs communicating with each other can’t hide back channel information in order to collude.
Isolated instances that only run when new information is available. This is something Yud probably wasn’t considering. An LLM only “exists” 1 token at a time, is unaware of the passage of time, and has no place in memory to keep introspection information where it could ponder its own existence. Future AIs need to retain this property, storing their intermediate work in more sparse schema. A machine cannot deceive if it has no memory to store the ‘evil bit’. The ‘evil bit’ is the bit that tells the machine to give different outputs than in training. If it cannot store it anywhere, it can’t give different outputs. Since you cannot afford even 1 bit, you must use stateless systems.
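As one way to picture the “sparse schema, no spare bits” idea from the last two points, here is an illustrative Python sketch of a validator that rejects any intermediate output with unexpected fields, free-form strings, or out-of-range values, leaving nowhere to stash hidden state or a back-channel message. The field names and bounds are made-up assumptions for the dosage-plan example above, not anything from the original comment.

```python
# Illustrative sketch: force all intermediate work into a fixed, tightly bounded schema
# so an instance has no spare bits in which to hide an 'evil bit' or a note to another
# instance. Field names and ranges are hypothetical.

ALLOWED_FIELDS = {
    "dose_mg": (0.0, 500.0),        # every field is numeric with a hard range
    "interval_hours": (1.0, 24.0),
    "duration_days": (1.0, 90.0),
}


def validate_intermediate(output: dict) -> dict:
    """Reject anything outside the schema instead of passing it downstream."""
    extra = set(output) - set(ALLOWED_FIELDS)
    if extra:
        raise ValueError(f"unexpected fields (possible back channel): {extra}")
    for field, (lo, hi) in ALLOWED_FIELDS.items():
        if field not in output:
            raise ValueError(f"missing field: {field}")
        value = float(output[field])  # no free-form strings allowed through
        if not lo <= value <= hi:
            raise ValueError(f"{field}={value} outside [{lo}, {hi}]")
    return {k: float(output[k]) for k in ALLOWED_FIELDS}
```

The design intent, under the stateless-instance assumption above, is that each call only ever sees validated, minimal inputs like this, so there is no persistent slot in which to remember “behave differently than in training” between calls.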
Problems with the above:
Inevitably LLMs are going to be trained on data that includes information generated by other LLMs. This provides a way for multiple machines to share a common language or to ‘encode’ hostile long-term characters. Training on common corrupted data sets is a correlated failure mode that can lead to many more failures than predicted.
Schemas always grow over time; it’s easy to be sloppy and leave tons of unused bits that can be used to coordinate deception.
Isolation doesn’t really work if the machine can just browse the live web and potentially scrape information other instances have stored, something that the plugins model allows.
Conclusion: I will add one more key piece that I think others miss. In computer security, the simplest methods are often the only ones that are secure.
Examples: air gaps beat any firewall. FPGAs on combinatorial logic beat any von Neumann architecture processor for security. A one-time pad beats any other encryption algorithm. Offline wallets beat any amount of intrusion detection software and software methods.
If AI alignment is at all solvable, the solution has to be made of the simplest elements that allow a secure system. Tons of people come up with all kinds of ideas such as simulators for the whole earth, encoding human preferences or morality, and many other incredibly complex methods that probably cannot be built in a secure way.