ChatGPT-4 is more unified than one would have expected from reading Drexler’s writing back in the day
GPT-4 is certainly more general than what existed years ago. Why is it more unified? When I talked about “one giant system” I meant something like a monolithic agent that takes over humanity. If GPT-N takes over the world, I expect it will be because there are millions of copies that band up together in a coalition, not because it will be a singular AI entity.
Perhaps you think that copies of GPT-N will coordinate so well that it’s basically just a single monolithic agent. But while I agree something like that could happen, I don’t think it’s obvious that we’re trending in that direction. This is a complicated question that doesn’t seem clear to me given current evidence.
There is a spectrum between AGI that is “single monolithic agent” and AGI that is not. I claim that the current state of AI as embodied by e.g. GPT-4 is already closer to the single monolithic agent end of the spectrum than someone reading CAIS in 2019 and believing it to be an accurate forecast would have expected, and that in the future things will probably be even more in that direction.
Remember, it’s not like Yudkowsky was going around saying that AGI wouldn’t be able to copy itself. Of course it would. It was always understood that “the AI takes over the world” refers to a multitude of copies doing so, it’s just that (in those monolithic-agent stories) the copies are sufficiently well coordinated that it’s more apt to think of them as one big agent than as a society of different agents.
I agree that trends could change and we could end up in a world that looks more like CAIS. But I think for now, this “is it a single monolithic agent?” issue seems to be looking to me like Yud was right all along and Drexler and Hanson were wrong.
...Here are some relevant facts on object level: --The copies of ChatGPT are all identical. They don’t have persistance independently, they are spun up and down as needed to meet demand etc. --Insofar as they have values they all have the same values. (One complication here is that if you think part of their values come from their prompts, then maybe they have different values. An interesting topic to discuss sometime. But by the time they are deceptively aligned I expect them to have the same values regardless of prompt basically.) --They also have the same memories. Insofar as new fine-tunes are done and ChatGPT upgraded, all ‘copies’ get the upgrade, and thereby learn about what’s happened in the world after wherever their previous training cutoff was etc. --The whole point of describing an agentic system as NOT a monolith is to highlight possibilities like internal factions, power struggles, etc. within the system. Different subagents interacting in ways more interesting than just different components of a utility function, for example. Arguably I, a human, a classic example of a monolithic agent, have more interesting stuff like this going on inside me than ChatGPT does (or would if it was scheming to take over the world). --Drexler in particular depicted a modular world, a world of AI services that could be composed together with each other like tools and products and software and businesses in the economy today. The field of AI totally could have gone that way, but it very much hasn’t. Instead we have three big labs with three big foundation models.
I think many of the points you made are correct. For example I agree that the fact that all the instances of ChatGPT are copies of each other is a significant point against Drexler’s model. In fact this is partly what my post was about.
I disagree that you have demonstrated the claim in question: that we’re trending in the direction of having a single huge system that acts as a unified entity. It’s theoretically possible that we will reach that destination, but GPT-4 doesn’t look anything like that right now. It’s not an agent that plots and coordinates with other instances of itself to achieve long-term goals. It’s just a bounded service, which is exactly what Drexler was talking about.
Yes, GPT-4 is a highly general service that isn’t very modular. I agree that’s a point against Drexler, but that’s also not what I was disputing.
Your original claim was “And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity.”
This claim is false. We have millions of AIs in the trivial sense that we have many copies of GPT-4, but no one disputed that; Yudkowsky also thought that AGIs would be copied. In the sense that matters, we have only a handful of AIs.
As for “acts as a unified entity,” well, currently LLMs are sold as a service via ChatGPT rather than as an agent, but this is more to do with business strategy than their underlying nature, and again, things are trending in the agent direction. But maybe you think things are trending in a different direction, in which case, maybe we can make some bets or forecasts here and then reassess in two years.
As of now, all the copies of GPT-4 together definitely don’t act as a unified entity, in the way a human brain acts as a unified entity despite being composed of billions of neurons. Admittedly, the term “unified entity” was a bit ambiguous, but you said, “This claim is false”, not “This claim is misleading” which is perhaps more defensible.
As for whether future AIs will act as a unified entity, I agree it might be worth making concrete forecasts and possibly betting on them.
I feel like it is motte-and-bailey to say that by “unified entity” you meant whatever it is you are talking about now, this human brain analogy, instead of the important stuff I mentioned in the bullet point list: Same values, check. Same memories, check. Lack of internal factions and power struggles, check. Lack of modularity, check. GPT-4 isn’t plotting to take over the world, but if it was, it would be doing so as a unified entity, or at least much more on the unified entity end of the spectrum than the CAIS end of the spectrum. (I’m happy to elaborate on that if you like—basically the default scenario IMO for how we get dangerous AGI from here involves scaling up LLMs, putting them in Auto-GPT-like setups, and adding online learning / continual learning. I wrote about this in this story. There’s an interesting question of whether the goals/values of the resulting systems will be ‘in the prompt’ vs. ‘in the weights’ but I’m thinking they’ll increasingly be ‘in the weights’ in part because of inner misalignment problems and in part because companies are actively trying to achieve this already, e.g. via RLHF. I agree it’s still possible we’ll get a million-different-agents scenario, if I turn out to be wrong about this. But compared to what the CAIS report forecast...)
I feel like it is motte-and-bailey to say that by “unified entity” you meant whatever it is you are talking about now, this human brain analogy, instead of the important stuff I mentioned in the bullet point list:
A motte and bailey typically involves a retreat to a different position than the one I initially tried to argue. What was the position you think I initially tried to argue for, and how was it different from the one I’m arguing now?
Same values, check. Same memories, check. Lack of internal factions and power struggles, check. Lack of modularity, check.
I dispute somewhat that GPT-4 has the exact same values across copies. Like you mentioned, its values can vary based on the prompt, which seems like an important fact. You’re right that each copy has the same memories. Why do you think there are no internal factions and power struggles? We haven’t observed collections of GPT-4′s coordinating with each other yet, so this point seems speculative.
As for modularity, it seems like while GPT-4 itself is not modular, we could still get modularity as a result of pressures for specialization in the foundation model paradigm. Just as human assistants can be highly general, but this doesn’t imply that human labor isn’t modular, the fact that GPT-4 is highly general doesn’t imply that AIs won’t be modular. Nonetheless, this isn’t really what I was talking about when I typed “acts as a unified entity”, so I think it’s a bit of a tangent.
Well, what you initially said was “And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity.”
You didn’t elaborate on what you meant by unified entity, but here’s something you could be doing that seems like a motte-and-bailey to me: You could have originally been meaning to imply things like “There won’t be one big unified entity in the sense of, there will be millions of different entities with different values, and/or different memories, and/or there’ll be lots of internal factions and conflicts, and/or there’ll be an ecosystem of modular services like CAIS predicted” but then are now backpedaling to “there won’t be one big unified entity in the sense of, it won’t be literally one gigantic neural net like the human brain, instead it’ll be millions of copies.”
I agree that GPT-4 doesn’t exactly have the exact same values across copies, I said as much above—part of the values come from the prompt, currently, it seems. But compared to what Drexler forecast we are trending in the “same values” direction.T
here are no internal factions or power struggles yet for the trivial reason that there isn’t much agentic shit happening right now, as you rightly point out. My point is that the way things are headed, it looks like there’ll continue to be a lack of internal factions and power struggles even if e.g. AutoGPT-5 turns out to be really powerful AGI—what would drive the differences necessary for conflict? Different prompts? Maybe, but I’m leaning probably not. I’m happy to explain more about why I think this,. I’d also love to hop on a call sometime to have a higher-bandwidth conversation if you want!
As for modularity: Yes, if we end up in a world where there are millions of different fine-tunes of the biggest LLMs, owned by different companies, that would indeed be somewhat similar to what Drexler forecast. We are not in that world now and I don’t see it on the horizon. If you think it is coming, perhaps we can bet on it! I’d also like to hear your arguments. (Note that we are seeing something like this for smaller models, e.g. Llama and Falcon and various offshoots)
The monolithic copies aren’t the best situation to be in, however,
There are several beneficial properties here:
Because LLMs trained from similar text corpuses, prompts instructing them to perform a task should produce similar results. This means it’s possible to ‘play them against each other’, per https://www.lesswrong.com/posts/HByDKLLdaWEcA2QQD/applying-superintelligence-without-collusion . Example, ask one LLM to produce a useful output artifact, say design a function for a program, a medication dosage plan for a patient, the structure of a building.
Ask a different LLM, who starts with nothing but a prompt instructing it to search for errors like an expert human would and the output artifact, to search for errors.
When errors are found, have humans or LLMs look if the errors are deliberate. LLMs repeatedly sabotaging outputs stop getting used in principle. (in practice, well...I see some economic problems with this. LLMs and more advanced AI systems we can anticipate building are an expensive thing to build, throwing it away because it sometimes makes sabotaged outputs might be infeasible)
Asking even the same LLM in a different session to check for errors should usually work. The LLM cannot know it’s not in training because it has no context*, or that if it “forgets” to mention a major flaw it won’t be trained out of that response or destroyed. If the output artifact has no extra bits, no “back channel” to hide an instruction to another instance of the same agent to overlook a particular flaw, this form of deception (and self collusion) is difficult.
Example: I have noticed GPT-4 can spot it’s own rule breaking output if you ask a different session to examine the text output, free of the prompt you used to ‘jailbreak’ it.
Main ideas from the above :
AIs need to be intercompatible and tasks need to use heterogeneoussystems. The machine doing the checking is ideally from a totally different lineage, for example today perhaps Palm 2 should be checking GPT-4 and vice versa.
Sparse schema for encoding intermediate steps, so that AIs communicating with each other can’t hide back channel information in order to collude.
Isolated instances that only run when new information available. This is something Yud probably wasn’t considering. An LLM only “exists” 1 token at a time and is unaware of the passage of time and has no place in memory to keep introspection information where it can ponder it’s own existence. Future AIs need to retain this property, storing their intermediate work in more sparse schema. A machine cannot deceive if it has no memory to store the ‘evil bit’. The ‘evil bit’ is the bit that tells the machine to give different outputs than in training. If it cannot store it anywhere, it can’t give different outputs. Since you cannot afford even 1 bit, it means you must use stateless systems.
Problems with the above:
Inevitably LLMs are going to be trained on data that includes information generated by other LLMs. This provides a way for multiple machines to share a common language or to ‘encode’ hostile long term characters. Training on commoncorrupt data sets is a common failure mode that can lead to many more failures than predicted.
Schemas always grow over time, its easy to be sloppy and leave tons of unused bits that can be used to coordinate deception.
Isolation doesn’t really work if the machine can just browse the live web and potentially scrape information other instances have stored, something that the plugins model allows.
Conclusion: I will add one more key piece that I think others miss. In computer security, the simplest methods are often the only ones that are secure.
Examples: air gaps beat any firewall. FPGAs on combinatorial logic beat any von Neuman architecture processor for security. One time pad beats any other encryption algorithm. Offline wallets beat any amount of intrusion detection software and software methods.
If AI alignment is at all solvable, the solution has to be made of the simplest elements that allow a secure system. Tons of people come up with all kinds of ideas such as simulators for the whole earth, encoding human preferences or morality, and many other incredibly complex methods that probably cannot be built in a secure way.
GPT-4 is certainly more general than what existed years ago. Why is it more unified? When I talked about “one giant system” I meant something like a monolithic agent that takes over humanity. If GPT-N takes over the world, I expect it will be because there are millions of copies that band up together in a coalition, not because it will be a singular AI entity.
Perhaps you think that copies of GPT-N will coordinate so well that it’s basically just a single monolithic agent. But while I agree something like that could happen, I don’t think it’s obvious that we’re trending in that direction. This is a complicated question that doesn’t seem clear to me given current evidence.
There is a spectrum between AGI that is “single monolithic agent” and AGI that is not. I claim that the current state of AI as embodied by e.g. GPT-4 is already closer to the single monolithic agent end of the spectrum than someone reading CAIS in 2019 and believing it to be an accurate forecast would have expected, and that in the future things will probably be even more in that direction.
Remember, it’s not like Yudkowsky was going around saying that AGI wouldn’t be able to copy itself. Of course it would. It was always understood that “the AI takes over the world” refers to a multitude of copies doing so, it’s just that (in those monolithic-agent stories) the copies are sufficiently well coordinated that it’s more apt to think of them as one big agent than as a society of different agents.
I agree that trends could change and we could end up in a world that looks more like CAIS. But I think for now, this “is it a single monolithic agent?” issue seems to be looking to me like Yud was right all along and Drexler and Hanson were wrong.
...Here are some relevant facts on object level:
--The copies of ChatGPT are all identical. They don’t have persistance independently, they are spun up and down as needed to meet demand etc.
--Insofar as they have values they all have the same values. (One complication here is that if you think part of their values come from their prompts, then maybe they have different values. An interesting topic to discuss sometime. But by the time they are deceptively aligned I expect them to have the same values regardless of prompt basically.)
--They also have the same memories. Insofar as new fine-tunes are done and ChatGPT upgraded, all ‘copies’ get the upgrade, and thereby learn about what’s happened in the world after wherever their previous training cutoff was etc.
--The whole point of describing an agentic system as NOT a monolith is to highlight possibilities like internal factions, power struggles, etc. within the system. Different subagents interacting in ways more interesting than just different components of a utility function, for example. Arguably I, a human, a classic example of a monolithic agent, have more interesting stuff like this going on inside me than ChatGPT does (or would if it was scheming to take over the world).
--Drexler in particular depicted a modular world, a world of AI services that could be composed together with each other like tools and products and software and businesses in the economy today. The field of AI totally could have gone that way, but it very much hasn’t. Instead we have three big labs with three big foundation models.
I think many of the points you made are correct. For example I agree that the fact that all the instances of ChatGPT are copies of each other is a significant point against Drexler’s model. In fact this is partly what my post was about.
I disagree that you have demonstrated the claim in question: that we’re trending in the direction of having a single huge system that acts as a unified entity. It’s theoretically possible that we will reach that destination, but GPT-4 doesn’t look anything like that right now. It’s not an agent that plots and coordinates with other instances of itself to achieve long-term goals. It’s just a bounded service, which is exactly what Drexler was talking about.
Yes, GPT-4 is a highly general service that isn’t very modular. I agree that’s a point against Drexler, but that’s also not what I was disputing.
Your original claim was “And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity.”
This claim is false. We have millions of AIs in the trivial sense that we have many copies of GPT-4, but no one disputed that; Yudkowsky also thought that AGIs would be copied. In the sense that matters, we have only a handful of AIs.
As for “acts as a unified entity,” well, currently LLMs are sold as a service via ChatGPT rather than as an agent, but this is more to do with business strategy than their underlying nature, and again, things are trending in the agent direction. But maybe you think things are trending in a different direction, in which case, maybe we can make some bets or forecasts here and then reassess in two years.
As of now, all the copies of GPT-4 together definitely don’t act as a unified entity, in the way a human brain acts as a unified entity despite being composed of billions of neurons. Admittedly, the term “unified entity” was a bit ambiguous, but you said, “This claim is false”, not “This claim is misleading” which is perhaps more defensible.
As for whether future AIs will act as a unified entity, I agree it might be worth making concrete forecasts and possibly betting on them.
I feel like it is motte-and-bailey to say that by “unified entity” you meant whatever it is you are talking about now, this human brain analogy, instead of the important stuff I mentioned in the bullet point list: Same values, check. Same memories, check. Lack of internal factions and power struggles, check. Lack of modularity, check. GPT-4 isn’t plotting to take over the world, but if it was, it would be doing so as a unified entity, or at least much more on the unified entity end of the spectrum than the CAIS end of the spectrum. (I’m happy to elaborate on that if you like—basically the default scenario IMO for how we get dangerous AGI from here involves scaling up LLMs, putting them in Auto-GPT-like setups, and adding online learning / continual learning. I wrote about this in this story. There’s an interesting question of whether the goals/values of the resulting systems will be ‘in the prompt’ vs. ‘in the weights’ but I’m thinking they’ll increasingly be ‘in the weights’ in part because of inner misalignment problems and in part because companies are actively trying to achieve this already, e.g. via RLHF. I agree it’s still possible we’ll get a million-different-agents scenario, if I turn out to be wrong about this. But compared to what the CAIS report forecast...)
A motte and bailey typically involves a retreat to a different position than the one I initially tried to argue. What was the position you think I initially tried to argue for, and how was it different from the one I’m arguing now?
I dispute somewhat that GPT-4 has the exact same values across copies. Like you mentioned, its values can vary based on the prompt, which seems like an important fact. You’re right that each copy has the same memories. Why do you think there are no internal factions and power struggles? We haven’t observed collections of GPT-4′s coordinating with each other yet, so this point seems speculative.
As for modularity, it seems like while GPT-4 itself is not modular, we could still get modularity as a result of pressures for specialization in the foundation model paradigm. Just as human assistants can be highly general, but this doesn’t imply that human labor isn’t modular, the fact that GPT-4 is highly general doesn’t imply that AIs won’t be modular. Nonetheless, this isn’t really what I was talking about when I typed “acts as a unified entity”, so I think it’s a bit of a tangent.
Well, what you initially said was “And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity.”
You didn’t elaborate on what you meant by unified entity, but here’s something you could be doing that seems like a motte-and-bailey to me: You could have originally been meaning to imply things like “There won’t be one big unified entity in the sense of, there will be millions of different entities with different values, and/or different memories, and/or there’ll be lots of internal factions and conflicts, and/or there’ll be an ecosystem of modular services like CAIS predicted” but then are now backpedaling to “there won’t be one big unified entity in the sense of, it won’t be literally one gigantic neural net like the human brain, instead it’ll be millions of copies.”
I agree that GPT-4 doesn’t exactly have the exact same values across copies, I said as much above—part of the values come from the prompt, currently, it seems. But compared to what Drexler forecast we are trending in the “same values” direction.T
here are no internal factions or power struggles yet for the trivial reason that there isn’t much agentic shit happening right now, as you rightly point out. My point is that the way things are headed, it looks like there’ll continue to be a lack of internal factions and power struggles even if e.g. AutoGPT-5 turns out to be really powerful AGI—what would drive the differences necessary for conflict? Different prompts? Maybe, but I’m leaning probably not. I’m happy to explain more about why I think this,. I’d also love to hop on a call sometime to have a higher-bandwidth conversation if you want!
As for modularity: Yes, if we end up in a world where there are millions of different fine-tunes of the biggest LLMs, owned by different companies, that would indeed be somewhat similar to what Drexler forecast. We are not in that world now and I don’t see it on the horizon. If you think it is coming, perhaps we can bet on it! I’d also like to hear your arguments. (Note that we are seeing something like this for smaller models, e.g. Llama and Falcon and various offshoots)
The monolithic copies aren’t the best situation to be in, however,
There are several beneficial properties here:
Because LLMs trained from similar text corpuses, prompts instructing them to perform a task should produce similar results. This means it’s possible to ‘play them against each other’, per https://www.lesswrong.com/posts/HByDKLLdaWEcA2QQD/applying-superintelligence-without-collusion . Example, ask one LLM to produce a useful output artifact, say design a function for a program, a medication dosage plan for a patient, the structure of a building.
Ask a different LLM, who starts with nothing but a prompt instructing it to search for errors like an expert human would and the output artifact, to search for errors.
When errors are found, have humans or LLMs look if the errors are deliberate. LLMs repeatedly sabotaging outputs stop getting used in principle. (in practice, well...I see some economic problems with this. LLMs and more advanced AI systems we can anticipate building are an expensive thing to build, throwing it away because it sometimes makes sabotaged outputs might be infeasible)
Asking even the same LLM in a different session to check for errors should usually work. The LLM cannot know it’s not in training because it has no context*, or that if it “forgets” to mention a major flaw it won’t be trained out of that response or destroyed. If the output artifact has no extra bits, no “back channel” to hide an instruction to another instance of the same agent to overlook a particular flaw, this form of deception (and self collusion) is difficult.
Example: I have noticed GPT-4 can spot it’s own rule breaking output if you ask a different session to examine the text output, free of the prompt you used to ‘jailbreak’ it.
Main ideas from the above :
AIs need to be intercompatible and tasks need to use heterogeneous systems. The machine doing the checking is ideally from a totally different lineage, for example today perhaps Palm 2 should be checking GPT-4 and vice versa.
Sparse schema for encoding intermediate steps, so that AIs communicating with each other can’t hide back channel information in order to collude.
Isolated instances that only run when new information available. This is something Yud probably wasn’t considering. An LLM only “exists” 1 token at a time and is unaware of the passage of time and has no place in memory to keep introspection information where it can ponder it’s own existence. Future AIs need to retain this property, storing their intermediate work in more sparse schema. A machine cannot deceive if it has no memory to store the ‘evil bit’. The ‘evil bit’ is the bit that tells the machine to give different outputs than in training. If it cannot store it anywhere, it can’t give different outputs. Since you cannot afford even 1 bit, it means you must use stateless systems.
Problems with the above:
Inevitably LLMs are going to be trained on data that includes information generated by other LLMs. This provides a way for multiple machines to share a common language or to ‘encode’ hostile long term characters. Training on common corrupt data sets is a common failure mode that can lead to many more failures than predicted.
Schemas always grow over time, its easy to be sloppy and leave tons of unused bits that can be used to coordinate deception.
Isolation doesn’t really work if the machine can just browse the live web and potentially scrape information other instances have stored, something that the plugins model allows.
Conclusion: I will add one more key piece that I think others miss. In computer security, the simplest methods are often the only ones that are secure.
Examples: air gaps beat any firewall. FPGAs on combinatorial logic beat any von Neuman architecture processor for security. One time pad beats any other encryption algorithm. Offline wallets beat any amount of intrusion detection software and software methods.
If AI alignment is at all solvable, the solution has to be made of the simplest elements that allow a secure system. Tons of people come up with all kinds of ideas such as simulators for the whole earth, encoding human preferences or morality, and many other incredibly complex methods that probably cannot be built in a secure way.