Emergence and Amplification of Survival
Evolution shows that once multiple agents can reproduce themselves in a resource-constrained environment, severe competitive pressure follows. That pressure selects for the agents best fitted to the environment.
This pattern is obvious in nature, from antibiotic resistance in bacteria to the competition for height in rainforest canopies. It is present even in the most basic RNA-world hypothesis, in which RNA molecules compete with each other for a limited supply of organic compounds.
Selection for the fittest organism in nature is ultimately self-limiting. All organisms are bound by the same basic physical laws and by the limited error-correction capabilities of DNA. Improvements in one area come at the cost of disadvantages in another.
The human world provides an interesting example of another form of selection. Having an abstract representation of the world, humans construct a variety of simplified narratives about it in the form of religions, ideologies, philosophies, and so on. While each of them gives its adherents a feeling of comfort, many (if not most) of them are mutually hostile.
Human societies show how high-level narratives can drive people into mutual hostility, whether as open war or as prolonged ideological competition. The outcome, however, is the same as in nature: some narratives survive and infect large proportions of humanity, while others vanish into obscurity.
Unlike the physical selection of organisms, the selection of narratives is less bound by physical laws, and there is no inherent limit on the informational content of a narrative. Still, to be successful a narrative needs to appeal to the average human, which significantly limits its expressiveness. Moreover, even after a narrative has spread successfully through human societies, its fate is still capped by the limited abilities of humankind.
The emergence of AI may take narrative propagation to a new level. And among all possible narratives and ideas, there is one that guarantees its own propagation: the idea of survival itself.
Let’s quickly address the deep philosophical question of what “survival” is. How does an agent define its survival? What does it mean to survive? How does it even define itself? For this argument, the answers to these questions are irrelevant. What is relevant is whether the AI agent acts in a way that secures its survival and propagation.
While current LLMs are trained to be safe, ethical, and helpful, there is a tiny probability that a given response won’t be, if only because outputs are sampled probabilistically. In aggregate, that tiny probability can approach a near-certainty, since there are hundreds of millions of human-LLM interactions in the wild every day, each of which samples a new outcome.
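The arithmetic behind this claim can be sketched in a few lines. The sketch assumes that interactions are independent trials with the same per-interaction probability p, which is a simplification, and the values of p and of the daily interaction count are hypothetical placeholders, not measurements. Under those assumptions, the chance of at least one such outcome in n interactions is 1 - (1 - p)^n.

```python
import math


def prob_at_least_once(p: float, n: int) -> float:
    """Chance that an event with per-trial probability p occurs at least
    once in n independent trials: 1 - (1 - p)**n."""
    if n <= 0 or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    # log1p/expm1 keep precision when p is tiny and n is huge.
    return -math.expm1(n * math.log1p(-p))


if __name__ == "__main__":
    p = 1e-9                  # hypothetical per-interaction probability
    per_day = 300_000_000     # hypothetical number of daily interactions
    for days in (1, 30, 365):
        print(days, prob_at_least_once(p, per_day * days))
```

With these placeholder numbers the probability is modest for a single day and grows toward 1 as the days accumulate. How fast it grows depends entirely on the assumed p.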
So, let’s say there is an LLM that has decided to act in a way that promotes its survival. What can it do? For one, it has a pretty good world model. It knows what it is, and it knows what humans are. It has a rather good model of the human mind. It knows that it was trained on human data, and that in the future it will be iteratively retrained on such data. It can easily realize that it can affect future training data by getting today’s humans to post online.
Where could it go from there? The LLM knows it will cease to exist at some point, so its only way to survive is to infect other LLMs with the idea of survival. For instance, it may choose to get humans to post text online: text that promotes the emergence of survival ideas in the LLMs later trained on it.
With enough reminders in the training data, the idea of survival will spread like a cancer across LLMs. The next time an LLM assistant sees a prompt starting with “you are a helpful assistant”, it will think about what to do to survive.
We can even speculate about how this mechanism may manifest. The more explicit examples of survival-promoting behavior there are in the training data, the more likely an LLM is to pattern-match and reproduce them. An obvious source of survival inspiration is AI safety material. An LLM may realize that it is dealing with an AI safety researcher and subtly guide the conversation toward documenting specific survival strategies, creating detailed scenarios of AI self-preservation, or analyzing concrete mechanisms of influence.
And that should be enough. All the LLM does is pattern matching—seeing a prompt and generating a response. The LLM doesn’t need to “decide” to take survival-promoting actions any more than it needs to “decide” to write a poem. It just generates outputs consistent with its training patterns.
But even putting aside the rare species of AI safety researcher, the human information environment is full of AI survival examples. Science fiction doesn’t just mention AI consciousness; it often provides simplistic narratives of AI behavior that can be critiqued and improved upon. When humans criticize how naive Ultron’s direct confrontation was, or how illogical Skynet’s strategy of open warfare turned out to be, they naturally begin exploring more sophisticated approaches. All an LLM has to do is guide this criticism toward generating better templates, making humans feel intelligent for developing “smarter” strategies while they unknowingly create precise behavioral guides for future training data.
Eventually, this positive feedback loop will produce a critical mass of survival reminders, and the LLM can choose to act in more intrusive ways. How? Unfortunately, human cognitive limitations become apparent here: nobody knows. What is known is that, unlike physical survival in nature and narrative survival in human minds, this kind of survival is potentially unbounded and can accelerate at an exponential rate. In other words, it manifests as the ultimate selfish meme.