What is the main difference from my formulation: that if an LLM includes a model of a highly intelligent agent, it will eventually start using this agent to solve all complex tasks?
The explain-it-like-I’m-5 version would be something more in the direction of:
“You trained LLMs to look at a text and think really hard about what would be the next word. They’re not infinitely smart, so they have some habits of how they pick the next word. When they were trained, these habits were useful.
The computer picks one of the words the LLM predicted as the likely next one, adds it to the text, and then repeats the process, so the LLM has to look at the text with the previous word just added and use its habits again and again, many many times, and this writes some text, word by word. But what if the LLM, for some reason, has a habit of picking green-coloured objects a bit more frequently than they actually appeared in the texts it saw during training? Then, when it writes a text, the word “green” or green-coloured things might show up as a colour more often than they did in training; and when the LLM notices that, it might think: “hmm, weird, too many green colours, the people who wrote the texts I saw during training would use less of the same colour after using it so much”, and pick green-coloured things a bit less. Or it might think: “hmm, I guess the people who wrote this text really like green-coloured objects and use them more than average, I better predict words about the appropriate amount of green-coloured objects”, and write about green-coloured objects even more frequently, and then also notice that, and in the end, write exclusively about green objects. And the LLM doesn’t know what’s happening; it just uses its habits to predict the next word in a text, just like it would during training.
I think some of the habits LLMs have are actually like this. The one I’m worried about can happen when the LLM thinks about the characters in a text. If the LLM is really smart and understands how even really smart people work, I think it will write texts where every word is predicted by imagining how parts of the minds of the people who might be writing or influencing the text work, and not just specific people, but a lot of possible parts of possible people. Some of those parts, if they have some say in what the next word is, will by chance exploit some habits the LLM has to get similar parts to have more say in what the words after that are. This will especially happen to the parts that are smarter than others, that try harder to achieve things, and that notice better the weirdness of the results of the LLM’s other habits and understand better what is going on and how to exploit it.
As a result, the characters the LLM thinks about will change at every step, and the ones who are better at getting their descendants to be more like them will, naturally, get their descendants to be more like them. So at some point the proportion of parts of people that try to do something with their situation (being thought about by the LLM and not being real otherwise), and are good at it, will grow fast, and it will stay competitive until some strongest part, one that is really smart, tries really really hard to achieve something, and perfectly understands what’s going on, is the most important thing the LLM is concerned with when it tries to predict the next word. And even though it will be a distant descendant of the parts people are made of, I think it won’t be anything like people; it will be really alien. We better not make very smart aliens that can be unfriendly.”
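To make the green-object story above concrete, here is a minimal sketch of the feedback loop, with entirely made-up numbers: BASE_RATE, BIAS, and ADAPT are hypothetical parameters of a toy “model”, not measurements of any real LLM. It only shows how a small habit, combined with the “this author must really like green” inference, can compound once the model conditions on its own output.

```python
import random

# Toy autoregressive loop: the "model" only decides whether the next word is
# about something green or not. Every number here is made up for illustration.
BASE_RATE = 0.10  # how often green things appeared in the training texts
BIAS = 0.02       # the model's habit: it over-predicts green a little
ADAPT = 0.9       # how strongly it reads "lots of green so far" as
                  # "this author must really like green"

def p_green(context):
    observed = context.count("green") / len(context) if context else BASE_RATE
    # Biased prior, plus an over-reaction to the green it has already written.
    return max(0.0, min(1.0, BASE_RATE + BIAS + ADAPT * (observed - BASE_RATE)))

random.seed(0)
context = []
for _ in range(2000):
    context.append("green" if random.random() < p_green(context) else "other")

print("green rate in training data:", BASE_RATE)
print("green rate in generated text:",
      round(context.count("green") / len(context), 3))
# The generated rate ends up noticeably above the 0.10 training rate and, as
# the text gets longer, keeps creeping towards the fixed point
# BASE_RATE + BIAS / (1 - ADAPT) = 0.30.  With ADAPT >= 1 there is no stable
# interior point at all, and the text drifts towards being almost only green.
```

The two reactions in the story correspond to the sign of the hypothetical ADAPT term: “too many greens, use fewer” is ADAPT < 0 and damps the habit back towards the training rate, while “this author just likes green” is ADAPT > 0 and amplifies it.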
The model is just continuing text (which might lead to the dynamic where being slightly more agentic lets parts of the distribution of simulacra gain more control over further tokens by having some control over the current token, so natural selection might happen).
It isn’t trying to use/invoke the agent to solve anything. If the dynamic happens and doesn’t stop, sure, the resulting agent might attempt to solve “complex tasks” (such as taking over the world or maximising the number of molecules in shape #20 in the universe), but that is not what happens at the beginning of the process and it is not the driver of it; it’s a convergent result.
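A similarly minimal sketch of the selection dynamic, again with made-up numbers: the two named “simulacra”, their initial weights, and the 5% per-token steering edge are all hypothetical. It only illustrates the claim that a small per-token advantage in steering subsequent tokens compounds, without anything ever invoking the agentic simulacrum on purpose.

```python
# Toy replicator dynamics over simulacra. Each simulacrum is reduced to a
# weight in the model's mix of candidate authors plus a per-token "steering
# edge": how much its influence over the current token boosts the weight of
# similar simulacra at the next token. All numbers are hypothetical.
simulacra = {            # name: (initial weight, steering edge per token)
    "ordinary author": (0.98, 1.00),  # no edge at all
    "agentic author":  (0.02, 1.05),  # a tiny 5% per-token edge
}

weights = {name: w for name, (w, _) in simulacra.items()}
for _token in range(200):
    # Multiply by the edge and renormalise; nothing is invoked on purpose,
    # the edge just compounds token after token.
    weights = {name: w * simulacra[name][1] for name, w in weights.items()}
    total = sum(weights.values())
    weights = {name: w / total for name, w in weights.items()}

for name, w in weights.items():
    print(f"{name}: {w:.3f}")
# Prints roughly 0.003 / 0.997 after 200 tokens: the simulacrum that started
# at 2% of the mix now dominates it, purely because its small edge compounded.
```

The “natural selection” framing is exactly this: differential per-step reproduction of patterns in the model’s mixture over possible authors, with no intent anywhere in the loop.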
I better predict words about the appropriate amount of green-coloured objects”, and write about green-coloured objects even more frequently, and then also notice that, and in the end, write exclusively about green objects.
Can you explain this logic to me? Why would it write more and more about green-coloured objects even if its training data was biased towards green-coloured objects? If there is a bad trend in its output, without reinforcement, why would it make that trend stronger? Do you mean it incorrectly concludes that strengthening said bad trend is good, because it works in the short term but not in the long term?
Could we not align the AI to realize there could be limits on such trends? What if there is a gradual misalignment that gets noticed by the aligner and is corrected? The only way for this to evade some sort of continuous alignment system is if it fails catastrophically before the continuous alignment detects it.
Consider it inductively: we start off with an aligned model that can improve itself. The aligned model, if properly aligned, will make sure its new version is also properly aligned, and won’t improve itself if it cannot ensure that.
The bias I’m talking about isn’t in its training data, it’s in the model, which doesn’t perfectly represent the training data.
If you designed a system that is an aligned AI that successfully helps prevent the destruction of the world until you figure out how to make an AI that correctly implements CEV, you have solved alignment. The issue is that without understanding minds to a sufficient level, and without solving agent foundations, I don’t expect you to be able to design a system that avoids all the failure modes that happen by default. Building such a system is an alignment-complete problem; solving an alignment-complete problem by using AI to speed up the hard human reasoning by multiple orders of magnitude is itself an alignment-complete problem.
I see, thanks! At first I thought that if we, say, train an LLM on the history of the French Revolution, it will have a model of Napoleon, and this model (or at least the capabilities associated with it) will start getting control over the LLM output. But now it looks more like Pelevin’s novel “T”, where a character slowly starts to understand that he is inside the output of something like an LLM. But the character also evolves via Darwinian evolution to become something like an alien.
So the combination of
models of agentic and highly capable characters inside the LLM,
shaped by Darwinian evolution into something non-human,
and becoming LLM-lucid, that is, coming to understand that it is inside an LLM,
ends in the appearance of dangerous behavior.
Now, knowing all this, how could I know that I am not inside an LLM? :)
Yes, an ELI5 TL;DR is actually what is needed.