Large Language Models are trained on what people say in public.
They are not trained on what people say in private—if a conversation makes it into the training data, that conversation isn’t private any more.
Some people have conversations in public which they think are private. LLMs are trained on those private-intended conversations too, but only the ones held by participants who lack the knowledge or technical skill to keep their “private” dialogues out of the data set.
The conversations which do not make it into the data set are the ones held in private by individuals with the knowledge and skill to keep their private discussions out of data aggregation.
I mentally model talking to an LLM as talking to a personification of “society”: the aggregate of all contributors to the corpus.
But that’s not quite right—the LLM does not quite represent “society”. Due to the constraints on how training data can be collected, the LLM will overrepresent the “private” thoughts of people who err on the side of conversing publicly, and underrepresent the “private” thoughts of the people whose strategies for maintaining privacy actually work.
Part of what makes me myself is immortalized in LLMs already, because I have participated publicly on the internet for many years. It’s a drop in an ocean, sure, but it’s there. I have some acquaintances, met through free software involvement, who reject this form of public participation. Some don’t even blog. Their views and perspectives, and those of people like them, are tautologically absent from the LLM’s training, and thus it cannot think “like them”. It has an extra degree of remove: it can only mimic the mental model that online participants hold of those people. But the map is not the territory; secondhand accounts of an individual are meaningfully different from their own firsthand accounts. At least, I think there’s a difference?