One important implication is that pure AI companies such as OpenAI, Anthropic, Conjecture, and Cohere are likely to fall behind companies with access to large amounts of non-public-internet text data, like Facebook, Google, Apple, and perhaps Slack. Email and messaging are especially massive sources of “dark” data, provided they can be used legally and safely (e.g. without exposing private user information). Taking just email, something like 500 billion emails are sent daily, which is more text than any LLM has ever been trained on (admittedly with a great deal of duplication and low-quality content).
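As a rough sanity check on that claim, here is a back-of-envelope calculation. The daily email volume and the average tokens per email are loose assumptions, not measured values; the ~1.4 trillion token figure for Chinchilla is from the published paper.

```python
# Back-of-envelope: daily email text volume vs. LLM training-set size.
# ASSUMPTIONS: emails_per_day and tokens_per_email are rough guesses.

emails_per_day = 500e9        # assumed global daily email volume
tokens_per_email = 50         # assumed average email length in tokens
daily_email_tokens = emails_per_day * tokens_per_email

chinchilla_tokens = 1.4e12    # Chinchilla (70B) was trained on ~1.4T tokens

print(f"daily email tokens: {daily_email_tokens:.1e}")                   # ~2.5e13
print(f"vs. Chinchilla: {daily_email_tokens / chinchilla_tokens:.0f}x")  # ~18x
```

Even with aggressive deduplication and quality filtering cutting that by an order of magnitude or two, a single day of email would still rival the largest public training corpora.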
Another implication is that federated learning, data democratization efforts, and privacy regulations like GDPR are much more likely to be critical levers on the future of AI than previously thought.
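For context on why federated learning is a lever here: it lets models train on private data without that data ever being centralized, since clients send only model updates to a server. Below is a minimal sketch of federated averaging (FedAvg) on a toy linear-regression problem; the setup, function names, and hyperparameters are all illustrative assumptions, not any particular library's API.

```python
import numpy as np

# Minimal FedAvg sketch: each client trains locally on its private data,
# and only the resulting weights (never the raw data) reach the server.

def local_update(weights, X, y, lr=0.1, steps=10):
    """One client's local training: a few gradient steps on linear regression."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # MSE gradient
        w -= lr * grad
    return w

def fedavg_round(global_weights, client_datasets):
    """One federated round: average locally trained weights,
    weighted by how many examples each client holds."""
    sizes = np.array([len(y) for _, y in client_datasets], dtype=float)
    updates = [local_update(global_weights, X, y) for X, y in client_datasets]
    return np.average(updates, axis=0, weights=sizes)

# Toy example: three clients whose private data share one underlying model.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for n in (100, 50, 200):  # unequal client dataset sizes
    X = rng.normal(size=(n, 2))
    y = X @ true_w + 0.01 * rng.normal(size=n)
    clients.append((X, y))

w = np.zeros(2)
for _ in range(20):  # federated training rounds
    w = fedavg_round(w, clients)
print(w)  # approaches true_w without any raw data leaving a client
```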
Another implication is that centralised governments with the ability to aggressively collect and monitor citizens’ data, such as China, could be major players.
A government such as China’s has no need to scrape data from the Internet while staying mindful of privacy regulations and copyright. Instead, it can demand 1.4 billion people’s data from all of its domestic tech companies: emails, texts, WeChat messages, anything the government desires.
I suspect that litigation over copyright concerns with LLMs could significantly slow timelines, although it may come with the disadvantage of favoring researchers who don’t care about following regulations or data-use best practices.
I mean, Microsoft for one seems fully invested in (married to) OpenAI and will continue to be for the foreseeable future, and Outlook/Exchange is probably the biggest source of “dark” data in the world, so I wouldn’t necessarily put OpenAI on the same list as the others without strong traditional tech industry partnerships.
Allowing OpenAI to train on Microsoft’s customer data would essentially mean releasing confidential customer information to the public, since LLMs are known to memorize and regurgitate portions of their training data. I doubt that’s something Microsoft is willing to do.
And presumably data poisoning as well? This sort of thing isn’t easily influenced, since it’s deep in the turf of major militaries, but it would definitely be good news in the scenario where data becomes the bottleneck.
Thought-provoking post, thanks.