The Great Data Integration Schlep


Midjourney, “Fourth Industrial Revolution Digital Transformation”

This is a little rant I like to give, because it’s something I learned on the job that I’ve never seen written up explicitly.

There are a bunch of buzzwords floating around regarding computer technology in an industrial or manufacturing context: “digital transformation”, “the Fourth Industrial Revolution”, “Industrial Internet of Things”.

What do those things really mean?

Do they mean anything at all?

The answer is yes, and what they mean is the process of putting all of a company’s data on computers so it can be analyzed.

This is the prerequisite to any kind of “AI” or even basic statistical analysis of that data; before you can start applying your fancy algorithms, you need to get that data in one place, in a tabular format.

Wait, They Haven’t Done That Yet?

Each of these machines in a semiconductor fab probably stores its data locally. The team that operates one machine might not be able to see the data from the next one over.

In a manufacturing context, a lot of important data is not on computers.

Some data is not digitized at all, but literally on paper: lab notebooks, QA reports, work orders, etc.

Other data is “barely digitized”, in the form of scanned PDFs of those documents. Fine for keeping records, but impossible to search or analyze statistically. (A major aerospace manufacturer, from what I heard, kept all of the results of airplane quality tests in the form of scanned handwritten PDFs of filled-out forms. Imagine trying to compile trends in quality performance!)

Still other data is siloed inside machines on the factory floor. Modern, automated machinery can generate lots of data — sensor measurements, logs of actuator movements and changes in process settings — but that data is literally stored in that machine, and only that machine.

Manufacturing process engineers, for nearly a hundred years, have been using data to inform how a factory operates, generally using a framework known as statistical process control. However, in practice, much more data is generated and collected than is actually used. Only a few process variables get tracked, optimized, and/or used as inputs to adjust production processes; the rest are “data exhaust”, to be ignored and maybe deleted. In principle the “excess” data may be relevant to the facility’s performance, but nobody knows how, and they’re not equipped to find out.
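To make that concrete, here’s a toy sketch of the kind of check that gets run on the handful of variables that are tracked; the variable name and numbers are made up, and real SPC uses more careful limits than a naive mean plus or minus three standard deviations:

```python
# Toy statistical-process-control check: compute control limits from a baseline
# run and flag new readings that fall outside them. "etch_temp_c" is an invented
# variable name; only a handful of variables ever get this treatment, while the
# rest of what the machine records is "data exhaust".
from statistics import mean, stdev

def control_limits(baseline: list[float], sigmas: float = 3.0) -> tuple[float, float]:
    """Naive Shewhart-style limits: baseline mean +/- k standard deviations."""
    mu, sd = mean(baseline), stdev(baseline)
    return mu - sigmas * sd, mu + sigmas * sd

etch_temp_c = [201.2, 199.8, 200.5, 202.1, 200.9, 200.3]  # baseline runs, in control
lo, hi = control_limits(etch_temp_c)

new_reading = 207.9
print(f"limits ({lo:.1f}, {hi:.1f}); reading {new_reading} in control: {lo <= new_reading <= hi}")
# -> limits (198.4, 203.2); reading 207.9 in control: False
```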

This is why manufacturing/industrial companies will often be skeptical about proposals to “use AI” to optimize their operations. To “use AI”, you need to build a model around a big dataset. And they don’t have that dataset.

You cannot, in general, assume it is possible to go into a factory and find a single dataset that is “all the process logs from all the machines, end to end”.

Moreover, even when that dataset does exist, there often won’t be even the most basic built-in tools to analyze it. In an unusually modern manufacturing startup, the M.O. might be “export the dataset as .csv and use Excel to run basic statistics on it.”

Why Data Integration Is Hard

In order to get a nice standardized dataset that you can “do AI to” (or even “do basic statistics/data analysis to”) you need to (steps 3 and 4 are sketched in code after the list):

  1. obtain the data

  2. digitize the data (if relevant)

  3. standardize/“clean” the data

  4. set up computational infrastructure to store, query, and serve the data
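Here’s that toy sketch: a hypothetical miniature of steps 3 and 4, where one machine’s export gets standardized and loaded somewhere queryable. The file name, table name, and column names are all invented, and a real pipeline is vastly messier:

```python
# Hypothetical miniature of steps 3 and 4. All file/table/column names are invented.
import sqlite3

import pandas as pd

# Step 3: standardize/"clean" one machine's export: consistent column names,
# one timestamp format, drop unparseable rows and exact duplicates.
raw = pd.read_csv("machine_export.csv")
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]
raw["timestamp"] = pd.to_datetime(raw["timestamp"], errors="coerce")
raw = raw.dropna(subset=["timestamp"]).drop_duplicates()

# Step 4: storage and query infrastructure (SQLite standing in for the real thing).
with sqlite3.connect("plant.db") as conn:
    raw.to_sql("sensor_readings", conn, if_exists="append", index=False)
    daily = pd.read_sql(
        "SELECT date(timestamp) AS day, AVG(pressure_kpa) AS avg_pressure "
        "FROM sensor_readings GROUP BY day",
        conn,
    )

print(daily.head())
```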

Data Access Negotiation, AKA Please Let Me Do The Work You Paid Me For

Obtaining the data is a hard human problem.

That is, people don’t want to give it to you.

When you’re a software vendor to a large company, it’s not at all unusual for it to be easier to make a multi-million dollar sale than to get the data access necessary to actually deliver the finished software tool.

Why?

Partly, this is due to security concerns. There will typically be strict IT policies about what data can be shared with outsiders, and what types of network permissions are kosher.

For instance, in the semiconductor industry, everyone is justifiably paranoid about industrial espionage. They are not putting their factory data “on the cloud.” They may have fully airgapped facilities where nothing is connected to the open internet. They do not want images of in-progress chips, or details of their production processes, getting into “the wrong hands.”

Other industries have analogous concerns, about leaking trade secrets or customer details or other sensitive information. You will have to meet a lot of security requirements to get sensitive data out of the “blessed zone”[1] to put it on your computers, or to get your computers approved to install into the “blessed zone.”

Sometimes complying with security requirements for data sharing is simple; but the larger the company, the more likely you are to encounter multiple IT policies from different departments, and the more likely it is that some of them contradict each other, or are fundamentally incompatible with any kind of large-scale data integration.[2]

I have worked with a company that requires even potential vendors to spend over $1M of their own money building a special, ultra-secure room to store their data in…before the sale is even closed. Security requirements can be intense.

Then there are more “political” or “personal” reasons it can be hard to get lots of a company’s data in one place.

Sometimes people are worried that a “big data” tool will replace their jobs, or make their performance look bad, and they’re trying to obstruct the process.

Sometimes there are inter-departmental rivalries, such that people in one department don’t want to share data with another.

Sometimes people are just busy, and taking time out of their workday to do setup for a vendor is an inconvenience.

Dealing with the “human problem” of negotiating for data access is a huge, labor-intensive headache, and the effort scales pretty much linearly in the amount of data you’re trying to collect.

Palantir Technologies is now embracing the “AI” label, but back when I worked there, in 2016-2017, they billed themselves as a “data integration” company, because this is fundamentally what they do. Palantir builds its own software tools for managing and analyzing data — databases, ETL pipelines, analytics dashboards — and those tools work fine, but they are not, as far as I know, unique or exceptional in the tech industry. What is remarkable is that they have invested in a large number of people — at least a third of the company by headcount — to descend en masse to the customer’s facilities, set up and customize their database and associated tools, teach the users to work the UI, and, crucially, negotiate for data access.

The Palantir playbook, at least on the commercial side,[3] is:

  • only sell to companies facing an “existential threat”, i.e. where something has gone so drastically wrong they might go out of business altogether

  • get buy-in from the C-suite executives, who care about company-wide objectives (“not going out of business”, “making more profit”) and whose individual reputations gain from leading major company-wide initiatives (like a “digital transformation” push).

  • win the hearts of front-line workers who’ll use your software by being super helpful and making their jobs easier, and hanging out with them on-site and building personal connections

  • use your allies on the bottom (front-line workers) and the top (executives) to squeeze out your opponents in the middle (managers, often in IT or data science departments, who would rather not have an external vendor encroach on their turf).

    • The “higher-ups” can demand that you be given speedy assistance and data access;

    • the “front-line” guys can pressure their bosses to do what’s necessary to get the Cool New Palantir Tools up and running.

The Palantir Way is labor-intensive and virtually impossible to systematize, let alone automate away. This is why there aren’t a hundred Palantirs. You have to throw humans at the persuasion problem — well-paid, cognitively flexible, emotionally intelligent humans, who can cope with corporate dysfunction.

In that way, they’re a lot like management consultants…and in fact, data integration at large scale is inherently a little like management consulting.

Every company has something fucked-up and dumb going on somewhere, no matter how admirable they are in other respects, and if they’re facing an existential crisis there’s definitely something going badly wrong that somebody doesn’t want to face. If you ever want to get all your data in one place, you need to figure out some of the shape of the Badness, in an environment where most of the people you meet are presenting as “reasonable and highly competent professionals” and everybody’s got a different story about the Badness and why it’s unavoidable or someone else’s fault.

For instance, at the customer I was embedded with, there was an entire data science department, of maybe ten people, whose job it was to create a single number — a risk score — which would be sent to another department to deal with. The head of this second department didn’t trust statistics[4] so he threw the number in the trash. The entire work output of those ten data scientists was irrelevant to the company.

I am not, myself, exceptionally good at this “corporate detective work.” I tend to ask questions pretty bluntly, which sometimes gets good results and sometimes ruffles feathers I didn’t intend… and then I tend to back off when I get negative reactions, which is itself sometimes the wrong move in retrospect. I have just enough experience with this process to be aware it is a thing, and to be humbled by how hard it is to see the whole picture of what’s going on in an organization.

What it’s like trying to figure out wtf is going on with your customer

Data Cleaning, AKA I Can’t Use This Junk

Every working data scientist will tell you they spend more time on “data cleaning” than actually running any statistical or machine-learning models.

What does that mean?

Removing commas. Formatting adjustments generally. Normalization. Imputing missing data in some consistent way.

And understanding what your data sources actually refer to, so that the judgment calls you make actually make practical sense.
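Concretely, those micro-operations look something like the sketch below, on an invented column, assuming pandas; which steps are the right ones depends entirely on what the column means:

```python
# Sketch of typical "cleaning" moves on one invented column. Each step is
# trivial on its own; knowing which ones apply is the hard part.
import pandas as pd

df = pd.DataFrame({"throughput": ["1,204", "1,187", None, "1 310", "1290"]})

# Removing commas (and stray spaces) so the strings parse as numbers.
cleaned = (df["throughput"]
           .str.replace(",", "", regex=False)
           .str.replace(" ", "", regex=False))
df["throughput"] = pd.to_numeric(cleaned, errors="coerce")

# Imputing missing data in some consistent way; the column median here,
# which is itself a judgment call that depends on what the column measures.
df["throughput"] = df["throughput"].fillna(df["throughput"].median())

# Normalization, e.g. rescaling to zero mean and unit variance.
df["throughput_z"] = (df["throughput"] - df["throughput"].mean()) / df["throughput"].std()

print(df)
```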

Data cleaning doesn’t seem intellectually challenging, but it is surprisingly difficult to automate, in a way that I think would make David Chapman smirk — there are unenumerable ways you might need to “clean” data to get it into a standard format appropriate for training a model, and it empirically never seems to be possible to write a program that just does that for you, though of course individual parts of the process can be automated.

Part of the issue is that the “reasonable” thing to do can depend on the “real-world” meaning of the data, which you need to consult a human expert on. For instance, are these two columns identical because they are literal duplicates of the same sensor output (and hence one can safely be deleted), or do they refer to two different sensors which happened to give the same readings in this run because the setting that would allow them to differ was switched off this time? The answer can’t be derived from the dataset, because the question pertains to the physical machine the data refers to; the ambiguity is inherently impossible to automate away using software alone.
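For instance, here’s roughly how far software alone gets with that duplicate-column question (column names invented): it can detect the coincidence, but it cannot resolve it:

```python
# Software can *detect* that two columns are identical, but not decide what that
# means physically. Column names are invented.
import pandas as pd

run = pd.DataFrame({
    "chamber_temp_a": [410.2, 410.5, 409.9],
    "chamber_temp_b": [410.2, 410.5, 409.9],  # identical in this run, but why?
})

if run["chamber_temp_a"].equals(run["chamber_temp_b"]):
    # The dataset alone can't say whether temp_b mirrors the same sensor (safe to
    # drop) or is a second sensor that only matched because a particular machine
    # setting was off during this run (must keep). That answer lives with the
    # process engineer, not in the data.
    print("Columns identical in this run; ask a process engineer before dropping one.")
```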

I would expect that LLMs could make substantial, if not total, improvements in automating data cleaning, but my preliminary experiments with commercial LLMs (like ChatGPT & Claude) have generally been disappointing; it takes me longer to ask the LLM repeatedly to edit my file to the appropriate format than to just use regular expressions or other scripting methods myself. I may be missing something simple here in terms of prompting, though, or maybe LLMs need more surrounding “software scaffolding” or specialized fine-tuning before they can make a dent in data cleaning tasks.

Data cleaning, like data negotiation, is expensive because it’s labor-intensive: high-paid work that scales linearly with dataset size. The more different data sources you need to integrate, the more consultant-hours you need to spend negotiating for it, and the more engineer-hours you need to spend cleaning it.

Every big enterprise software company (think SAP, Salesforce, AWS, etc) that promises to be “the” singular place a large company can keep a large portion of its digital records, devotes a ton of labor to getting each client set up with the software, because that job involves data access, data standardization, and data transfer, all of which are hard.

There are entire industries of third-party “partner” companies that do nothing all day but help their clients set up Salesforce or AWS or whatever. And it’s not unusual for setup to take a year or more.

In general, it’s my observation that some kinds of work are harder than they look — tedious and seemingly routine, but often requiring judgment calls that are surprisingly hard to outsource to an intern or otherwise “junior” staff member, let alone automate away with software. Data cleaning is one of them. So are data labeling, data entry, summarizing and organizing “key” facts and numbers from large piles of documents, etc. Apart from the basic programming involved in data cleaning, none of this requires any special education or training; it’s “just” reading, writing, arithmetic, and “common sense”…and yet, somehow, not just anyone can do it right.[5]

AI is Gated On Data Integration

This is why I disagree with a lot of people who imagine an “AI transformation” in the economic productivity sense happening instantaneously once the models are sufficiently advanced.

For AI to make really serious economic impact, after we’ve exploited the low-hanging fruit around public Internet data, it needs to start learning from business data and making substantial improvements in the productivity of large companies.

If you’re imagining an “AI R&D researcher” inventing lots of new technologies, for instance, that means integrating it into corporate R&D, which primarily means big manufacturing firms with heavy investment into science/engineering innovation (semiconductors, pharmaceuticals, medical devices and scientific instruments, petrochemicals, automotive, aerospace, etc). You’d need to get enough access to private R&D data to train the AI, and build enough credibility through pilot programs to gradually convince companies to give the AI free rein, and you’d need to start virtually from scratch with each new client. This takes time, trial-and-error, gradual demonstration of capabilities, and lots and lots of high-paid labor, and it is barely being done yet at all.

I’m not saying “AI is overrated”, at all — all of this work can be done and ultimately can be extremely high ROI. But it moves at the speed of human adaptation.

Think of the adoption of computers into the business world, decades ago. There was nothing fake about that “transformation”. It happened; it mattered; many fortunes were made. But it happened at the speed of human negotiation and learning — corporate executives had to decide to buy computers, and many individuals within each company had to figure out what they were going to do with the new machines, and this had to happen independently for each individual use case. It wasn’t instantaneous. And it didn’t hit everywhere at the same speed; some applications would be doing things manually long after other applications would be fully digitized.

So too with AI. You can’t skip over the work of convincing customers that the tool will be worth the setup costs, overcoming internal friction to actually change the way companies operate, and testing and adapting the tool for different contexts.

  1. ^

    not a technical term; the actual terminology will differ by industry and context. Sometimes they call it “OT” or “high side”.

  2. ^

    Pharma trade secrets, for instance, are often kept secret from other departments within the company. This can make it tricky to build datasets that span multiple departments.

  3. ^

    I don’t know as much about the government side, where they’re selling to military, security agency, and law enforcement customers, and where the constraints and incentives are significantly different.

  4. ^

    literally, he did not believe in probabilities between zero and one. Yes, such people exist. He would say things like “either it is, or it isn’t” and didn’t buy it when we tried to explain that a 90% chance and a 10% chance are both uncertain but you should treat them differently.

  5. ^

    the “conventional wisdom” in image processing is that “data labeling” is a trivial task you can farm out to mechanical-Turk gig workers, and that smart, well-paid people write machine-learning models. But actually, I’m convinced that staring at the images is skilled work that can be done much better by some people than others. For instance, notice Casey Handmer’s discovery of the “crackle pattern” characteristic of ink on the Herculaneum scrolls; before any algorithms got written, he spent hours staring at the pictures. Do you, in principle, need to be a JPL engineer and successful startup founder to see a pattern in pictures? No. In principle, anyone could do it. In practice, it was someone “overqualified” for the job who did. This is not, anecdotally, unusual.