Stop Cleaning Your Data. Use AI To Figure Out Which Info Matters
The most expensive piece of advice in enterprise technology right now is five words long: get your data ready first.
It is everywhere. The World Economic Forum reports that 72% of enterprises plan to prioritize data foundations and pipelines as their fastest-growing AI investment this year. Gartner predicts that through 2026, organizations will abandon 60% of AI projects unsupported by "AI-ready" data. Cloudera’s latest global survey found that 96% of IT leaders report AI integration — but nearly 80% say their initiatives are constrained by limited data access, and only 18% describe their data as fully governed. A Fivetran benchmark of more than 500 senior data and technology leaders found that 73% of enterprise data initiatives fail to meet expectations, despite average annual data spending of $29.3 million per organization.
The diagnosis is always the same: more governance, more cleaning, more pipeline engineering. Get the data house in order, then deploy AI.
This sounds prudent. It is destroying value at scale.
The logic has a seductive surface: bad data in, bad decisions out. Nobody disagrees. But the conclusion most enterprises are drawing — that data must be cleaned, standardized and governed before AI can be useful — inverts the actual sequence of value creation. It assumes you know which data matters before you have asked which decisions matter. And it treats AI as something that consumes clean data, rather than what it actually is: the most powerful tool ever built for finding structure in unstructured information.
The result is an enterprise data strategy that spends tens of billions of dollars a year polishing datasets that may contain no decision-relevant signal at all, while ignoring messy, unstructured sources that are rich with signal but don’t fit the governance framework.
The same dynamic recently played out in dramatic fashion when OpenAI shut down Sora, a product with massive compute but no proprietary signal moat beneath it. What killed Sora at the product level is killing enterprise AI initiatives at the portfolio level. Roughly 80% of data lake initiatives eventually fail, degenerating into what practitioners bluntly call "data swamps." The lakes are clean. The signal was never in them.
There is a better question to ask before any of this spending begins. It comes from decision theory, and it has a name: the expected value of perfect information (EVPI). The EVPI framework is simple: if you had perfect information, would it materially change the decision you are about to make? If the answer is no, the information has no economic value no matter how clean it is. If the answer is yes, even messy, unstructured, "dirty" data that points toward that answer is worth more than a perfectly governed dataset that tells you nothing new.
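To make the test concrete, here is a minimal sketch of the standard decision-theory calculation: the expected payoff when you can pick the best action in every state of the world, minus the expected payoff of the single best action chosen today. The function name, the launch-or-hold decision, the payoffs and the probabilities are all hypothetical, chosen only for illustration.

```python
def evpi(payoffs, probabilities):
    """Expected value of perfect information.

    payoffs[action][state] is the payoff of taking `action` when `state` occurs.
    probabilities[state] is the prior probability of `state`.
    """
    actions = list(payoffs)
    states = list(probabilities)

    # Without new information: commit to the one action with the best expected payoff.
    best_without_info = max(
        sum(probabilities[s] * payoffs[a][s] for s in states) for a in actions
    )

    # With perfect information: choose the best action separately for each state.
    expected_with_info = sum(
        probabilities[s] * max(payoffs[a][s] for a in actions) for s in states
    )

    return expected_with_info - best_without_info


# Hypothetical numbers, for illustration only.
probabilities = {"strong_demand": 0.5, "weak_demand": 0.5}

# Case 1: knowing demand would change the decision (hold instead of launch).
risky_launch = {
    "launch": {"strong_demand": 100, "weak_demand": -60},
    "hold": {"strong_demand": 0, "weak_demand": 0},
}

# Case 2: launching is best in both states, so knowing demand changes nothing.
safe_launch = {
    "launch": {"strong_demand": 100, "weak_demand": 10},
    "hold": {"strong_demand": 0, "weak_demand": 0},
}

value_risky = evpi(risky_launch, probabilities)
value_safe = evpi(safe_launch, probabilities)
print(value_risky, value_safe)
```

In the first case the information is worth 30 payoff units, because it would stop a launch into weak demand. In the second it is worth 0, however clean the underlying data, because the decision is the same either way.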
The enterprises that apply this lens first — signal before cleaning, decisions before infrastructure — are building durable competitive advantages. The ones that don't are building expensive filing cabinets.
The Koch Principle: Why Dirty Data Can Be More Valuable Than Clean Data
Koch Industries’ Pine Bend refinery in Minnesota offers an instructive analogy. Family members who feuded over the business openly referred to the refinery's Canadian feedstock as "garbage crudes": sulfur-laden, low-grade material that other refineries passed over. The crude was extraordinarily cheap precisely because its high sulfur content meant few refineries could process it, yet Koch sold the refined products into markets where supply was tight and prices were high. The willingness to process what others wouldn't, and the expertise to extract premium value from difficult feedstock, became one of the most durable competitive advantages in American industry.
The signal principle in AI works exactly the same way. Unstructured customer feedback, call center transcripts, satellite imagery, sensor telemetry, proprietary operational logs — these are the sulfur-laden crudes of the data world. Abundant, cheap and largely ignored by organizations that are busy cleaning what they already have. The firms that develop the capability to extract decision-relevant signal from this feedstock will outperform those that are still polishing their master data management programs.
And here is the irony the "data readiness" consensus misses entirely: AI is the refinery. It is the tool purpose-built to find structure in unstructured data, to surface patterns in noise, to extract value from feedstock that traditional analytics cannot process. Delaying AI deployment until the data is clean is like telling Koch to stop refining until someone else removes the sulfur from the crude. The sulfur is the opportunity. The refinery is the capability. You don't sequence the cleaning before the processing. You build the processing capability and let it tell you what's worth cleaning.
What Does a Signal-First Data Strategy Look Like?
The enterprise-scale proof of this principle predates generative AI by decades. Verisk Analytics didn’t build a data lake and ask what could be done with it. The company started with the decisions the insurance industry needed to make — how to price risk, detect fraud, model catastrophe exposure, assess claims — and then systematically acquired every organization that held signal relevant to those decisions.
AIR Worldwide gave it probabilistic catastrophe signal. Xactware, embedded in 80% of major property insurers’ claims workflows, gave it granular repair cost signal. Jornaya gave it real-time consumer intent signal — knowing who is actively shopping for coverage before they have applied. Each acquisition answered the same question: what information, if known, would actually change an underwriter’s, adjuster's or executive's decision? That is EVPI applied as an M&A strategy.
The network effects reinforced the moat. Participating insurers contribute their own loss data to Verisk’s statistical database in exchange for access to the aggregated industry dataset. Nobody is cleaning data for its own sake. They are contributing signal to get better signal back. Governance follows the value. It always has.
Verisk's market capitalization reached over $38 billion at its peak — roughly 13 times its 2009 IPO valuation. The asset being valued is not a data lake. It is the accumulation of decision-relevant signal — and the workflows through which that signal reaches the people who need it.
What Should the Chief Data Officer Actually Do?
This reordering has profound implications for how enterprises should think about data leadership. The chief data officer role has been defined, almost universally, as a governance and infrastructure function: clean the data, catalog the data, build the pipelines, manage compliance. These are real and necessary tasks. But they are second-order tasks.
The first-order task is signal identification: working backward from the decisions the enterprise needs to make and asking, with EVPI rigor, which information, if known, would actually shift those decisions and by how much. That is where competitive advantage lives. Governance and quality standards should follow signal priorities, not precede them.
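One way to put that ordering into practice is sketched below. It relies on a standard decision-theory result: the expected value of perfect information for a decision is a ceiling on what any data source can be worth for that decision, so a source that costs more to acquire, clean and govern than that ceiling is not worth the spend. The source names, decisions and dollar figures are hypothetical, and `prioritize` is an illustrative helper, not an established method.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    source: str    # hypothetical data source
    decision: str  # the decision this source would inform
    evpi: float    # value of perfect information for that decision (a ceiling)
    cost: float    # estimated cost to acquire, clean and govern the source


def prioritize(candidates):
    """Drop sources that cost more than their ceiling; rank the rest by headroom."""
    viable = [c for c in candidates if c.cost < c.evpi]
    return sorted(viable, key=lambda c: c.evpi - c.cost, reverse=True)


# Hypothetical figures, for illustration only.
candidates = [
    Candidate("master_customer_records", "billing address correction", 40_000, 400_000),
    Candidate("sensor_telemetry", "maintenance scheduling", 500_000, 200_000),
    Candidate("call_center_transcripts", "churn intervention", 900_000, 150_000),
]

ranked = prioritize(candidates)
for c in ranked:
    print(f"{c.source}: headroom {c.evpi - c.cost:,.0f}")
```

With these made-up numbers, the messy call center transcripts rank first and the tidy master records drop out entirely, because no amount of cleaning can make them worth more than the decision they inform.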
Traders have always understood this. A fixed-income desk doesn’t ask whether all the data is clean before looking for yield signals. It asks what moves the price. Enterprise data leaders need to develop that same instinct — and build organizations that reward signal discovery, not just data hygiene.
Clean data, in the absence of signal, is just an expensive filing cabinet.