AI Has A Data Problem - Causal Data May Solve It
Artificial intelligence has never been more powerful or more misunderstood.
Despite billions invested in machine learning, most initiatives struggle to deliver meaningful results. Research from McKinsey suggests that only about 20% of companies have successfully scaled AI, while studies from MIT Sloan Management Review and Boston Consulting Group indicate that roughly 70% of organizations see little to no impact from their AI efforts.
The prevailing explanation is that AI is still maturing and that models need to improve, or that organizations haven’t yet adapted.
But what if the problem isn’t the models at all? What if the problem is the data they’re trained on?
For decades, the data science ecosystem has been built on a narrow foundation: transaction records, web activity, market data, and other forms of observed behavior. These datasets are abundant, scalable, and easy to ingest. They are also fundamentally limited. They tell us what happened. They don't tell us why.
Most AI systems operate inside a closed loop: trained on historical data, optimized to detect patterns, and deployed to predict future outcomes based on those same patterns.
When conditions shift (whether due to economic shocks, changing consumer sentiment or new competitive dynamics) models trained on historical correlations begin to break down. Signals degrade. Forecasts miss. Decisions lag.
Why Causal Data Matters: From Correlation to Mechanism
This shift toward causal data reflects a broader movement in data science to move beyond correlation-based modeling.
Researchers at ADIAlab and CausalAI (Stanford CausalLab) have highlighted that many machine learning systems are vulnerable to statistical mirages: patterns that appear meaningful in historical data but fail when conditions change.
These failures stem from models trained on observational data that captures co-movement, not causation.
Causal approaches address this by focusing on mechanisms rather than patterns, starting with the question: what causes what? By specifying relationships before estimating effects, models become more:
- Robust to changing conditions,
- Reproducible and testable,
- Auditable and transparent.
This reduces reliance on inference and allows models to measure intent directly rather than reconstruct it from downstream behavior.
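The robustness claim above can be illustrated with a small simulation. This is a purely hypothetical sketch (the variables, coefficients, and regimes are invented for illustration, not drawn from any real dataset): an outcome is driven by one causal input, while a second variable merely co-moves with that input under the initial regime. A model fit on the causal driver keeps working after the regime shifts; a model fit on the correlated proxy breaks down.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, promo_rate):
    # Hypothetical mechanism: purchases are caused only by intent.
    # "promo" co-moves with intent under some regimes but has no causal effect.
    intent = rng.normal(0, 1, n)                        # causal driver
    promo = promo_rate * intent + rng.normal(0, 1, n)   # correlated proxy
    purchases = 2.0 * intent + rng.normal(0, 0.5, n)    # outcome
    return intent, promo, purchases

# Regime A: the proxy tracks the causal driver, so both look predictive.
intent, promo, purchases = simulate(5000, promo_rate=1.0)
beta_causal = np.polyfit(intent, purchases, 1)[0]
beta_proxy = np.polyfit(promo, purchases, 1)[0]

# Regime B: conditions change and the proxy decouples from the driver.
intent2, promo2, purchases2 = simulate(5000, promo_rate=0.0)
err_causal = np.mean((purchases2 - beta_causal * intent2) ** 2)
err_proxy = np.mean((purchases2 - beta_proxy * promo2) ** 2)
print(f"causal-feature error: {err_causal:.2f}, proxy-feature error: {err_proxy:.2f}")
```

The model trained on the true mechanism generalizes across the regime change; the model trained on the statistical mirage does not, which is exactly the failure mode the researchers describe.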
As AI systems are increasingly used for high-stakes decisions, this shift—from pattern recognition to causal understanding—is becoming essential.
Causal vs. Transaction Data: A Continuum of Demand Formation
Rather than viewing causal and transaction data as separate systems, it is more accurate to see them as points along a single timeline of decision-making.
Causal data captures demand formation before purchase. Transaction data captures demand only after it has already occurred.
While the transaction is initiated at the moment of purchase, it only becomes visible and usable as data after it is recorded, aggregated, and analyzed, further extending the lag between cause and measurement.
This timeline reveals a critical insight: from the earliest emotional shifts in consumers to the eventual reaction in markets, the full cycle can span as much as 150 to 165 days.
That gap is not theoretical; it is measurable. Emotional and expectation-based signals can emerge months before purchase behavior, while the full chain of data aggregation, analysis, earnings reporting, and market reaction can extend several months beyond.
The implication is clear: by the time most traditional datasets reflect a change, that change has already been in motion for months.
The Data Scale vs. Data Relevance Problem
At the same time that organizations are investing heavily in artificial intelligence, they are also investing billions in data infrastructure, building massive data environments designed to store and process ever-increasing volumes of information.
Much of this data is observational in nature: transaction records, clickstream activity, social media interactions, and other forms of digital exhaust. While abundant, this data often reflects behavior after it occurs and is frequently shaped by algorithmic amplification, automation, and non-human activity.
As a result, organizations are left with datasets that are vast in scale but uneven in signal.
To extract value, these datasets require layers of inference: assumptions about intent, motivation, and future behavior that are not directly measured and often cannot be validated. This introduces fragility into machine learning systems, particularly when conditions change or when historical patterns no longer hold.
The challenge is not simply one of volume, but of relevance.
Building larger data environments does not necessarily produce better outcomes if the underlying data does not capture the drivers of decision-making.
Causal data offers a different path. By directly measuring the inputs to consumer behavior (emotions, expectations, intentions, and constraints) it reduces the need for inference and grounds models in observable mechanisms rather than assumptions.
This shift, from accumulating more data to capturing more meaningful data, has implications not only for model performance, but also for the efficiency of the systems built to support them.
Transaction data is reactive. Causal data is predictive.
Transaction data tells you what was purchased, when it happened, and how much was spent. Causal data tells you why it happened, what will happen next, and what is about to change.
Consider a common retail scenario. Consumers begin to feel financial pressure from rising costs, uncertainty about job security or declining confidence in the economy. At this stage, behavior is already shifting beneath the surface. Fewer consumers plan discretionary purchases. More expect to spend less. Financial anxiety begins to rise.
No transaction data reflects this yet.
Weeks later, the effects begin to appear. Spending slows. Basket sizes shrink. Traffic declines. Only after that do revenues miss expectations, earnings disappoint and markets react.
Causal data captures the shift at the beginning. Traditional data captures it near the end.
When these upstream signals are used in machine learning models, the advantage is structural. Models can detect inflection points earlier. Forecasts become more robust. Performance becomes more stable across cycles. Outputs become more interpretable.
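The lead-time advantage can be sketched numerically. In this toy example (all series are synthetic and the eight-period lag is an invented parameter, not a measured one), a sentiment-style index reflects an underlying demand signal in near real time, while spending reflects the same signal only several periods later; scanning cross-correlations at candidate lags recovers how far upstream the signal sits.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical series: sentiment leads spending by `true_lag` periods.
true_lag, n = 8, 200
latent = rng.normal(size=n + true_lag)            # underlying demand-formation signal
sentiment = latent[true_lag:]                     # measured upstream, close to real time
spending = latent[:n] + 0.3 * rng.normal(size=n)  # surfaces in transactions much later

# Scan candidate lead times: shift sentiment forward and correlate with spending.
corrs = [np.corrcoef(sentiment[: n - k], spending[k:])[0, 1] for k in range(16)]
lead = int(np.argmax(corrs))
print(f"estimated lead time: {lead} periods")
```

In a forecasting pipeline, a lead estimated this way is what lets a model flag an inflection point while the corresponding transactions are still weeks away from appearing in the data.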
For corporations, this enables earlier planning. For competitive strategy, it reveals where demand is moving. For economic forecasting, it provides lead time. For investors, it creates a timing advantage—the foundation of alpha.
Artificial intelligence has never been more powerful …but its limitations are becoming clearer. If the next phase of AI is going to deliver on its promise, it will not come from better models alone. It will come from better causal data.
And the organizations that understand not just what is happening, but why, will be the ones that finally turn AI into meaningful outcomes.
Disclosure: The consumer sentiment study referenced above was conducted by my company, Prosper Insights & Analytics. This is the same dataset used by the National Retail Federation and is available from Amazon Web Services, Bloomberg, and the London Stock Exchange Group for economic benchmarking.