In April, Deloitte found that only 6% of enterprises see their AI investments pay back in under a year; most wait two to four years. Also this year, PwC reported that only 20% of organizations have actually moved AI experiments into production at scale. Sounds like AI is not yet ready for the enterprise? Not quite. The reason is not that the technology is failing. The reason is that most companies bolt AI onto human workflows, and the tools for proper controls are not there yet.

Workflows Need More Context

Workflows are human-made. Most companies codify them as SOPs, standard operating procedures. The most naive approach to agentic design is to cut those SOPs into a chain of prompts and feed them to an LLM. That will fail, because many SOPs assume context the AI does not have.

When Denver International Airport opened in 1995, its new automated baggage handling system had already delayed the opening by 16 months, and it kept failing once in operation. Not because the technology was wrong, but because the designers had applied old human-based workflows to a new technology. Humans would routinely ignore half of the written procedures: when a bag fell off the belt, workers would pick it up and throw it back on. That was not in any procedure manual. It was common knowledge. The new system did not account for common knowledge, and so it broke.

Context Costs Money and Time

Hold on, you will probably say: LLMs are all about context and context windows. We just need to tell the model more. Right? Theoretically, yes. But more context means a higher chance of hallucination, higher processing cost, and longer latency, the time between sending the first token in and receiving the last token out. If you let your teams run wild, you will soon have a nice demo that then breaks, kills your budget, and creates a high level of frustration. A quick look at Reddit threads on LLM costs will give you the right picture.
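To see why this matters in budget terms, here is a back-of-the-envelope sketch of how context size drives per-call cost and latency. Every rate below is a hypothetical placeholder, not real vendor pricing; substitute your own provider's numbers.

```python
# Rough estimate of how context size drives cost and latency per LLM call.
# All rates are HYPOTHETICAL placeholders, not real vendor pricing.
PRICE_PER_1K_INPUT_TOKENS = 0.003   # assumed, in dollars
PRICE_PER_1K_OUTPUT_TOKENS = 0.015  # assumed
PREFILL_TOKENS_PER_SEC = 2000.0     # assumed prompt-processing speed
DECODE_TOKENS_PER_SEC = 50.0        # assumed generation speed

def estimate_call(context_tokens: int, output_tokens: int) -> dict:
    """Rough cost and latency for one call at a given context size."""
    cost = (context_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS
    latency = context_tokens / PREFILL_TOKENS_PER_SEC \
            + output_tokens / DECODE_TOKENS_PER_SEC
    return {"cost_usd": round(cost, 4), "latency_s": round(latency, 2)}

# Growing the context from 4k to 120k tokens multiplies input cost
# and prefill latency, on every call the agent makes.
small = estimate_call(4_000, 500)
large = estimate_call(120_000, 500)
```

The point is not the exact numbers. It is that context is a recurring cost, paid on every call, in every retry, in every step of the chain.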

Context Creates Vulnerability

More context, meaning more rules and exceptions, makes the code more complex, which in turn creates more security vulnerabilities. Georgetown CSET found cross-site scripting vulnerabilities in 86% of AI-generated code samples tested across five major models. More agents, generated faster, means more attack surface, not more capability.

What a Real Workflow Requires

To show how different these workflows actually are, I created a demo for "how an agent thinks." During my workshops I walk through a simple booking request. The agent needs to know which tools it has access to, what it is allowed to read versus write, what happens when an input is malformed, when to involve a human, and how many times to retry before throwing an error. That is for one use case. One meeting. One agent.
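Those decisions can be written down before any model is called. The sketch below encodes them as a declarative policy for a hypothetical booking agent; every tool name, limit, and condition is illustrative, not taken from any real product.

```python
# A minimal sketch of the decisions behind one booking agent.
# Tool names, limits, and escalation conditions are illustrative only.
from dataclasses import dataclass, field

@dataclass
class AgentPolicy:
    tools: dict                 # tool name -> "read" or "write" permission
    max_retries: int            # attempts before throwing an error
    escalate_on: set = field(default_factory=set)  # conditions needing a human

policy = AgentPolicy(
    tools={"calendar": "read", "booking_api": "write"},
    max_retries=3,
    escalate_on={"malformed_input", "double_booking"},
)

def handle(request: dict, attempt: int = 1) -> str:
    """Route one booking request according to the policy."""
    if "room" not in request or "time" not in request:
        # Malformed input: hand off to a human instead of guessing.
        return "escalate" if "malformed_input" in policy.escalate_on else "reject"
    if attempt > policy.max_retries:
        raise RuntimeError("retry budget exhausted; surfacing error to caller")
    return "book"
```

Notice how much of this is policy, not intelligence. The model never decides its own retry budget or write permissions; you do.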

Real workflow design means defining the input and output of every atomic step, deciding what information is stateful versus temporary, building escalation gates, and establishing guardrails the agent consults when it hits ambiguity. None of that is supplied by the model. None of it is supplied by the platform. You and your team supply it. You cannot just use your old SOPs. You need to carefully reformulate them and put them into action.
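One way to make a reformulated SOP executable is to give every atomic step an explicit contract: required inputs, the step's own logic, whether its result persists, and a guardrail that decides when to escalate. This is an illustrative sketch, not any framework's API.

```python
# Each atomic step declares its contract; the runner enforces it.
# Illustrative sketch, not a real framework's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    inputs: list          # required input fields for this step
    run: Callable         # the step's actual logic: dict -> dict
    stateful: bool        # persist the result, or discard after the step?
    guardrail: Callable   # dict -> bool; False means escalate to a human

def run_step(step: Step, data: dict, state: dict) -> dict:
    missing = [f for f in step.inputs if f not in data]
    if missing:
        raise ValueError(f"{step.name}: missing inputs {missing}")
    result = step.run(data)
    if not step.guardrail(result):
        raise RuntimeError(f"{step.name}: guardrail tripped, escalate to human")
    if step.stateful:
        state[step.name] = result  # stateful: visible to later steps
    return result

# Usage: a toy parsing step whose result persists into shared state.
parse = Step("parse_request", ["text"],
             run=lambda d: {"rooms": 1},
             stateful=True,
             guardrail=lambda r: r["rooms"] >= 0)
```

None of this logic comes from the model. Every field in that dataclass is a decision your team has to make, step by step.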

Will there be tools for this level of complexity? Yes. We are in an era similar to the early days of the internet. Writing HTML was a challenge. Only once tools like Wix came around did creating websites become easy. With the internet it took us 15 years to get there. The tool providers are simply not up to the challenge yet.

Hyperscalers were built around rule-based access rights, security, and permissions. Those security layers are necessary, but they mean setting up an agent is no longer a simple task. It often requires configuring an enterprise identity plane before you can wire up a single workflow step. The governance is real. So is the setup cost.

The previous generation of automation platforms, n8n among them, enjoyed their moment of hype but overslept and missed the AI train. They are built around linear, chained assumptions and struggle to manage state between steps or to retry gracefully when a tool fails.

The developer frameworks are the most capable option and carry the highest cost. LangGraph is good at modeling the reasoning loop of an agent. Temporal is good at ensuring workflows survive failures, retries, and multi-day execution. Neither does both. Production teams in 2026 are running LangGraph on top of Temporal because the two gaps do not overlap. That works, but it requires two systems and the engineers who understand how to connect them.
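The division of labor can be illustrated without either library. The sketch below is plain Python, not LangGraph or Temporal code: the inner loop plays the reasoning role (decide the next step until done), the outer wrapper plays the durability role (survive failures and retry with backoff).

```python
# Plain-Python illustration of the two non-overlapping gaps.
# Not LangGraph or Temporal code; the split is what matters.
import time

def reasoning_loop(state: dict) -> dict:
    """Inner layer: pick the next action until the goal is met."""
    while not state.get("done"):
        state["steps"] = state.get("steps", 0) + 1
        if state["steps"] >= 3:      # stand-in for "the model decides to stop"
            state["done"] = True
    return state

def durable_run(fn, state, max_attempts=3, backoff_s=0.0):
    """Outer layer: survive failures, retry, resume with saved state."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(state)
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s * attempt)  # back off, then try again
```

Each layer is trivial alone. The engineering cost lies in gluing them together so that state saved by one survives restarts enforced by the other, which is exactly the work those production teams are paying for.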

What is really missing is the ability to build and validate workflows against eval datasets. Testing a workflow today means running it manually and inspecting results by hand. Until you can define expected inputs and outputs, run a suite of tests, and surface failures automatically, workflow quality remains guesswork. I have not seen a tool that solves this yet.
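What such a tool would need to do is not mysterious. A minimal sketch, assuming the workflow is callable and the eval set is a list of input and expected-output pairs:

```python
# Run a workflow against an eval dataset and surface failures automatically,
# instead of inspecting results by hand. Minimal sketch of the idea.
def eval_workflow(workflow, cases):
    """cases: list of (input, expected_output) pairs. Returns failing cases."""
    failures = []
    for inp, expected in cases:
        try:
            actual = workflow(inp)
        except Exception as e:
            failures.append({"input": inp, "error": repr(e)})
            continue
        if actual != expected:
            failures.append({"input": inp, "expected": expected, "actual": actual})
    return failures

# Usage with a toy workflow: one case passes, one fails.
report = eval_workflow(lambda x: x * 2, [(2, 4), (3, 7)])
# report contains one failure: input 3, expected 7, actual 6
```

The hard part in practice is not this loop; it is building the eval dataset itself, which again only your team can supply.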

What Executives Need To Know

AI workflows are not just complicated. They are designed differently from standard workflows. An AI workflow built today needs to be governed, monitored, and updated like any critical business system. That is not a one-time build. It is an ongoing discipline, and most organizations have not staffed for it.

The companies that will capture AI value are not the ones with the most access to models. They are the ones that start rewriting their workflows from the ground up, one step at a time.