AI that compounds: How Clearwater’s data foundation powers every agent we build 

By Darrel Cherry

Investment management runs on data. So when we started building AI into Clearwater three years ago, we already had something most companies have to spend years accumulating: $10 trillion in assets under management, two decades of investment data, and the domain expertise to know exactly what accuracy means in this industry. The question was never whether AI belonged here. It was how to build it right.

The journey offers real lessons, and it started in a place many companies find familiar: a security concern.

Building a secure foundation

When ChatGPT took off, CISOs everywhere had the same instinct. Sensitive company data should not travel to a public website. Our first move was to build Crystal, a private, secure internal version of the technology. We had it deployed within a couple of weeks.

Crystal was not just a chat interface. It included retrieval-augmented generation with knowledge bases tailored to Clearwater and CWAN, backed by vector database search so responses drew from specific, relevant company content. We added tools on top, giving employees the ability to take meaningful actions from within the system.
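To make the retrieval step concrete, here is a minimal sketch of retrieval-augmented generation in Python. It is illustrative only and not Crystal's actual implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Place the retrieved passages ahead of the question for the model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The point of the pattern is that the model only sees the handful of passages that score highest against the question, so answers draw on company content rather than on whatever the model happens to remember.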

That early sprint established something important: the technical foundation for everything that followed.

Casting a wide net: The idea-thon

With a working internal tool in hand, we wanted to understand where generative AI could genuinely move the needle across the business. We launched what we called an idea-thon, an open invitation to everyone at the company to submit ideas for how the technology might apply to their day-to-day work or to the broader vision for the platform.

A few hundred ideas came through the funnel. We distilled them, categorized them, and layered in our own strategic thinking to arrive at a focused top ten. That process shaped the roadmap.

Accuracy is non-negotiable

In investment management, the stakes around accuracy are high. A response that is 98% correct in a traditional linguistic evaluation can be 100% wrong for what a client actually needs. We built our own evaluation framework from the ground up because, three years ago, most of the now-familiar frameworks did not yet exist.

The approach uses an LLM-as-a-judge model with multiple evaluation facets. It goes well beyond BLEU and ROUGE scores, which measure linguistic similarity but miss domain-specific correctness. We built curated golden prompts and golden responses, using internal use cases to refine those benchmarks over time.
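A rough sketch of how a multi-facet, LLM-as-a-judge evaluation can be structured is shown below. The facet names, the `GoldenCase` type, and the `stub_judge` function are assumptions made for illustration; in a real system the judge would call a language model with a rubric, whereas the stub here is deterministic so that the example runs offline.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    prompt: str
    golden_response: str


# A judge scores one facet of a candidate answer in [0, 1].
Judge = Callable[[str, GoldenCase, str], float]

FACETS = ("numeric_accuracy", "completeness")  # hypothetical facet names


def evaluate(case: GoldenCase, candidate: str, judge: Judge,
             facets: tuple[str, ...] = FACETS) -> dict[str, float]:
    """Score a candidate answer on every facet independently."""
    return {facet: judge(facet, case, candidate) for facet in facets}


def passes(scores: dict[str, float], threshold: float = 0.9) -> bool:
    """A response passes only if every facet clears the threshold."""
    return all(score >= threshold for score in scores.values())


def stub_judge(facet: str, case: GoldenCase, candidate: str) -> float:
    """Deterministic stand-in for an LLM call (assumption, not a real judge)."""
    if facet == "numeric_accuracy":
        def numbers(text: str) -> set[str]:
            return set(re.findall(r"\d+(?:\.\d+)?", text))
        # Every figure in the golden response must appear in the candidate.
        return 1.0 if numbers(case.golden_response) <= numbers(candidate) else 0.0
    return 1.0
```

With a golden response of "Book yield is 4.25%." and a candidate of "Book yield is 4.52%.", the two strings are nearly identical word for word, yet the numeric facet scores zero and the response fails. That is the gap between linguistic similarity and domain correctness.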

That eval framework also serves as a model upgrade safety net. Whenever a new model version from Anthropic, OpenAI, or Amazon comes into play, we run it against the same rubric to confirm the new model performs as well as or better than the current one on the criteria that matter. Early generations of GPT models from OpenAI showed significant variance between versions. Newer models have been considerably more stable.

Think of it as regression testing for large language models, run against our own prompts and harnesses.
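The regression-testing analogy can be sketched as a simple gate that compares per-facet scores for the current model and a candidate replacement. The function name, the facet names, and the tolerance parameter are assumptions for illustration, not a description of the production harness.

```python
def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.0) -> list[str]:
    """Return the facets where the candidate model scores below the baseline.

    A facet missing from the candidate's results counts as a regression,
    because an unmeasured criterion cannot be assumed to have held steady.
    """
    return sorted(
        facet
        for facet, base_score in baseline.items()
        if candidate.get(facet, 0.0) < base_score - tolerance
    )


def safe_to_upgrade(baseline: dict[str, float],
                    candidate: dict[str, float],
                    tolerance: float = 0.0) -> bool:
    """The upgrade is approved only when no facet regresses."""
    return not find_regressions(baseline, candidate, tolerance)
```

Gating on each facet separately, rather than on an overall average, keeps a gain in one area from hiding a loss in another.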

Inside to outside: The trust-building path

Our path from internal tool to client-facing capability was deliberate. Internal teams used the agents first, provided feedback, and helped surface anything that needed polish. Once a use case proved out internally, it became a candidate for client access.

That internal testing phase did double duty. Teams refined the tools themselves and simultaneously generated the use cases that fed the evaluation system. By the time a client-facing agent launched, it had been stress-tested by people who understood the domain and whose feedback had already shaped the product.

For clients, the value proposition centers on efficiency. One example came from Clearwater Connect, our annual conference, where a client described the transformation in vivid terms. Analog processes that used to take months took weeks once a handful were digitized and automated. Automate a few more, and weeks became days. That compression of timelines lets investment teams move faster, make decisions faster, and, as a result, pursue top-line growth alongside the cost savings that come from automation.

The progression that emerged followed a natural arc. Chat came first. Then tooling, so users could navigate the application with natural language, giving the system genuine application awareness. Then data, so agents could utilize APIs to run calculations, generate reports, and tie out book yield figures on behalf of clients.

1,000 agents, and how that happened

The number surprises people. More than 1,000 agents running and supported is a byproduct of our decision to build a no-code agent development studio. The studio lets anyone in the organization select an LLM, define system prompts, attach knowledge bases, and configure tools, all without writing code.
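Behind a no-code studio, an agent typically reduces to a small declarative definition. The sketch below shows one way such a definition might look and be validated; the `AgentSpec` fields mirror the four choices described above, but the type, the field names, and the validation rules are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical declarative agent definition produced by a studio form."""
    name: str
    model: str
    system_prompt: str
    knowledge_bases: tuple[str, ...] = ()
    tools: tuple[str, ...] = ()


def validate(spec: AgentSpec, allowed_models: set[str],
             allowed_tools: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the spec is usable."""
    problems = []
    if spec.model not in allowed_models:
        problems.append(f"unknown model: {spec.model}")
    if not spec.system_prompt.strip():
        problems.append("system prompt is empty")
    for tool in spec.tools:
        if tool not in allowed_tools:
            problems.append(f"unknown tool: {tool}")
    return problems
```

Because the definition is data rather than code, it can be checked against an approved list of models and tools before the agent ever runs.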

We started with a few dozen carefully curated agents built for specific use cases. Then individual contributors and teams built their own. Our sales organization built agents for customer analysis, report generation, and account planning based on email history. Our InfoSec team built an agent that integrates with Jira to review architecture documentation, evaluate data flow diagrams, and provide feedback to development teams before a formal security review, so that by the time InfoSec conducts its review, most issues are already resolved.

The studio is available to clients as well. Investment management clients who want to build custom agents can upload their own knowledge, leverage our data, and configure workflows specific to their needs. Our authentication and authorization infrastructure keeps everything secure, maintaining data partitioning and access controls throughout.

The multi-agent shift

As we moved deeper into agentic workflows, complex tasks created a natural case for multiple specialized agents working in coordination. Compliance checks against investment policy statements, income analysis, and book yield calculations are each their own domain. Assigning a specialist agent to each sub-task, then aggregating the results into a comprehensive report, produces better outputs than asking a single agent to do everything.
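The specialist-plus-aggregator pattern can be sketched in a few lines. In this illustration each "agent" is a plain Python function so that the example runs on its own; in practice each would wrap a model with its own prompt and tools. The portfolio fields, agent names, and report format are all assumptions.

```python
from typing import Callable

# A specialist takes the shared portfolio data and returns its section of the report.
Specialist = Callable[[dict], str]


def compliance_agent(portfolio: dict) -> str:
    """Check issuer concentration against a (hypothetical) policy limit."""
    limit = portfolio["policy_max_single_issuer_pct"]
    worst = max(portfolio["issuer_pct"].values())
    if worst <= limit:
        return "PASS"
    return f"FAIL: issuer concentration {worst}% exceeds {limit}%"


def income_agent(portfolio: dict) -> str:
    """Summarize income for the period."""
    return f"Total income: {sum(portfolio['income']):.2f}"


def run_workflow(portfolio: dict, specialists: dict[str, Specialist]) -> str:
    """Run every specialist on the same input and aggregate one report."""
    sections = [f"{name}: {agent(portfolio)}" for name, agent in specialists.items()]
    return "\n".join(sections)
```

Each specialist sees the same input but owns a single question, so a wrong line in the report points directly at the agent that produced it.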

This architecture also makes high-stakes processes more manageable. Decomposing a complex workflow into smaller, auditable steps makes it possible to identify exactly where human review adds the most value. We start with human-in-the-loop at nearly every stage and remove it as confidence builds. For regulatory filings and other high-stakes outputs, a human reviewer remains part of the process by design.
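One way to express that review policy in code is sketched below. The `Step` type, the idea of tracking an approval rate per step, and the 0.95 threshold are assumptions chosen for illustration rather than the actual policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    high_stakes: bool = False  # e.g. regulatory filings


def needs_human_review(step: Step, approval_rate: dict[str, float],
                       threshold: float = 0.95) -> bool:
    """Decide whether a human must sign off on this step's output.

    approval_rate maps a step name to the share of past outputs that
    reviewers accepted unchanged (a hypothetical confidence measure).
    """
    if step.high_stakes:
        return True  # review is kept by design, whatever the track record
    # Steps with no track record default to review.
    return approval_rate.get(step.name, 0.0) < threshold
```

Under this policy a new step starts out reviewed, earns autonomy as its track record builds, and a high-stakes step never loses its reviewer.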

Our audit infrastructure supports all of this. Every prompt, every step in an agent’s reasoning, and every action taken is recorded. When a workflow behaves unexpectedly, we can trace back through the log to understand exactly what happened and why.
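A minimal sketch of an append-only audit trail follows. The `AuditLog` class, the event fields, and the event kinds are hypothetical, and a production system would write to durable storage rather than an in-memory list.

```python
import json
from datetime import datetime, timezone


class AuditLog:
    """Append-only record of what an agent run was asked, thought, and did."""

    def __init__(self) -> None:
        self._events: list[dict] = []

    def record(self, run_id: str, kind: str, detail: str) -> None:
        """Append one event; kind might be 'prompt', 'reasoning', or 'action'."""
        self._events.append({
            "ts": datetime.now(timezone.utc).isoformat(),
            "run_id": run_id,
            "kind": kind,
            "detail": detail,
        })

    def trace(self, run_id: str) -> list[dict]:
        """Return every event for one run, in the order it happened."""
        return [event for event in self._events if event["run_id"] == run_id]

    def to_jsonl(self) -> str:
        """Serialize the log as one JSON object per line."""
        return "\n".join(json.dumps(event) for event in self._events)
```

Keying every event to a run identifier is what makes the trace-back possible: one filter reconstructs the full sequence behind an unexpected result.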

What made it possible organizationally

Top-down support made speed possible. Because the executive team viewed this as a critical strategic initiative, it was prioritized, and the things that typically slow teams down were resolved quickly. We started with a core group of six or seven people and have since built an AI organization of roughly 35 to 40, led by a dedicated head of AI.

Our approach was iterative throughout. We tried things fast, tested them with internal stakeholders, gathered feedback, and adjusted. When technology wasn’t ready for something we wanted to do, we built in that direction anyway, shelved the attempt, and came back to it when newer models or frameworks made it viable. That approach, moving quickly and learning from what didn’t work yet, is a large part of why our AI capabilities are as mature as they are today.

Advice for organizations starting this journey

Three years and 1,000 agents in, the core advice is straightforward: move fast, run experiments, and push the boundaries of what’s currently possible. The technology is advancing faster than almost anything in recent memory. What wasn’t achievable six months ago may be entirely within reach today.

Find the people in your organization who are genuinely excited about this, use their enthusiasm and their early wins as proof points for the rest of the business, and build from there. The evidence of what’s possible is the most effective tool for bringing the broader organization along.

The path we have taken, from a secure internal chat tool to a full agent development platform with multi-agent workflows, client-facing capabilities, and a rigorous evaluation framework, was built one experiment at a time, on top of a data foundation that made every experiment worth running. That’s still the best way to start.

Darrel Cherry is Distinguished Engineer at Clearwater Analytics. This post is adapted from a conversation with AWS Executive in Residence Arvind Mathur on the AWS Executive Insights podcast.