White paper
The Enterprise GenAI App Builder's Model Evaluation Framework
How do you know that your GenAI app works? We've synthesized our experience from developing dozens of custom GenAI apps into a framework of robust model evaluation and improvement methods, so organizations can deploy custom GenAI apps they can trust to work.
Written by
Stephanie Sy
Founder and CEO
Contributors
Carlson Cheng, Evan Livelo, Kayle Amurao
Three key insights you'll gain
The strategic blend of buy + build. Explore why leading organizations leverage both off-the-shelf AI solutions and custom RAG applications—and how to determine which workflows demand bespoke development.
A systematic evaluation roadmap. Master the four levels of query complexity (L1-L4) and how targeted technical and human evaluation methods transform the vague "Is it good enough?" question into concrete, measurable criteria for production readiness.
Our Bento platform for scalable success. Discover our modular, platform-based approach that transforms GenAI application development from bespoke artisanal work into repeatable engineering—without sacrificing quality or customization.
Unblock your organization and move beyond demos into production
Over the last two years, we've built and launched RAG + Agent applications across multiple industries—each time tackling the same critical question: how do we know when an application is ready for launch?
Our field-tested methodology for moving beyond a basic prompt-based demo addresses the three most pressing deployment roadblocks: accurate evaluation, query complexity management, and human-AI collaboration.
Whether you're the CTO driving innovation strategy, a data scientist implementing evaluation frameworks, or a product manager balancing features with reliability, this whitepaper provides actionable guidance for establishing systematic quality standards that will help you make confident deployment decisions.
White paper preview

Quality Assurance: Measuring What Matters

Claude, GPT-4, and other leading models score impressively on standard academic benchmarks. You'll find these scores prominently displayed in research papers and technical documentation, alongside other benchmarks like MS MARCO (for search relevance) or BIG-bench (for reasoning).
But here's the critical insight: these benchmarks tell you almost nothing about how well an AI will perform on your specific business tasks. A model that excels at answering academic questions about monopoly theory might completely fail when trying to extract insights from your company's financial documents, compliance records, or customer service logs.
You can take a look for yourself at the public benchmarks available through the LMSYS Chatbot Arena rankings or Hugging Face’s Open LLM Leaderboard.
The gap between benchmark performance and business value remains enormous. When building RAG applications that connect AI models to your proprietary knowledge, you need evaluation metrics that actually align with how your users judge quality—metrics that capture accuracy, relevance, and business context in ways no academic benchmark can.
Beyond Benchmarks: Use a systematic process for evaluating GenAI applications
When building GenAI applications, some of our early projects ran into real problems when a model with strong ML performance scores (for example, high BLEU or F1 retrieval scores) ended up failing from the perspective of the end user. Users would grade a technically “correct” response as a failure, with reasons like, “It just misses the judgment call that I’d make,” or “It looked at the right document, but the answer doesn’t capture the right nuance.”
While it’s tempting to look for one go-to metric (e.g., 100% accuracy) that measures everything about the GenAI app’s performance, a too-narrow focus on a single metric misses the real goal: a good user experience that helps users get their job done.
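To make that multi-dimensional view concrete, here is a minimal sketch, assuming a Python evaluation script; the class and field names are illustrative, not our production schema.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ResponseEvaluation:
    """One evaluated answer from the app, scored on several axes rather than
    collapsed into a single accuracy number, so it stays visible whether
    retrieval, wording, or judgment was the problem."""
    query: str
    answer: str
    retrieval_hit: bool                  # did the expected document chunk appear in the retrieved context?
    automated_score: float               # e.g. a BLEU- or F1-style score against a reference answer
    human_rating: Optional[int] = None   # 1-5 rubric score from a reviewer, when collected
    human_notes: str = ""                # free-text reasons such as "misses the judgment call I'd make"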
Our systematic approach
Here's how our systematic process works. When asked to build a GenAI Knowledge App, we first map out all the likely types of questions the app will have to handle.
Since conversational models are open-ended and may be asked to handle many different types of queries, we use a layered strategy: we quantify model correctness on the deterministic parts of the response, and we give humans the tools to quickly and qualitatively assess the nuanced, rationale-driven parts.
The simplest queries can be measured with a pure retrieval accuracy score (Did it get the right chunk of information?), while questions requiring more complex rationale need to be handled either manually by humans or through a more sophisticated automated evaluation workflow.
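For instance, a bare-bones retrieval accuracy check over a hand-labeled evaluation set might look like the sketch below; the function and field names are our own, for illustration, and assume each test case records the chunk a subject-matter expert marked as correct.

def retrieval_hit_rate(eval_cases, top_k=5):
    """Fraction of queries whose expert-labeled chunk appears in the top-k retrieved chunks."""
    hits = sum(
        1
        for case in eval_cases
        if case["gold_chunk_id"] in case["retrieved_chunk_ids"][:top_k]
    )
    return hits / len(eval_cases) if eval_cases else 0.0

# Tiny hand-labeled example: the first retrieval found the right chunk, the second did not.
cases = [
    {"gold_chunk_id": "policy-doc-17", "retrieved_chunk_ids": ["policy-doc-17", "faq-03"]},
    {"gold_chunk_id": "contract-02", "retrieved_chunk_ids": ["faq-09", "policy-doc-17"]},
]
print(f"Retrieval hit rate: {retrieval_hit_rate(cases):.0%}")   # prints "Retrieval hit rate: 50%"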
To effectively categorize and evaluate these different query complexities, we've developed a framework based on four distinct types of queries...
Work with a recognized team of AI and data consultants
GenAI applications built for
Trusted by leading tech partners
An outline of this read:
  • Preface
  • Buy AND Build
  • Visualizing a RAG Agent App
  • Quality Assurance: Measuring What Matters
  • The Four Types of Queries
  • Systematic Evaluations of Query Correctness
  • Technology Methods for Improving RAG Agent Apps
  • Human-Centered Methods for Improving RAG Agent Apps
  • Bringing it All Together on a Platform
  • The Takeaways
Propel yourself into the next generation of business innovation
Build your first GenAI app with us

Sign up


* indicates required
Your work email, please.

By filling out this form, you consent to receive updates via email about our events and services, including our newsletter.


Up to 25% EBITDA increase with data analytics systems
Verified dashboards ensure insights are reliable and quality standards are met
Built-in analytics visualizations offer a holistic view of complex cross-departmental insights
Work with a team of 200+ AWS, Azure, and GCP-certified data and AI consultants
On-demand sales health insights for retail leaders
On-demand account health insights for sales leaders
Struggling with fragmented data and unsure where to start? Get expert guidance and clear, actionable insights on key metrics to streamline your technology decisions and data strategy.
Ginni So
VP of Data Platforms and Delivery, Thinking Machines Data Science
Nikki Vergara
Chief Well-being Officer and Co-founder, Positive Workplaces
Justin delos Reyes
Learning and Development Lead, Thinking Machines Data Science
Florence Broderick
Chief Marketing Officer, CARTO
Andrew Psaltis
APAC Technology Practice Lead, Data Analytics & AI, Google Cloud
Plug-and-play insights as your data transformation launchpad
Curated business metrics on best-of-breed cloud data technology make retail leaders like you data-empowered. We rapidly deliver reliable critical insights through a library of technology assets designed for the data needs of the retail industry.
Standardized data transforms that quickly and effortlessly prepare your data for actionable insights.
Industry-grade metrics and visualizations across your preferred software, including Power BI, Looker, or Tableau.
Enablement and a practical data roadmap to equip you to democratize the use of analytics across your organization.
It’s time to uncover the drivers of your business performance with curated insights
Nip customer churn in the bud: What are my biggest contributors to churn? How can I address them?
Learn from your top performers: Why are my top agents successful? How can others replicate it?
Spot patterns in your customers: What shared factors are driving my customers' purchasing decisions?
Get granular insights across sales, customers, and billing: What adjustments are impactful?
Work with an internationally recognized team of AI and data consultants and implementers trusted by leading tech providers