Reducing uncertainty when introducing changes to AI apps or agents is the key unlock for widespread adoption. Over the past decade, test-driven development (TDD) paved the way for building robust, maintainable software. As we step into the next era, evaluation-driven development (Eval-Driven, or EDD) will play a pivotal role in ensuring that compound AI systems are reliable, observable, and maintainable in production.
This repository, `eval-driven-agents`, provides a series of samples and best practices to help developers and organizations confidently evolve their AI solutions. By integrating evaluation-driven methodologies, such as continuous evaluation, tracing, telemetry, and observability, teams can iterate rapidly, maintain high quality, and make data-driven improvements.
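As a rough illustration of the tracing and telemetry side, the sketch below wraps an agent call in an OpenTelemetry span; `call_agent` and the span attributes are hypothetical placeholders, not APIs from these samples.

```python
# Minimal tracing sketch: wrap an agent call in an OpenTelemetry span so that
# prompts, answers, and latency become observable alongside other telemetry.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console for local inspection; a production setup would
# send them to a collector or observability backend instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("eval-driven-agents.demo")

def call_agent(prompt: str) -> str:
    """Hypothetical stand-in for a real function-calling agent."""
    return f"echo: {prompt}"

with tracer.start_as_current_span("agent.invoke") as span:
    question = "What is the refund policy?"
    span.set_attribute("agent.prompt", question)
    answer = call_agent(question)
    span.set_attribute("agent.answer", answer)
```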
- **Incremental Complexity:** Discover samples that start with basic function-calling agents with tracing and progress toward comprehensive, fully instrumented systems.
- **Observability & Tracing:** Gain visibility into model decisions, tool usage, system behavior, costs, latency metrics, and other key performance indicators to diagnose issues quickly and refine AI performance.
- **Evaluation-Driven Workflows:** Learn how to continuously evaluate changes through experimentation, measure their impact via automated CI/CD pipelines with GitHub Actions, and ensure that every update is a step toward greater reliability (see the evaluation sketch below).
- `<subfolder>`: Each folder highlights a specific capability or pattern (e.g., tracing, evaluations, experimentation, scenario testing), building on the fundamental concepts of evaluation-driven methodologies.
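To make the evaluation-driven workflow concrete, here is a minimal sketch of an evaluation gate that a CI pipeline (for example, a GitHub Actions job running pytest) could execute on every change; `run_agent`, the cases, and the 90% threshold are hypothetical placeholders rather than code from these samples.

```python
# Minimal evaluation-gate sketch: score agent answers against a small set of
# expected behaviors and fail the build when the pass rate drops below a threshold.

def run_agent(question: str) -> str:
    """Hypothetical stand-in for the agent under test."""
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What is 2 + 2?": "2 + 2 equals 4.",
    }
    return canned.get(question, "I don't know.")

EVAL_CASES = [
    {"question": "What is the capital of France?", "must_contain": "Paris"},
    {"question": "What is 2 + 2?", "must_contain": "4"},
]

def test_agent_pass_rate():
    # Run every case, score it with a simple substring check, and compute the pass rate.
    passed = sum(
        1 for case in EVAL_CASES
        if case["must_contain"].lower() in run_agent(case["question"]).lower()
    )
    pass_rate = passed / len(EVAL_CASES)
    # A CI job running this test fails the pipeline when quality regresses.
    assert pass_rate >= 0.9, f"Pass rate {pass_rate:.0%} fell below the 90% threshold"
```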
As you explore these samples, you’ll see how evaluation-driven development transforms the way we approach building, testing, and deploying AI agents, ultimately driving more robust solutions and more confident decision-making.