
Testing LLM Outputs Systematically


This is Part 8 of our series on plan-based development with Claude Code. Today we explore a paradigm shift: testing AI outputs with Evalite and why traditional unit tests don’t cut it for LLM-powered features.


The Problem with Testing AI


Traditional unit tests have a fundamental assumption: given the same input, you get the same output. But LLMs are inherently non-deterministic. The same prompt can produce different (but equally valid) responses.


// Traditional test - this doesn't work for AI
describe("generateSummary", () => {
  it("returns expected summary", async () => {
    const result = await generateSummary("Long article text...");
    expect(result).toBe("Expected exact summary"); // ❌ Will fail randomly
  });
});

This creates a testing gap. We can’t use traditional assertions, but we also can’t ship AI features without confidence they work correctly.


Enter Evalite: LLM Evaluation Framework


Evalite is a testing framework designed specifically for LLM outputs. Built on Vitest, it feels familiar but uses scoring functions instead of assertions. Think of it as .eval.ts being the new .test.ts for AI code.


import { evalite } from "evalite";
import { Levenshtein } from "autoevals";

evalite("Summary Generator", {
  // Test data with input and expected
  data: [
    {
      input: "Long article about climate change...",
      expected: "Article discusses rising temperatures and policy responses",
    },
  ],
  // The AI function under test
  task: async (input) => {
    return await generateSummary(input);
  },
  // Scorers evaluate output quality (0-1)
  scorers: [Levenshtein],
});

The key insight: we’re not testing for exact output, we’re testing for quality output.


Why This Matters for Plan-Based Development


In our monorepo, we have over a dozen AI-powered features, including:


  • Survey analysis tools (@bts/survey-ai)
  • Results interpretation (@bts/results-ai)
  • Content generation (@bts/content-ai)
  • Code assistance features (@bts/highlighter-ai)

Before Evalite, testing these was a nightmare:


  1. Manual testing: Run the feature, read the output, decide if it’s “good enough”
  2. Snapshot testing: Capture outputs, but any regeneration breaks tests
  3. Regex assertions: Fragile, and they don't capture semantic correctness (see the sketch after this list)
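
For a sense of how brittle those approaches were, here's the kind of assertion we used to write (a hypothetical Vitest sketch against the same generateSummary function shown above):


// Pre-Evalite: assert on surface patterns instead of meaning
describe("generateSummary (regex assertions)", () => {
  it("mentions policy and looks long enough", async () => {
    const result = await generateSummary("Long article about climate change...");
    // Passes or fails on wording, not on whether the summary is actually good
    expect(result).toMatch(/polic(y|ies)/i);
    expect(result.length).toBeGreaterThan(100); // arbitrary proxy for "enough detail"
  });
});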

With Evalite, we have systematic, reproducible quality metrics.


Anatomy of an Evalite Test


The Data: Input-Expected Pairs


data: [
  {
    input: {
      surveyResponses: mockResponses,
      segments: ["gen-z", "millennials"],
    },
    expected: "Analysis should cover both segments with actionable insights",
  },
],

The data array contains test cases. Each has an input (passed to your task) and expected (the ground truth scorers compare against). Unlike unit tests where expected is the exact output, here it’s reference material for semantic comparison.


The Task: Your AI Function


task: async (input) => {
  const analysis = await analyzeSurveyWithAI({
    responses: input.surveyResponses,
    segments: input.segments,
  });
  return analysis.summary;
},

This is the actual AI-powered function you’re testing. Evalite runs it and captures the output for scoring.


The Scorers: Quality Metrics


Scorers receive input, output, and expected, returning a score from 0 to 1. You can use built-in scorers from autoevals or create custom ones:


import { Factuality } from "autoevals";
import { createScorer } from "evalite";

// Built-in scorers from autoevals
scorers: [
  Factuality, // LLM-as-judge for factual consistency
],

// Inline custom scorer
scorers: [
  {
    name: "Contains Action Items",
    description: "Checks if output includes actionable recommendations",
    scorer: ({ output }) => {
      const actionWords = ["should", "recommend", "consider", "implement"];
      const hasActions = actionWords.some(word =>
        output.toLowerCase().includes(word)
      );
      return hasActions ? 1 : 0;
    },
  },
],

// Reusable scorer with createScorer
const hasActionableInsights = createScorer<string, string>({
  name: "Actionable Insights",
  description: "Checks for actionable recommendations in the output",
  scorer: ({ output }) => {
    const actionWords = ["should", "recommend", "consider", "implement"];
    return actionWords.some(w => output.toLowerCase().includes(w)) ? 1 : 0;
  },
});

Real Example: Testing Survey Analysis


Here’s how we test our survey analysis AI:


packages/survey-ai/src/__evals__/analysis.eval.ts
import { evalite } from "evalite";
import { Factuality } from "autoevals";
import { analyzeSurvey } from "../analyze";
import { mockSurveyData } from "../__fixtures__/surveys";

evalite("Survey Analysis Quality", {
  data: [
    {
      input: mockSurveyData.customerSatisfaction,
      expected: "Analysis of customer satisfaction metrics with improvement suggestions",
    },
    {
      input: mockSurveyData.employeeEngagement,
      expected: "Analysis of employee engagement with segment breakdowns",
    },
  ],
  task: async (input) => {
    const result = await analyzeSurvey(input);
    return result.analysis;
  },
  scorers: [
    Factuality,
    // Custom: Check for quantitative insights
    {
      name: "Contains Metrics",
      description: "Output should include percentages or numeric data",
      scorer: ({ output }) => {
        const hasNumbers = /\d+%|\d+\.\d+|\d+ (respondents|participants)/.test(output);
        return hasNumbers ? 1 : 0;
      },
    },
    // Custom: Check for segment-specific analysis
    {
      name: "Segment Analysis",
      description: "Output should mention all input segments",
      scorer: ({ output, input }) => {
        const segments = input.segments || [];
        const mentionsAllSegments = segments.every(seg =>
          output.toLowerCase().includes(seg.toLowerCase())
        );
        return mentionsAllSegments ? 1 : 0;
      },
    },
  ],
});

The Evalite Workflow


First, add eval scripts to your package.json:


{
  "scripts": {
    "eval": "evalite",
    "eval:dev": "evalite watch"
  }
}

Then run your evals:


# Run evals in watch mode with UI
pnpm eval:dev
# Run once (for CI)
pnpm eval

The Evalite UI at http://localhost:3006 shows:


  • Score distribution across test cases
  • Individual output inspection with scorer breakdowns
  • Traces and logs from your AI calls
  • Historical trends stored in SQLite (node_modules/.evalite)

Integrating with Turborepo


We added evals as a separate task in our monorepo:


turbo.json
{
  "tasks": {
    "eval": {
      "dependsOn": ["^build"],
      "inputs": ["src/**/*.eval.ts", "src/__fixtures__/**"],
      "outputs": [".evalite/**"],
      "cache": false // AI outputs shouldn't be cached
    }
  }
}

Note cache: false. Unlike deterministic tests, we want evals to run fresh each time to catch prompt regressions.
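
With the task defined, evals can be scoped per package from the repo root using the usual Turborepo filter syntax. A quick sketch (the package name is one of those listed above):


# Run evals for a single package
pnpm turbo eval --filter=@bts/survey-ai

# Run evals across every package that defines an eval task
pnpm turbo eval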


The Scoring Philosophy


We adopted a tiered scoring approach:


Tier 1: Hard Requirements (Score = 0 or 1)


// Must pass these or the feature is broken
{
  name: "No Hallucinated Data",
  description: "Output must only reference data from input",
  scorer: ({ output, input }) => {
    // Check that output only references data actually in input
    const referencedIds = extractIds(output);
    const validIds = new Set(input.data.map(d => d.id));
    const allValid = referencedIds.every(id => validIds.has(id));
    return allValid ? 1 : 0;
  },
}

Tier 2: Quality Metrics (Score = 0 to 1)


// Graduated scoring for quality aspects
{
  name: "Insight Depth",
  description: "Measures explanation quality and detail",
  scorer: ({ output }) => {
    let score = 0;
    if (output.includes("because")) score += 0.25; // Explains reasoning
    if (output.includes("compared to")) score += 0.25; // Makes comparisons
    if (output.includes("recommend")) score += 0.25; // Provides actions
    if (output.length > 500) score += 0.25; // Sufficient detail
    return score;
  },
}

Tier 3: LLM-as-Judge (Semantic evaluation)


// Use another LLM to evaluate quality - returns score with metadata
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

{
  name: "Tone Consistency",
  description: "LLM judges if output maintains professional tone",
  scorer: async ({ output }) => {
    const { object } = await generateObject({
      model: openai("gpt-4o-mini"),
      schema: z.object({
        score: z.number().min(0).max(1),
        rationale: z.string(),
      }),
      prompt: `Rate the professional tone of this text (0-1): ${output}`,
    });
    // Return score with metadata for UI debugging
    return {
      score: object.score,
      metadata: { rationale: object.rationale },
    };
  },
}

CI Integration: Quality Gates


Evalite can export results as static HTML and set score thresholds. Configure in evalite.config.ts:


evalite.config.ts
import { defineConfig } from "evalite";

export default defineConfig({
  // Maximum time per test case (default: 30000ms)
  testTimeout: 60000,
  // Parallel test execution (default: 5)
  maxConcurrency: 3,
});

In CI, run the evals, then export and upload the results:


.github/workflows/eval.yml
- name: Run AI Evals
  run: pnpm eval

- name: Export Eval Results
  run: pnpm evalite export --output ./eval-report

- name: Upload Eval Results
  uses: actions/upload-artifact@v4
  with:
    name: evalite-results
    path: ./eval-report/

Lessons Learned


1. Start with Failure Cases


Before writing scorers, collect examples of bad outputs. What makes them bad? Turn those observations into scorers.
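
For example, if the bad outputs were cases where the model apologized or deflected instead of analyzing, that observation becomes a Tier 1 hard requirement. A minimal sketch using createScorer (the refusal phrases are illustrative assumptions, not from our suite):


import { createScorer } from "evalite";

// Failure case observed: the model sometimes deflects instead of analyzing.
// Turn that observation into a hard scorer.
const noRefusals = createScorer<string, string>({
  name: "No Refusals",
  description: "Output must analyze the data, not apologize or deflect",
  scorer: ({ output }) => {
    // Phrases below are illustrative assumptions for this sketch
    const refusalPhrases = ["i'm sorry", "i cannot", "as an ai", "not enough information"];
    const refused = refusalPhrases.some(p => output.toLowerCase().includes(p));
    return refused ? 0 : 1;
  },
});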


2. Version Your Prompts


When evals start failing, you need to know what changed. We version prompts alongside code:


packages/survey-ai/src/prompts/analyze.ts
export const ANALYZE_PROMPT_V3 = `...`;
export const ANALYZE_PROMPT_VERSION = "3.1.0";
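
One lightweight way to make the version visible is to bake it into the eval name, so score history in the Evalite UI lines up with prompt revisions. A sketch reusing the survey example from earlier (the import paths mirror the file layout shown above):


import { evalite } from "evalite";
import { Factuality } from "autoevals";
import { analyzeSurvey } from "../analyze";
import { mockSurveyData } from "../__fixtures__/surveys";
import { ANALYZE_PROMPT_VERSION } from "../prompts/analyze";

// Embedding the prompt version in the eval name ties each score run
// to a specific prompt revision when reviewing historical trends.
evalite(`Survey Analysis Quality (prompt v${ANALYZE_PROMPT_VERSION})`, {
  data: [
    {
      input: mockSurveyData.customerSatisfaction,
      expected: "Analysis of customer satisfaction metrics with improvement suggestions",
    },
  ],
  task: async (input) => (await analyzeSurvey(input)).analysis,
  scorers: [Factuality],
});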

3. Seed Your Randomness (When Possible)


Some LLM APIs support temperature and seed parameters. Use them for more consistent eval runs:


import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const result = await generateText({
  model: openai("gpt-4o-mini"),
  prompt: ANALYZE_PROMPT,
  temperature: 0.1, // Lower = more deterministic
  seed: 42, // Same seed = more reproducible (when the provider supports it)
});

4. Eval Early, Eval Often


Don’t wait until a feature is “done” to add evals. Write evals as you develop, just like TDD:


1. Write eval with expected behavior
2. Run eval (it fails or scores low)
3. Improve prompt/logic
4. Run eval (scores improve)
5. Repeat until quality threshold met

The Bigger Picture


Evalite represents a fundamental shift in how we think about testing:


Traditional Tests       AI Evals
Binary pass/fail        Continuous scores
Exact matching          Semantic evaluation
Deterministic           Probabilistic
Fast (<1ms)             Slower (API calls)
Cache aggressively      Run fresh

This isn’t replacing unit tests—it’s a new category alongside them. Our test pyramid now looks like:


        /\
       /  \  E2E Tests
      /----\
     /      \  Integration Tests
    /--------\
   /          \  Unit Tests
  /------------\
 /              \  AI Evals (new layer!)
/----------------\

Getting Started with Evalite


If you’re shipping AI features without systematic evaluation, start here:


  1. Install dependencies: pnpm add -D evalite vitest autoevals
  2. Create your first eval: Name it *.eval.ts
  3. Add basic scorers: Start with Levenshtein or Factuality from autoevals
  4. Run pnpm evalite watch: Explore results in the UI at localhost:3006
  5. Iterate: Add custom scorers for your domain-specific quality criteria

The goal isn’t 100% scores—it’s visibility into quality and confidence that prompt changes don’t regress.




Evalite completes our testing story - unit tests for logic, E2E tests for user journeys, and now evals for AI output quality. But there’s more to the AI-assisted workflow than testing.


Next up in Part 9: We’ll dive deep into CLAUDE.md - the documentation format that teaches AI assistants how to work effectively in your specific codebase.