
Testing LLM Outputs Systematically


This is Part 8 of our series on plan-based development with Claude Code. Today we explore a paradigm shift: testing AI outputs with Evalite and why traditional unit tests don’t cut it for LLM-powered features.


The Problem with Testing AI


Traditional unit tests have a fundamental assumption: given the same input, you get the same output. But LLMs are inherently non-deterministic. The same prompt can produce different (but equally valid) responses.


// Traditional test - this doesn't work for AI
describe("generateSummary", () => {
  it("returns expected summary", async () => {
    const result = await generateSummary("Long article text...");
    expect(result).toBe("Expected exact summary"); // ❌ Will fail randomly
  });
});

This creates a testing gap. We can’t use traditional assertions, but we also can’t ship AI features without confidence they work correctly.


Enter Evalite: LLM Evaluation Framework


Evalite is a testing framework designed specifically for LLM outputs. Built on Vitest, it feels familiar but uses scoring functions instead of assertions. Think of it as .eval.ts being the new .test.ts for AI code.


import { evalite } from "evalite";
import { Levenshtein } from "autoevals";

evalite("Summary Generator", {
  // Test data with input and expected
  data: [
    {
      input: "Long article about climate change...",
      expected: "Article discusses rising temperatures and policy responses",
    },
  ],
  // The AI function under test
  task: async (input) => {
    return await generateSummary(input);
  },
  // Scorers evaluate output quality (0-1)
  scorers: [Levenshtein],
});

The key insight: we’re not testing for exact output, we’re testing for quality output.


Why This Matters for Plan-Based Development


In our monorepo, we have over a dozen AI-powered features, including:


  • Survey analysis tools (@bts/survey-ai)
  • Results interpretation (@bts/results-ai)
  • Content generation (@bts/content-ai)
  • Code assistance features (@bts/highlighter-ai)

Before Evalite, testing these was a nightmare:


  1. Manual testing: Run the feature, read the output, decide if it’s “good enough”
  2. Snapshot testing: Capture outputs, but any regeneration breaks tests
  3. Regex assertions: Fragile, and they don't capture semantic correctness (see the sketch after this list)
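
For a sense of how brittle those approaches were, here's the kind of assertion we used to write (a hypothetical Vitest sketch against the same generateSummary function shown above):


// Pre-Evalite: assert on surface patterns instead of meaning
describe("generateSummary (regex assertions)", () => {
  it("mentions policy and looks long enough", async () => {
    const result = await generateSummary("Long article about climate change...");
    // Passes or fails on wording, not on whether the summary is actually good
    expect(result).toMatch(/polic(y|ies)/i);
    expect(result.length).toBeGreaterThan(100); // arbitrary proxy for "enough detail"
  });
});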

With Evalite, we have systematic, reproducible quality metrics.


Anatomy of an Evalite Test


The Data: Input-Expected Pairs


data: [
  {
    input: {
      surveyResponses: mockResponses,
      segments: ["gen-z", "millennials"],
    },
    expected: "Analysis should cover both segments with actionable insights",
  },
],

The data array contains test cases. Each has an input (passed to your task) and expected (the ground truth scorers compare against). Unlike unit tests where expected is the exact output, here it’s reference material for semantic comparison.


The Task: Your AI Function


task: async (input) => {
  const analysis = await analyzeSurveyWithAI({
    responses: input.surveyResponses,
    segments: input.segments,
  });
  return analysis.summary;
},

This is the actual AI-powered function you’re testing. Evalite runs it and captures the output for scoring.


The Scorers: Quality Metrics


Scorers receive input, output, and expected, returning a score from 0 to 1. You can use built-in scorers from autoevals or create custom ones:


import { Factuality } from "autoevals";
import { createScorer } from "evalite";

// Built-in scorers from autoevals
scorers: [
  Factuality, // LLM-as-judge for factual consistency
],

// Inline custom scorer
scorers: [
  {
    name: "Contains Action Items",
    description: "Checks if output includes actionable recommendations",
    scorer: ({ output }) => {
      const actionWords = ["should", "recommend", "consider", "implement"];
      const hasActions = actionWords.some(word =>
        output.toLowerCase().includes(word)
      );
      return hasActions ? 1 : 0;
    },
  },
],

// Reusable scorer with createScorer
const hasActionableInsights = createScorer<string, string>({
  name: "Actionable Insights",
  description: "Checks for actionable recommendations in the output",
  scorer: ({ output }) => {
    const actionWords = ["should", "recommend", "consider", "implement"];
    return actionWords.some(w => output.toLowerCase().includes(w)) ? 1 : 0;
  },
});

Real Example: Testing Survey Analysis


Here’s how we test our survey analysis AI:


packages/survey-ai/src/__evals__/analysis.eval.ts
import { evalite } from "evalite";
import { Factuality } from "autoevals";
import { analyzeSurvey } from "../analyze";
import { mockSurveyData } from "../__fixtures__/surveys";

evalite("Survey Analysis Quality", {
  data: [
    {
      input: mockSurveyData.customerSatisfaction,
      expected: "Analysis of customer satisfaction metrics with improvement suggestions",
    },
    {
      input: mockSurveyData.employeeEngagement,
      expected: "Analysis of employee engagement with segment breakdowns",
    },
  ],
  task: async (input) => {
    const result = await analyzeSurvey(input);
    return result.analysis;
  },
  scorers: [
    Factuality,
    // Custom: Check for quantitative insights
    {
      name: "Contains Metrics",
      description: "Output should include percentages or numeric data",
      scorer: ({ output }) => {
        const hasNumbers = /\d+%|\d+\.\d+|\d+ (respondents|participants)/.test(output);
        return hasNumbers ? 1 : 0;
      },
    },
    // Custom: Check for segment-specific analysis
    {
      name: "Segment Analysis",
      description: "Output should mention all input segments",
      scorer: ({ output, input }) => {
        const segments = input.segments || [];
        const mentionsAllSegments = segments.every(seg =>
          output.toLowerCase().includes(seg.toLowerCase())
        );
        return mentionsAllSegments ? 1 : 0;
      },
    },
  ],
});

The Evalite Workflow


First, add eval scripts to your package.json:


{
  "scripts": {
    "eval": "evalite",
    "eval:dev": "evalite watch"
  }
}

Then run your evals:


# Run evals in watch mode with UI
pnpm eval:dev
# Run once (for CI)
pnpm eval

The Evalite UI at http://localhost:3006 shows:


  • Score distribution across test cases
  • Individual output inspection with scorer breakdowns
  • Traces and logs from your AI calls
  • Historical trends stored in SQLite (node_modules/.evalite)

Integrating with Turborepo


We added evals as a separate task in our monorepo:


turbo.json
{
  "tasks": {
    "eval": {
      "dependsOn": ["^build"],
      "inputs": ["src/**/*.eval.ts", "src/__fixtures__/**"],
      "outputs": [".evalite/**"],
      "cache": false // AI outputs shouldn't be cached
    }
  }
}

Note cache: false. Unlike deterministic tests, we want evals to run fresh each time to catch prompt regressions.
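
With the task defined, evals can be scoped per package from the repo root using the usual Turborepo filter syntax. A quick sketch (the package name is one of those listed above):


# Run evals for a single package
pnpm turbo eval --filter=@bts/survey-ai

# Run evals across every package that defines an eval task
pnpm turbo eval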


The Scoring Philosophy


We adopted a tiered scoring approach:


Tier 1: Hard Requirements (Score = 0 or 1)


// Must pass these or the feature is broken
{
  name: "No Hallucinated Data",
  description: "Output must only reference data from input",
  scorer: ({ output, input }) => {
    // Check that output only references data actually in input
    const referencedIds = extractIds(output);
    const validIds = new Set(input.data.map(d => d.id));
    const allValid = referencedIds.every(id => validIds.has(id));
    return allValid ? 1 : 0;
  },
}

Tier 2: Quality Metrics (Score = 0 to 1)


// Graduated scoring for quality aspects
{
  name: "Insight Depth",
  description: "Measures explanation quality and detail",
  scorer: ({ output }) => {
    let score = 0;
    if (output.includes("because")) score += 0.25; // Explains reasoning
    if (output.includes("compared to")) score += 0.25; // Makes comparisons
    if (output.includes("recommend")) score += 0.25; // Provides actions
    if (output.length > 500) score += 0.25; // Sufficient detail
    return score;
  },
}

Tier 3: LLM-as-Judge (Semantic evaluation)


// Use another LLM to evaluate quality - returns score with metadata
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

{
  name: "Tone Consistency",
  description: "LLM judges if output maintains professional tone",
  scorer: async ({ output }) => {
    const { object } = await generateObject({
      model: openai("gpt-4o-mini"),
      schema: z.object({
        score: z.number().min(0).max(1),
        rationale: z.string(),
      }),
      prompt: `Rate the professional tone of this text (0-1): ${output}`,
    });
    // Return score with metadata for UI debugging
    return {
      score: object.score,
      metadata: { rationale: object.rationale },
    };
  },
}

CI Integration: Quality Gates


Evalite can export results as static HTML and set score thresholds. Configure in evalite.config.ts:


evalite.config.ts
import { defineConfig } from "evalite";

export default defineConfig({
  // Maximum time per test case (default: 30000ms)
  testTimeout: 60000,
  // Parallel test execution (default: 5)
  maxConcurrency: 3,
});

In CI, run the evals, then export and upload the results:


.github/workflows/eval.yml
- name: Run AI Evals
  run: pnpm eval

- name: Export Eval Results
  run: pnpm evalite export --output ./eval-report

- name: Upload Eval Results
  uses: actions/upload-artifact@v4
  with:
    name: evalite-results
    path: ./eval-report/

Lessons Learned


1. Start with Failure Cases


Before writing scorers, collect examples of bad outputs. What makes them bad? Turn those observations into scorers.
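
For example, if the bad outputs were cases where the model apologized or deflected instead of analyzing, that observation becomes a Tier 1 hard requirement. A minimal sketch using createScorer (the refusal phrases are illustrative assumptions, not from our suite):


import { createScorer } from "evalite";

// Failure case observed: the model sometimes deflects instead of analyzing.
// Turn that observation into a hard scorer.
const noRefusals = createScorer<string, string>({
  name: "No Refusals",
  description: "Output must analyze the data, not apologize or deflect",
  scorer: ({ output }) => {
    // Phrases below are illustrative assumptions for this sketch
    const refusalPhrases = ["i'm sorry", "i cannot", "as an ai", "not enough information"];
    const refused = refusalPhrases.some(p => output.toLowerCase().includes(p));
    return refused ? 0 : 1;
  },
});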


2. Version Your Prompts


When evals start failing, you need to know what changed. We version prompts alongside code:


packages/survey-ai/src/prompts/analyze.ts
export const ANALYZE_PROMPT_V3 = `...`;
export const ANALYZE_PROMPT_VERSION = "3.1.0";
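
One lightweight way to make the version visible is to bake it into the eval name, so score history in the Evalite UI lines up with prompt revisions. A sketch reusing the survey example from earlier (the import paths mirror the file layout shown above):


import { evalite } from "evalite";
import { Factuality } from "autoevals";
import { analyzeSurvey } from "../analyze";
import { mockSurveyData } from "../__fixtures__/surveys";
import { ANALYZE_PROMPT_VERSION } from "../prompts/analyze";

// Embedding the prompt version in the eval name ties each score run
// to a specific prompt revision when reviewing historical trends.
evalite(`Survey Analysis Quality (prompt v${ANALYZE_PROMPT_VERSION})`, {
  data: [
    {
      input: mockSurveyData.customerSatisfaction,
      expected: "Analysis of customer satisfaction metrics with improvement suggestions",
    },
  ],
  task: async (input) => (await analyzeSurvey(input)).analysis,
  scorers: [Factuality],
});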

3. Seed Your Randomness (When Possible)


Some LLM APIs support temperature and seed parameters. Use them for more consistent eval runs:


import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const result = await generateText({
  model: openai("gpt-4o-mini"),
  prompt: ANALYZE_PROMPT,
  temperature: 0.1, // Lower = more deterministic
  seed: 42, // Same seed = more reproducible (when the provider supports it)
});

4. Eval Early, Eval Often


Don’t wait until a feature is “done” to add evals. Write evals as you develop, just like TDD:


1. Write eval with expected behavior
2. Run eval (it fails or scores low)
3. Improve prompt/logic
4. Run eval (scores improve)
5. Repeat until quality threshold met

The Bigger Picture


Evalite represents a fundamental shift in how we think about testing:


Traditional Tests       AI Evals
Binary pass/fail        Continuous scores
Exact matching          Semantic evaluation
Deterministic           Probabilistic
Fast (<1ms)             Slower (API calls)
Cache aggressively      Run fresh

This isn’t replacing unit tests—it’s a new category alongside them. Our test pyramid now looks like:


        /\
       /  \  E2E Tests
      /----\
     /      \  Integration Tests
    /--------\
   /          \  Unit Tests
  /------------\
 /              \  AI Evals (new layer!)
/----------------\

Getting Started with Evalite


If you’re shipping AI features without systematic evaluation, start here:


  1. Install dependencies: pnpm add -D evalite vitest autoevals
  2. Create your first eval: Name it *.eval.ts
  3. Add basic scorers: Start with Levenshtein or Factuality from autoevals
  4. Run pnpm evalite watch: Explore results in the UI at localhost:3006
  5. Iterate: Add custom scorers for your domain-specific quality criteria

The goal isn’t 100% scores—it’s visibility into quality and confidence that prompt changes don’t regress.




Evalite completes our testing story - unit tests for logic, E2E tests for user journeys, and now evals for AI output quality. But there’s more to the AI-assisted workflow than testing.


Next up in Part 9: We’ll dive deep into CLAUDE.md - the documentation format that teaches AI assistants how to work effectively in your specific codebase.